DepressedMonkey10

2 Questions, 12 Answers

Active since 23 February 2025

Last activity 9 months ago

Reputation

Badges 1

12 × Eureka!


0 Votes

3 Answers

592 Views


My Linter (The Default VS Code Python Extension) Doesn't Understand

my linter (the default VS Code Python extension) doesn't understand TaskInstance = TypeVar("TaskInstance", bound="Task"). if I type clearml.Task.current_task...

clearml

9 months ago
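
For readers unfamiliar with the declaration quoted in the question, here is a minimal sketch of how a bound TypeVar with a string forward reference is commonly tied to a classmethod so that a type checker can infer the concrete class. The Task and MyTask classes below are hypothetical stand-ins, not the clearml source, and whether clearml annotates its methods this way is not confirmed by the thread.

```python
from typing import Optional, Type, TypeVar

# Same declaration as in the question. "Task" is a forward reference that the
# type checker resolves; nothing is looked up at runtime.
TaskInstance = TypeVar("TaskInstance", bound="Task")


class Task:
    """Hypothetical stand-in class, not clearml.Task."""

    _current: Optional["Task"] = None

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def init(cls: Type[TaskInstance], name: str) -> TaskInstance:
        # Annotating `cls` with the TypeVar lets a checker infer that
        # MyTask.init(...) returns MyTask rather than the base Task.
        task = cls(name)
        Task._current = task
        return task

    @classmethod
    def current_task(cls) -> Optional["Task"]:
        # Returns whichever task was created last, or None.
        return Task._current


class MyTask(Task):
    pass


task = MyTask.init("demo")
print(type(task).__name__, task.name)
```

When the TypeVar appears only in the return annotation and nowhere in the parameters, a checker has nothing to solve it against, which is one possible (unverified) reason an editor might fail to offer completions.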

0 Votes

12 Answers

785 Views


Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

has anyone else benchmarked ClearML? I'm seeing catastrophic logging overhead: for 50 numbers, ClearML's logging takes 3x as long as my entire LLM train...

clearml

9 months ago

0 My Linter (The Default VS Code Python Extension) Doesn't Understand

this is an example of lint not working. setting the name is not the concern

9 months ago

0 My Linter (The Default VS Code Python Extension) Doesn't Understand

this issue is minor though. even without lint suggestions, everything works fine, it's just a bit slower to work with. since my environment is vanilla, I expect many other people to experience the same thing, unless there is something I'm doing wrong in particular

9 months ago

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

removing self._start_task_if_needed() gives a tiny speedup, 0.5%

9 months ago

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

sdk.development.report_use_subprocess = False
Time: 0.00073 seconds (0.73 ms; before, it was 6.5 ms)
this mode is 9x faster. I assume the gain comes entirely from skipping the _fast_is_subprocess_alive check

9 months ago
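
A minimal sketch of how per-call blocking time like the figures above could be measured. The thread does not include the actual benchmark script, so everything here is an assumption: fake_report_scalar is a hypothetical stand-in that just sleeps for about 1 ms, and it would need to be swapped for the real logging call to reproduce anything.

```python
import time
from statistics import median
from typing import Callable, List


def blocking_time_ms(fn: Callable[[int], None], calls: int = 50) -> List[float]:
    """Return the wall-clock duration of each call to fn, in milliseconds."""
    samples = []
    for i in range(calls):
        start = time.perf_counter()
        fn(i)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def fake_report_scalar(iteration: int) -> None:
    """Hypothetical stand-in for a logging call; replace with the real one."""
    time.sleep(0.001)


samples = blocking_time_ms(fake_report_scalar)
print(f"median {median(samples):.3f} ms, max {max(samples):.3f} ms "
      f"over {len(samples)} calls")
```

Reporting the maximum alongside the median matters here, because an occasional long synchronous pause can dominate total time even when the typical call is fast.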

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

most of the remaining overhead comes from ev = ScalarEvent(...). I'll look into that later. 3.6 ms out of 5.7 ms total, for a normal-sized model

9 months ago

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

in case this function is taking an unusual amount of time compared to your expectations: the system we're using is CW's standard HGX node, with a Xeon Platinum 8462Y+ and 2 TB RAM

9 months ago

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

the CPU and RAM are also not important to us, only the synchronous blocking time

9 months ago

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

I'll try the config changes you suggested on GitHub, thanks

9 months ago

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

would it be convenient if I waited until Monday to push further details? after your fix, the remaining stuff is not pressing, so I would prefer not to bother you folks in off-business hours

9 months ago

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

the existing _touch_title_series is faster than my proposed get() method. benchmark script:

9 months ago

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

the "spike" is not a spike, it's a simple cache mechanism that is designed to reduce API calls and sends an API request once 100 events are cached

yes, we realized that later. this synchronous pause is enough to 4x the training time for this model. for a logging library, I think it's fair to call that catastrophic...
what would be the impact if we changed the flush logic to return instead of calling sleep(0.1)? can the queue have arbitrarily many events in its cache without...

9 months ago
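
To make the trade-off in the post above concrete, here is a small sketch contrasting a reporter that sends its cache synchronously once 100 events accumulate with one that only enqueues and lets a worker thread do the sending. This is not ClearML's implementation: both classes, the send_batch helper, and the 50 ms request latency are made up for illustration; only the batch size of 100 comes from the thread.

```python
import queue
import threading
import time
from typing import List

BATCH_SIZE = 100        # flush threshold quoted in the thread
SEND_LATENCY_S = 0.05   # assumed cost of one API request (made-up value)


def send_batch(batch: List[int], sent: List[List[int]]) -> None:
    """Hypothetical stand-in for the network request."""
    time.sleep(SEND_LATENCY_S)
    sent.append(batch)


class SyncReporter:
    """The caller pays for the request on every BATCH_SIZE-th event."""

    def __init__(self) -> None:
        self.cache: List[int] = []
        self.sent: List[List[int]] = []

    def report(self, event: int) -> None:
        self.cache.append(event)
        if len(self.cache) >= BATCH_SIZE:
            batch, self.cache = self.cache, []
            send_batch(batch, self.sent)  # blocks the calling loop


class BackgroundReporter:
    """The caller only enqueues; a worker thread batches and sends."""

    def __init__(self) -> None:
        self.events = queue.Queue()  # unbounded in this sketch
        self.sent: List[List[int]] = []
        self._stop = object()  # sentinel that tells the worker to exit
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, event: int) -> None:
        self.events.put(event)  # returns without waiting for the send

    def _run(self) -> None:
        batch: List[int] = []
        while True:
            item = self.events.get()
            if item is self._stop:
                break
            batch.append(item)
            if len(batch) >= BATCH_SIZE:
                send_batch(batch, self.sent)
                batch = []
        if batch:
            send_batch(batch, self.sent)  # flush the remainder on close

    def close(self) -> None:
        self.events.put(self._stop)
        self._worker.join()


def time_reports(reporter, n: int = 300) -> float:
    """Time n consecutive report() calls, in seconds."""
    start = time.perf_counter()
    for i in range(n):
        reporter.report(i)
    return time.perf_counter() - start


sync = SyncReporter()
sync_s = time_reports(sync)
background = BackgroundReporter()
background_s = time_reports(background)
background.close()
print(f"sync: {sync_s:.3f} s, background: {background_s:.3f} s")
```

The background variant keeps the caller fast, but because its queue is unbounded, memory grows without limit if events arrive faster than they can be sent. That is the same concern the question raises about the cache, and this sketch says nothing about how ClearML itself handles it.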

0 Has Anyone Else Benchmarked ClearML? I'm Seeing Catastrophic Logging Overhead:

Is it possible that your training steps are that inefficient?

I don't think this is a fair description of the events. our training code is efficient; ClearML is not. we get north of 650 TFLOPS, which is well beyond what most organizations report

9 months ago