the "spike" is not a spike, it's a simple cache mechanism that is designed to reduce API calls and sends an API request once 100 events are cached
yes, we realized that later. this synchronous pause is enough to 4x the training time for this model. for a logging library, I think it's fair to call that catastrophic...
what would be the impact if we changed the flush logic to just return instead of sleep(0.1)? can the queue hold arbitrarily many events in its cache without failing?
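to make the question concrete, here's a rough sketch of the kind of flush behavior I mean - this is just a toy stand-in, not ClearML's actual code (BatchReporter, _send_loop and _outbox are made-up names):

```python
import queue
import threading
import time

BATCH_SIZE = 100  # an API request is sent once this many events are cached


class BatchReporter:
    """Toy stand-in for a batching event reporter (not ClearML's real code)."""

    def __init__(self, blocking_flush=True):
        self.blocking_flush = blocking_flush
        self._cache = []                 # events cached on the caller's side
        self._outbox = queue.Queue()     # batches handed to the sender thread
        self._sender = threading.Thread(target=self._send_loop, daemon=True)
        self._sender.start()

    def report(self, event):
        self._cache.append(event)
        if len(self._cache) >= BATCH_SIZE:
            self._outbox.put(self._cache)
            self._cache = []
            if self.blocking_flush:
                # current behaviour as I understand it: poll until the batch
                # has actually been handed off, sleeping 0.1 s between checks -
                # this is the synchronous pause that shows up in the step time
                while not self._outbox.empty():
                    time.sleep(0.1)
            # proposed behaviour: just return and let the sender thread drain
            # the outbox in the background; the open question is whether the
            # outbox can safely grow unbounded while the sender catches up

    def _send_loop(self):
        while True:
            self._outbox.get()   # take the next batch...
            time.sleep(0.05)     # ...and pretend to send it (placeholder API call)


if __name__ == "__main__":
    for blocking in (True, False):
        reporter = BatchReporter(blocking_flush=blocking)
        t0 = time.perf_counter()
        for step in range(300):
            reporter.report({"step": step, "loss": 1.0 / (step + 1)})
        dt = time.perf_counter() - t0
        print(f"blocking_flush={blocking}: 300 reports took {dt:.3f} s")
```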
DepressedMonkey10 "catastrophic" seems like quite a drastic characterization - as explained in the GitHub issue, the "spike" is not a spike, it's a simple cache mechanism that is designed to reduce API calls and sends an API request once 100 events are cached
Is it possible that your training steps are that inefficient?
I don't think this is a fair description of the events. our training code is efficient; ClearML is not. we get north of 650 TFLOPs, which is well beyond what most organizations report
sdk.development.report_use_subprocess = False
Time: 0.00073 seconds (before, was 6.5 ms)
this mode is 9x faster. I assume the gain is entirely from skipping the _fast_is_subprocess_alive check
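for reference, a minimal harness along these lines reproduces this kind of per-call timing (project/task names are placeholders, and sdk.development.report_use_subprocess is assumed to be toggled in clearml.conf between runs):

```python
import time

from clearml import Task

# assumes a configured clearml.conf; flip sdk.development.report_use_subprocess
# between runs to compare the subprocess and in-process reporting modes
task = Task.init(project_name="perf-debug", task_name="report_scalar timing")  # placeholder names
logger = task.get_logger()

samples = []
for i in range(1000):
    t0 = time.perf_counter()
    logger.report_scalar(title="loss", series="train", value=0.1, iteration=i)
    samples.append(time.perf_counter() - t0)

samples.sort()
print(f"median per-call time: {samples[len(samples) // 2] * 1000:.3f} ms")
task.close()
```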
in case this function is taking an unusual amount of time compared to your expectations: the system we're using is CW's standard HGX node, with a Xeon Platinum 8462Y+ and 2 TB of RAM
most of the remaining overhead comes from ev = ScalarEvent(...) - 3.6 ms out of 5.7 ms total, for a normal-sized model. I'll look into that later.
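if anyone wants to reproduce that breakdown, a quick cProfile pass over the report call is one way to see it - a sketch, with the caveat that internal frame names like the ScalarEvent constructor depend on the installed clearml version:

```python
import cProfile
import pstats

from clearml import Task

task = Task.init(project_name="perf-debug", task_name="report_scalar profile")  # placeholder names
logger = task.get_logger()

# profile a burst of scalar reports to see where the synchronous time goes
profiler = cProfile.Profile()
profiler.enable()
for i in range(200):
    logger.report_scalar(title="loss", series="train", value=0.1, iteration=i)
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(30)  # heaviest frames; look for the event construction path
task.close()
```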
the existing _touch_title_series is faster than my proposed get() method. benchmark script:
the CPU and RAM usage are also not important to us; only the synchronous blocking time is
removing self._start_task_if_needed() gives a tiny speedup, about 0.5%
I'll try the config changes you suggested on GitHub, thanks
would it be convenient if I waited until Monday to push further details? after your fix, the remaining stuff is not pressing, so I would prefer not to bother you folks in off-business hours
I tested how many resources ClearML consumes. The entire ClearML SDK process consumes about 50 MB of RAM on my side and requires a minimal amount of CPU.
Is it possible that your training steps are that inefficient?
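(for reference, one way to sanity-check this kind of RAM/CPU footprint is with psutil - a sketch, not necessarily how it was measured above:)

```python
import os

import psutil
from clearml import Task

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss

task = Task.init(project_name="perf-debug", task_name="resource check")  # placeholder names
logger = task.get_logger()
for i in range(1000):
    logger.report_scalar(title="loss", series="train", value=0.1, iteration=i)

# RSS growth of this process plus any reporting child processes the SDK spawned
rss_after = proc.memory_info().rss
children_rss = sum(c.memory_info().rss for c in proc.children(recursive=True))
cpu_percent = proc.cpu_percent(interval=1.0)

print(f"main process RSS growth: {(rss_after - rss_before) / 2**20:.1f} MiB")
print(f"child processes RSS:     {children_rss / 2**20:.1f} MiB")
print(f"main process CPU:        {cpu_percent:.1f}%")
task.close()
```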