the "spike" is not a spike, it's a simple cache mechanism that is designed to reduce API calls and sends an API request once 100 events are cached
yes, we realized that later. this synchronous pause is enough to 4x the training time for this model. for a logging library, I think it's fair to call that catastrophic...
what would be the impact if we changed the flush logic to just return instead of sleep(0.1)? can the queue hold arbitrarily many events in its cache without failing?
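to make the question concrete, here's a rough sketch of the kind of flush behavior I mean - this is just a toy stand-in, not ClearML's actual code (BatchReporter, _send_loop and _outbox are made-up names):

```python
import queue
import threading
import time

BATCH_SIZE = 100  # an API request is sent once this many events are cached


class BatchReporter:
    """Toy stand-in for a batching event reporter (not ClearML's real code)."""

    def __init__(self, blocking_flush=True):
        self.blocking_flush = blocking_flush
        self._cache = []                 # events cached on the caller's side
        self._outbox = queue.Queue()     # batches handed to the sender thread
        self._sender = threading.Thread(target=self._send_loop, daemon=True)
        self._sender.start()

    def report(self, event):
        self._cache.append(event)
        if len(self._cache) >= BATCH_SIZE:
            self._outbox.put(self._cache)
            self._cache = []
            if self.blocking_flush:
                # current behaviour as I understand it: poll until the batch
                # has actually been handed off, sleeping 0.1 s between checks -
                # this is the synchronous pause that shows up in the step time
                while not self._outbox.empty():
                    time.sleep(0.1)
            # proposed behaviour: just return and let the sender thread drain
            # the outbox in the background; the open question is whether the
            # outbox can safely grow unbounded while the sender catches up

    def _send_loop(self):
        while True:
            self._outbox.get()   # take the next batch...
            time.sleep(0.05)     # ...and pretend to send it (placeholder API call)


if __name__ == "__main__":
    for blocking in (True, False):
        reporter = BatchReporter(blocking_flush=blocking)
        t0 = time.perf_counter()
        for step in range(300):
            reporter.report({"step": step, "loss": 1.0 / (step + 1)})
        dt = time.perf_counter() - t0
        print(f"blocking_flush={blocking}: 300 reports took {dt:.3f} s")
```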
DepressedMonkey10 "catastrophic" seems like quite a drastic characterization - as explained in the GitHub issue, the "spike" is not a spike, it's a simple cache mechanism that is designed to reduce API calls and sends an API request once 100 events are cached
Is it possible that your training steps are that inefficient?
I don't think this is a fair description of the events. our training code is efficient; ClearML is not. we get north of 650 TFLOPs, which is well beyond what most organizations report
sdk.development.report_use_subprocess = False
Time: 0.00073 seconds (before, was 6.5 ms)
this mode is 9x faster. I assume the gain is entirely from skipping the _fast_is_subprocess_alive check
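for reference, a minimal harness along these lines reproduces this kind of per-call timing (project/task names are placeholders, and sdk.development.report_use_subprocess is assumed to be toggled in clearml.conf between runs):

```python
import time

from clearml import Task

# assumes a configured clearml.conf; flip sdk.development.report_use_subprocess
# between runs to compare the subprocess and in-process reporting modes
task = Task.init(project_name="perf-debug", task_name="report_scalar timing")  # placeholder names
logger = task.get_logger()

samples = []
for i in range(1000):
    t0 = time.perf_counter()
    logger.report_scalar(title="loss", series="train", value=0.1, iteration=i)
    samples.append(time.perf_counter() - t0)

samples.sort()
print(f"median per-call time: {samples[len(samples) // 2] * 1000:.3f} ms")
task.close()
```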
in case this function is taking an unusual amount of time compared to your expectations: the system we're using is CW's standard HGX node, with a Xeon Platinum 8462Y+ and 2 TB of RAM
most of the remaining overhead comes from ev = ScalarEvent(...) - 3.6 ms out of 5.7 ms total, for a normal-sized model. I'll look into that later.
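if anyone wants to reproduce that breakdown, a quick cProfile pass over the report call is one way to see it - a sketch, with the caveat that internal frame names like the ScalarEvent constructor depend on the installed clearml version:

```python
import cProfile
import pstats

from clearml import Task

task = Task.init(project_name="perf-debug", task_name="report_scalar profile")  # placeholder names
logger = task.get_logger()

# profile a burst of scalar reports to see where the synchronous time goes
profiler = cProfile.Profile()
profiler.enable()
for i in range(200):
    logger.report_scalar(title="loss", series="train", value=0.1, iteration=i)
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(30)  # heaviest frames; look for the event construction path
task.close()
```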
the existing _touch_title_series is faster than my proposed get() method. benchmark script:
the CPU and RAM usage are also not important to us; only the synchronous blocking time is
removing self._start_task_if_needed() gives a tiny speedup, about 0.5%
I'll try the config changes you suggested on GitHub, thanks
would it be convenient if I waited until Monday to push further details? after your fix, the remaining stuff is not pressing, so I would prefer not to bother you folks in off-business hours
I tested how many resources ClearML consumes. The entire ClearML SDK process consumes about 50 MB of RAM on my side and requires a minimal amount of CPU.
Is it possible that your training steps are that inefficient?
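(for reference, one way to sanity-check this kind of RAM/CPU footprint is with psutil - a sketch, not necessarily how it was measured above:)

```python
import os

import psutil
from clearml import Task

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss

task = Task.init(project_name="perf-debug", task_name="resource check")  # placeholder names
logger = task.get_logger()
for i in range(1000):
    logger.report_scalar(title="loss", series="train", value=0.1, iteration=i)

# RSS growth of this process plus any reporting child processes the SDK spawned
rss_after = proc.memory_info().rss
children_rss = sum(c.memory_info().rss for c in proc.children(recursive=True))
cpu_percent = proc.cpu_percent(interval=1.0)

print(f"main process RSS growth: {(rss_after - rss_before) / 2**20:.1f} MiB")
print(f"child processes RSS:     {children_rss / 2**20:.1f} MiB")
print(f"main process CPU:        {cpu_percent:.1f}%")
task.close()
```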