Has anyone else benchmarked ClearML? I'm seeing catastrophic logging overhead

Has anyone else benchmarked ClearML? I'm seeing catastrophic logging overhead:
for 50 numbers, ClearML's logging takes 3x as long as my entire LLM training step
even when the spikes don't occur, ClearML is 17% of the step time, 6x-16x slower than other logging services
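
For context, the shape of the benchmark (a minimal sketch, assuming a configured ClearML setup; report_scalar is ClearML's standard scalar-logging call, but the project/task names and loop structure here are illustrative, not the original harness):

import time
from clearml import Task

# Sketch: time how long reporting 50 scalars blocks the training loop
# on each step, to measure the synchronous logging overhead.
task = Task.init(project_name="bench", task_name="logging-overhead")
logger = task.get_logger()

for step in range(100):
    # ... the actual training step would run here ...
    t0 = time.perf_counter()
    for i in range(50):
        logger.report_scalar(title="metrics", series=f"m{i}", value=0.0, iteration=step)
    print(f"step {step}: logging took {(time.perf_counter() - t0) * 1e3:.2f} ms")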

Posted one month ago

Answers 12


DepressedMonkey10
"catastrophic" seems quite a drastic characterization - as explained in the GitHub issue, the "spike" is not a spike, it's a simple cache mechanism that is designed to reduce API calls and sends an API request once 100 events are cached
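
In other words, the pattern is event batching (a generic sketch of the idea, not ClearML's actual reporter code):

# Generic sketch of the batching pattern described above: events accumulate
# in a local buffer, and one API request is made per 100 events, so every
# 100th report pays the cost of the network round trip.
FLUSH_THRESHOLD = 100

class BatchingReporter:
    def __init__(self, send_batch):
        self._buffer = []
        self._send_batch = send_batch  # callable that performs the API request

    def report(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= FLUSH_THRESHOLD:
            self._send_batch(self._buffer)  # the periodic "spike"
            self._buffer = []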

Posted one month ago

> Is it possible that your training steps are that inefficient?

I don't think this is a fair description of events. Our training code is efficient; ClearML is not. We get north of 650 TFLOPS, which is well beyond what most organizations report.

Posted one month ago

In case this function is taking an unusual amount of time compared to your expectations: the system we're using is CW's standard HGX node, with a Xeon Platinum 8462Y+ and 2 TB of RAM.

Posted one month ago

Most of the remaining overhead comes from ev = ScalarEvent(...): 3.6 ms out of 5.7 ms total, for a normal-sized model. I'll look into that later.
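
This kind of attribution can be reproduced with cProfile (an illustrative sketch; the project/task names are placeholders):

import cProfile
import pstats
from clearml import Task

# Sketch: profile a burst of scalar reports to see which internal calls
# (e.g. event construction) dominate the synchronous time.
task = Task.init(project_name="bench", task_name="profile-logging")
logger = task.get_logger()

profiler = cProfile.Profile()
profiler.enable()
for i in range(50):
    logger.report_scalar(title="metrics", series=f"m{i}", value=0.0, iteration=0)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)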

Posted one month ago

CPU and RAM usage are also not important to us; only the synchronous blocking time matters.

Posted one month ago

Would it be convenient if I waited until Monday to push further details? After your fix, the remaining issues are not pressing, so I'd prefer not to bother you folks outside business hours.

Posted one month ago

I tested how many resources ClearML consumes. The entire ClearML SDK process consumes about 50 MB of RAM on my side, and it requires a minimal amount of CPU.

Is it possible that your training steps are that inefficient?
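
A resource check along those lines is easy to reproduce with psutil (an illustrative sketch; the project/task names are placeholders, and this measures the current process only):

import psutil
from clearml import Task

# Sketch: report the current process's resident memory and CPU usage
# after the ClearML SDK has initialized.
task = Task.init(project_name="bench", task_name="resource-check")
proc = psutil.Process()
print(f"RSS: {proc.memory_info().rss / 1e6:.1f} MB")
print(f"CPU: {proc.cpu_percent(interval=1.0):.1f} %")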

Posted one month ago

sdk.development.report_use_subprocess = False
Time: 0.00073 seconds (before, it was 6.5 ms)
This mode is 9x faster, I assume entirely from skipping the _fast_is_subprocess_alive check.

Posted one month ago

The existing _touch_title_series is faster than my proposed get() method. Benchmark script:
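
A comparison of that shape might look like the following (hypothetical stand-ins for the two access patterns; the real methods are ClearML reporter internals and are not reproduced here):

import timeit

cache = {}

def touch_title_series(title, series):
    # stand-in for the existing setdefault-style "touch" of a nested dict
    cache.setdefault(title, {}).setdefault(series, 0)

def get_title_series(title, series):
    # stand-in for the proposed get()-with-fallback access
    return cache.get(title, {}).get(series, 0)

for fn in (touch_title_series, get_title_series):
    t = timeit.timeit(lambda: fn("loss", "train"), number=1_000_000)
    print(f"{fn.__name__}: {t:.3f} s per 1e6 calls")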

Posted one month ago

Removing self._start_task_if_needed() gives a tiny speedup, 0.5%.

Posted one month ago

> the "spike" is not a spike, it's a simple cache mechanism that is designed to reduce API calls and sends an API request once 100 events are cached

Yes, we realized that later. This synchronous pause is enough to 4x the training time for this model; for a logging library, I think it's fair to call that catastrophic...
What would be the impact if we changed the flush logic to return immediately instead of sleep(0.1)? Can the queue hold arbitrarily many events in its cache without failing?
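
The proposed change, sketched generically (not ClearML's actual flush code; the names here are illustrative):

import queue

events = queue.Queue()  # unbounded: grows with pending events

def report_nonblocking(event):
    # proposed behavior: enqueue and return immediately, rather than
    # sleeping ~0.1 s while the background sender drains the queue
    events.put_nowait(event)
    # a background thread would batch and send from `events`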

Posted one month ago

I'll try the config changes you suggested on GitHub, thanks.

Posted one month ago