
How does ClearML handle running jobs while the server is down?
We saw mismatches between what ClearML logged and the actual iteration during periods when clearml-server was down.
We run distributed training using PyTorch. It looks like the jobs were killed by the watchdog timer and did not continue to the next epoch while the server was down (for ~3 hours, for backup purposes).
Is it possible that when a job fails to report to ClearML, it does not continue to the next epoch and stops consuming any resources, which results in PyTorch killing the job?

  
  
Posted 3 months ago

Answers 2


Hi @PleasantOwl46, I think that is what is happening. If the server is down, the code continues running as if nothing happened, and ClearML simply caches all results and flushes them once the server is back up.
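To make the cache-and-flush idea concrete, here is a minimal Python sketch of the general pattern. It is not ClearML's actual implementation: `BufferedReporter`, `send`, and `retry_seconds` are hypothetical names chosen for illustration, and the sketch assumes the transport raises `ConnectionError` while the server is unreachable. The point is only that `report()` returns immediately, so the training loop does not wait on the server.

```python
import queue
import threading


class BufferedReporter:
    """Hypothetical cache-and-flush reporter (illustration only, not ClearML code)."""

    def __init__(self, send, retry_seconds=1.0):
        # `send` is an assumed callable that raises ConnectionError
        # while the server is unreachable.
        self._send = send
        self._retry_seconds = retry_seconds
        self._pending = queue.Queue()  # in-memory cache of unsent records
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def report(self, record):
        # Only enqueues, so the caller (the training loop) never blocks here.
        self._pending.put(record)

    def wait_until_flushed(self):
        # Blocks until every queued record has been handled by the worker.
        self._pending.join()

    def close(self):
        # Stops the worker; records still queued at this point are not sent.
        self._stop.set()
        self._worker.join()

    def _flush_loop(self):
        while not self._stop.is_set():
            try:
                record = self._pending.get(timeout=0.1)
            except queue.Empty:
                continue
            try:
                self._deliver(record)
            finally:
                self._pending.task_done()

    def _deliver(self, record):
        # Retry the same record until it is delivered, which preserves order.
        while True:
            try:
                self._send(record)
                return
            except ConnectionError:
                # Event.wait returns True if close() was called: give up.
                if self._stop.wait(self._retry_seconds):
                    return
```

With a design like this, a server outage only grows the in-memory queue; whether a real job keeps advancing during an outage depends on whether every call in its loop is non-blocking in this sense.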

  
  
Posted 3 months ago

@CostlyOstrich36 unfortunately, this is not the behavior we are seeing.
The exact same issue happened tonight.
At epoch 53 ClearML was shut down; the job did not continue to epoch 54 and was eventually killed by the watchdog timer.

  
  
Posted 3 months ago