Answered
A Question Regarding Using clearml-agent With k8s Clusters

A question regarding using clearml-agent with k8s clusters. We use ClearML pipelines to train our models. The pods sometimes fail due to intermittent issues (OOM, network, etc.), but this is not visible in the ClearML UI; the status is just "failed" with no further information, so our data scientists have to go to the DevOps and MLOps engineers to track down what happened to their pods. To make it worse, the clearml-agent deletes completed pods immediately, which sometimes makes it impossible to debug what exactly happened. Do you have any ideas on how to handle these cases better? How can we improve visibility/monitoring for them? I guess the clearml-agent could report more information about failed pods.
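
For anyone hitting the same problem, here is a minimal workaround sketch (not from the original post) using the official kubernetes Python client to read a failed pod's termination reason and exit code before the pod disappears; the namespace and label selector are placeholders:

```python
# Sketch: list agent-spawned pods and print why their containers terminated.
# "clearml" and the label selector are hypothetical values for illustration.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="clearml", label_selector="app=clearml-agent-task")
for pod in pods.items:
    for cs in (pod.status.container_statuses or []):
        term = cs.state.terminated or (cs.last_state.terminated if cs.last_state else None)
        if term is not None:
            # e.g. reason="OOMKilled", exit_code=137 for an out-of-memory kill
            print(f"{pod.metadata.name}/{cs.name}: reason={term.reason}, exit_code={term.exit_code}")
```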

  
  
Posted one month ago

Answers 6


Hi @<1798887585121046528:profile|WobblyFrog79> , don't the logs in the task show some sort of error?

  
  
Posted one month ago

I guess when the pods simply crash or disconnect, the clearml-agent won't have a chance to report to the ClearML server: "hey, the network is about to be cut ..."
You would need the k8s-side logic to flow back to the DS that the node just died for xyz reason ...
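
For illustration, a rough sketch of pulling that k8s-side information from pod events (assuming the official kubernetes Python client; the namespace and pod name are hypothetical):

```python
# Sketch: Kubernetes events are where OOM kills, evictions and scheduling
# failures are recorded, so they could be surfaced back to the data scientists.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(
    namespace="clearml",
    field_selector="involvedObject.name=clearml-id-abc123",  # hypothetical pod name
)
for ev in events.items:
    print(f"{ev.last_timestamp} {ev.reason}: {ev.message}")
```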

  
  
Posted one month ago

@<1523701070390366208:profile|CostlyOstrich36> they don't, as the pod is killed as soon as the process inside exceeds the memory limit

  
  
Posted one month ago

Logging the pod's exit code and status message before deleting the pod would be very useful. The data scientists would see that an OOM happened and wouldn't have to bother other teams to find out what went wrong.

  
  
Posted one month ago

I'm not talking about node failure but rather pod failure, which is out-of-memory in 99% of cases.

  
  
Posted one month ago

@<1576381444509405184:profile|ManiacalLizard2> but the task controller has access to that information. Before deleting the pod, it could retrieve the exit code and status message that every pod provides and log them under the "Info" section in ClearML.
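
A rough sketch of what that proposal could look like (this is an assumption, not current clearml-agent behaviour; the task id, pod name and namespace are placeholders, and Task.set_comment is used on the assumption that the comment surfaces in the task's Info/description area):

```python
# Sketch of the proposal: before the pod is deleted, read its terminated state
# and attach it to the corresponding ClearML task.
from clearml import Task
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="clearml-id-abc123", namespace="clearml")  # hypothetical names
task = Task.get_task(task_id="abc123")  # hypothetical task id

for cs in (pod.status.container_statuses or []):
    term = cs.state.terminated or (cs.last_state.terminated if cs.last_state else None)
    if term is not None:
        msg = (f"Pod {pod.metadata.name} container {cs.name} terminated: "
               f"reason={term.reason}, exit_code={term.exit_code}")
        task.get_logger().report_text(msg)  # appears in the task console log
        task.set_comment(msg)               # appears in the task description/Info area
```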

  
  
Posted one month ago