Answered
A Question Regarding Using Clearml-Agent With K8s Clusters

A question regarding using clearml-agent with k8s clusters. We use ClearML pipelines to train our models. The pods sometimes fail due to intermittent issues (OOM, network, etc.), but this is not visible in the ClearML UI; the status is just "failed" with no further information. So our data scientists have to go to the DevOps and MLOps engineers to track down what happened to their pods. To make it worse, the clearml-agent deletes completed pods immediately, which sometimes makes it impossible to debug what exactly happened. Do you have any ideas on how to handle these cases better? How can we improve visibility/monitoring for them? I guess the clearml-agent could report more information about failed pods.
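As a stopgap, the termination details are still available on the pod object until it is deleted. A minimal sketch with the official `kubernetes` Python client, assuming you know the failed pod's name and namespace (both are placeholders here), that prints the exit code and reason:

```python
# Sketch: inspect a failed pod's termination state before the pod object is deleted.
# Assumes the `kubernetes` Python client and a reachable kubeconfig; POD_NAME and
# NAMESPACE are placeholders for your own values.
from kubernetes import client, config

POD_NAME = "clearml-id-abc123"   # hypothetical pod name; use your deployment's naming
NAMESPACE = "clearml"            # hypothetical namespace

config.load_kube_config()        # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name=POD_NAME, namespace=NAMESPACE)
for cs in pod.status.container_statuses or []:
    # A container killed for exceeding its memory limit reports reason "OOMKilled".
    term = cs.state.terminated or cs.last_state.terminated
    if term:
        print(f"{cs.name}: exit_code={term.exit_code}, "
              f"reason={term.reason}, message={term.message}")
```

An OOM-killed container typically shows reason "OOMKilled" with exit code 137.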

  
  
Posted 2 months ago

Answers 6


Hi @<1798887585121046528:profile|WobblyFrog79>, don't the logs in the task show some sort of error?

  
  
Posted 2 months ago

@<1576381444509405184:profile|ManiacalLizard2> but the task controller has access to that information. Before deleting the pod, it could retrieve the exit code and status message that all pods provide, and log them under the "Info" section in ClearML.
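Something along those lines can already be done outside the agent today. A hedged sketch, assuming the pod can be mapped back to its ClearML task ID (the label below is a placeholder) and using the documented `Task.get_task` / `set_user_properties` calls to attach the exit details to the task:

```python
# Sketch: copy a pod's termination state onto its ClearML task before the pod is deleted.
# Assumes the `kubernetes` and `clearml` packages; the pod name and the label used to
# map the pod back to its task ID are placeholders for whatever your deployment uses.
from kubernetes import client, config
from clearml import Task

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="clearml-id-abc123", namespace="clearml")
task_id = (pod.metadata.labels or {}).get("clearml-task-id")  # hypothetical label

term = next(
    (cs.state.terminated or cs.last_state.terminated
     for cs in pod.status.container_statuses or []
     if cs.state.terminated or cs.last_state.terminated),
    None,
)
if task_id and term:
    task = Task.get_task(task_id=task_id)
    # User properties are shown with the task's metadata in the ClearML UI.
    task.set_user_properties(
        pod_exit_code=str(term.exit_code),
        pod_exit_reason=str(term.reason),        # e.g. "OOMKilled"
        pod_exit_message=str(term.message or ""),
    )
```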

  
  
Posted 2 months ago

@<1523701070390366208:profile|CostlyOstrich36> they don't, as the pod is killed as soon as the process inside exceeds the memory limit.

  
  
Posted 2 months ago

I guess when the pods simply crash or disconnect, the clearml-agent won't have a chance to report to the ClearML server: "hey, the network is about to be cut ...".
You will need the k8s-side information to flow back to the DS, telling them the node just died for xyz reason ...
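A small cluster-side watcher is one way to get that information flowing back. A rough sketch with the `kubernetes` watch API (namespace and label selector are placeholders) that reports every pod ending up in a failed state:

```python
# Sketch: watch agent-spawned pods and surface failures (OOMKilled, errors, evictions).
# Namespace and label selector are placeholders; wire the print() into whatever channel
# your data scientists actually read (Slack, ClearML, logs, ...).
from kubernetes import client, config, watch

NAMESPACE = "clearml"                 # placeholder
LABEL_SELECTOR = "app=clearml-task"   # hypothetical label on agent-spawned pods

config.load_kube_config()
v1 = client.CoreV1Api()

for event in watch.Watch().stream(v1.list_namespaced_pod,
                                  namespace=NAMESPACE,
                                  label_selector=LABEL_SELECTOR):
    pod = event["object"]
    if pod.status.phase != "Failed":
        continue
    for cs in pod.status.container_statuses or []:
        term = cs.state.terminated or cs.last_state.terminated
        if term:
            print(f"{pod.metadata.name}/{cs.name} failed: "
                  f"exit_code={term.exit_code}, reason={term.reason}")
```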

  
  
Posted 2 months ago

I'm not talking about node failure, but rather pod failure, which is out-of-memory in 99% of the cases.

  
  
Posted 2 months ago

Logging the pod's exit code and status message before deleting the pod would be very useful. The data scientists would see that an OOM happened and wouldn't need to bother other teams to find out what happened.
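If you want the reason visible right on the failed task, the SDK also exposes `Task.mark_failed` with a status reason and message, so a hook run before pod deletion could set them from the termination state. A hedged sketch, reusing `task` and `term` from the snippets above:

```python
# Sketch: surface the pod's termination details as the task's failure status, so the
# ClearML UI shows e.g. "OOMKilled (exit code 137)" on the failed task. `task` is a
# clearml.Task and `term` a V1ContainerStateTerminated from the snippets above; how
# they are obtained is an assumption about your setup.
task.mark_failed(
    ignore_errors=True,
    force=True,  # allow overriding the generic "failed" status already set
    status_reason=f"Pod terminated: {term.reason} (exit code {term.exit_code})",
    status_message=term.message or "",
)
```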

  
  
Posted 2 months ago