Hi @<1798887585121046528:profile|WobblyFrog79> , don't the logs in the task show some sort of error?
I guess when the pod simply crashes or disconnects, the clearml agent won't have a chance to report to the ClearML server: hey, the network is about to be cut ...
You would need some k8s logic to flow that information back to the DS: the node just died for xyz reason ...
@<1523701070390366208:profile|CostlyOstrich36> they don't, as the pod is killed as soon as the process inside exceeds the memory limit.
Logging the pod's exit code and status message before deleting the pod would be very useful. The data scientists would see that an OOM happened and wouldn't have to bother other teams to find out what happened.
I'm not talking about node failure, but rather pod failure, which is out-of-memory in 99% of cases.
@<1576381444509405184:profile|ManiacalLizard2> but the task controller does have access to that information. Before deleting the pod, it could retrieve the exit code and status message that every pod provides, and log them under the "Info" section in ClearML.
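Something along these lines would already be enough (just a rough sketch of the idea, not how the ClearML k8s glue actually does it; `pod_name`, `namespace`, and `task_id` are placeholders):
```python
# Sketch: read the container termination state before the pod is deleted
# and attach it to the ClearML task, e.g. as the task comment ("Info" tab).
from kubernetes import client, config
from clearml import Task

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod_status(name=pod_name, namespace=namespace)
for cs in pod.status.container_statuses or []:
    # termination info lives in state.terminated (or last_state.terminated after a restart)
    term = (cs.state and cs.state.terminated) or (cs.last_state and cs.last_state.terminated)
    if term:
        # an OOM kill typically shows up as exit_code=137, reason="OOMKilled"
        task = Task.get_task(task_id=task_id)
        task.set_comment(
            f"Pod {pod.metadata.name} container {cs.name} terminated: "
            f"exit_code={term.exit_code}, reason={term.reason}, message={term.message}"
        )
```
That way the DS sees "OOMKilled" directly on the task instead of a silent failure.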