This means that if something happens to the k8s node the pod runs on,
Actually, if the pod crashed (the pod, not the Task), k8s should re-spin it, no?
I have also seen that if a worker pod running a task is terminated, ClearML does not fail/abort the task.
From the k8s perspective, when the Task ends (failed/completed) the pod always returns exit code 0, i.e. success, because the agent was able to spin the Task. We do not want Tasks that raised an exception to litter k8s with endless retries ...
Does that make sense ?
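In other words, the Task's failure is recorded in its status, but the wrapper process still exits 0 so k8s does not keep restarting the pod. A minimal sketch of that decoupling (hypothetical helper, not the actual agent code):

```python
def run_task(task_fn):
    """Hypothetical agent-style wrapper: run the Task, record its
    outcome in the task status, but always report exit code 0 so the
    orchestrator (k8s) treats the pod as successful and does not retry."""
    status = "completed"
    try:
        task_fn()
    except Exception as exc:
        # The Task failed -- record it in the status, but do NOT
        # propagate a non-zero exit code to the pod.
        status = f"failed: {exc}"
    return status, 0  # exit code is 0 either way


# A Task that raises is marked failed, yet the pod still "succeeds":
status, exit_code = run_task(lambda: 1 / 0)
```

So a failed Task and a completed Task look identical to k8s; the real outcome lives in the ClearML backend.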
Now, for example, the pod was killed because I had to replace the node. The task is stuck in "Running".
There is a "timer" in the backend (default: 2 hours): if a Task is marked running but does not "ping" (i.e. send a keep-alive), it will be set to aborted.
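The watchdog decision described above can be sketched generically (hypothetical logic, not the actual backend code; the 2-hour default is the one mentioned in this thread):

```python
from datetime import datetime, timedelta

# Default non-responsive timeout mentioned in the thread.
NON_RESPONSIVE_TIMEOUT = timedelta(hours=2)


def should_abort(status: str, last_ping: datetime, now: datetime) -> bool:
    """Hypothetical backend watchdog check: a Task that is still marked
    running but has not pinged within the timeout window gets aborted."""
    return status == "in_progress" and (now - last_ping) > NON_RESPONSIVE_TIMEOUT


# A pod killed 3 hours ago stops pinging, so its Task is aborted;
# a Task that pinged 1 hour ago is left alone.
now = datetime(2024, 1, 1, 12, 0)
stale = should_abort("in_progress", now - timedelta(hours=3), now)
fresh = should_abort("in_progress", now - timedelta(hours=1), now)
```

So after a node replacement kills a pod, the Task sits in "Running" until this timeout elapses, then flips to "Aborted".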