Hello. I Have An Issue In Regards To A Task That I Run As A Service ( Should Always Run). I Run The Clearml Server And Agents In Kubernetes. I Think This Is A Design Problem With The Way Clearml Agents Run On Kubernetes. The K8S Glue Will Launch A Worker

Answered

Hello. I have an issue with a task that I run as a service (it should always be running). I run the ClearML server and agents in Kubernetes.

I think this is a design problem with the way ClearML agents run on Kubernetes. The k8s glue launches a worker pod for each task. This means that if something happens to the k8s node the pod runs on, the pod is terminated and the ClearML task is aborted. The pod is not recreated because it is a bare pod resource that is not controlled by a Deployment. The ClearML task is also not automatically re-enqueued.

I think this is a problem especially for tasks that should run as a service. You should not have to go in and re-enqueue every service task whenever a Kubernetes node dies, needs to be replaced, etc. I imagine the easiest solution would be to automatically re-enqueue service tasks if they have been aborted because their worker is no longer reachable or has been terminated.
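To make the proposal concrete, here is a minimal sketch of the selection logic such a watchdog could use. Everything in it is an assumption for illustration: `TaskRecord`, its fields, the `"aborted"` status string, and the `enqueue` callback are hypothetical and not part of ClearML or the k8s glue.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TaskRecord:
    """Hypothetical snapshot of a task; not a ClearML SDK type."""
    task_id: str
    status: str         # illustrative values: "in_progress", "aborted", ...
    is_service: bool    # assumption: service tasks can be identified somehow
    worker_alive: bool  # assumption: whether the worker pod still responds


def select_for_requeue(tasks: Iterable[TaskRecord]) -> List[TaskRecord]:
    """Pick service tasks that were aborted and whose worker is gone."""
    return [
        t for t in tasks
        if t.is_service and t.status == "aborted" and not t.worker_alive
    ]


def requeue_services(
    tasks: Iterable[TaskRecord], enqueue: Callable[[str], None]
) -> List[str]:
    """Re-enqueue each selected task through the injected callback."""
    requeued = []
    for task in select_for_requeue(tasks):
        enqueue(task.task_id)
        requeued.append(task.task_id)
    return requeued
```

In a real setup the `enqueue` callback would have to wrap whatever actually puts the task back on a queue (for example a call into the ClearML SDK), and the records would have to be built from the server's view of tasks and workers. Neither part is shown here.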

Any thoughts?

  				
Posted 2 years ago by DangerousDragonfly8


Answers 7

Here is what I see as the ideal scenario:
If a worker pod running a task dies for any reason, ClearML should mark the task as failed / aborted as soon as possible. Basically, improve the feedback loop. Tasks running as services should be re-enqueued automatically if the pod they run on dies because of OOM, node eviction, node replacement, pod replacement due to autoscaling, etc. You could argue the same for tasks that are not services: restart them if their pod dies for any of the above reasons.
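As a rough sketch of that rule, the decision could separate infrastructure failures from tasks that ended on their own, and cap the number of retries so that a task which keeps getting OOM-killed does not loop forever. The function name, its parameters, and the retry cap are assumptions for illustration. `"OOMKilled"` and `"Evicted"` are reason strings Kubernetes reports, but how they would reach ClearML is not specified here.

```python
from typing import Optional

# Termination reasons treated as infrastructure failures in this sketch.
# A real list would likely be longer (node replacement, autoscaling, ...).
INFRA_REASONS = {"OOMKilled", "Evicted"}


def should_requeue(
    reason: Optional[str],
    task_finished: bool,
    attempts: int,
    max_attempts: int = 3,  # hypothetical cap to avoid endless retries
) -> bool:
    """Return True if the pod died for an infrastructure reason mid-task."""
    if task_finished:
        # The task itself completed or failed; do not retry it.
        return False
    if attempts >= max_attempts:
        return False
    return reason in INFRA_REASONS
```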

  				
Posted 2 years ago by DangerousDragonfly8

> This means that if something happens with the k8s node the pod runs on,

Actually, if the pod crashed (the pod, not the Task), k8s should respin it, no?

> I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.

From the k8s perspective, if the Task ended (failed or completed), the pod always returns exit code 0, i.e. success, because the agent was able to spin up the Task. We do not want Tasks that raised an exception to litter k8s with endless retries...
Does that make sense?

> Now for example the pod was killed because I had to replace the node. The task is stuck in "Running".

There is a "timer" in the backend (default 2 hours): if a Task is marked as running but does not "ping" (i.e. it is not alive), the backend will set it to aborted.
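To illustrate the idea of that timer (this is not the actual backend code), a stale-task check could look roughly like the sketch below. The function name and the `last_ping` mapping are hypothetical. Only the 2 hour default comes from the description above.

```python
from datetime import datetime, timedelta
from typing import Dict, List

DEFAULT_TIMEOUT = timedelta(hours=2)  # the default mentioned above


def find_stale_tasks(
    last_ping: Dict[str, datetime],
    now: datetime,
    timeout: timedelta = DEFAULT_TIMEOUT,
) -> List[str]:
    """Return IDs of running tasks whose last ping is older than `timeout`."""
    return sorted(
        task_id for task_id, seen in last_ping.items() if now - seen > timeout
    )
```

Each returned task would then be set to aborted. With a shorter `timeout`, a task whose pod was killed would be noticed sooner, at the cost of more false positives for tasks that are merely slow to report.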

  				
Posted 2 years ago by AgitatedDove14

From the k8s perspective a pod is ephemeral, so if it’s gone for any reason, it’s gone. Obviously there are structures that can ensure a running state (like Deployments or StatefulSets), so that if a pod dies another one takes its place. We didn’t go in this direction because pods are not idempotent, so it’s not straightforward to simply replace them. Btw, this looks like an interesting topic to me, so I’d like to include SuccessfulKoala55 on this, also because I’m involved more in the infra side of the equation and I may be missing something here.

  				
Posted 2 years ago by JuicyFox94

I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.

  				
Posted 2 years ago by DangerousDragonfly8

With pipelines it is even more complicated. What I experienced is that the pod for step 2 was evicted because it was using too much memory. The pod was terminated, but the task was not marked as failed / aborted. Because of that, the pipeline controller pod kept running, and the pipeline itself was not marked as aborted / failed either.

  				
Posted 2 years ago by DangerousDragonfly8

Now for example the pod was killed because I had to replace the node. The task is stuck in "Running". Aborting from the UI says "experiment aborted successfully" but the state does not change.

  				
Posted 2 years ago by DangerousDragonfly8

I am trying to run with scale-from-zero k8s nodes for maximum cost savings, so a node should only be online while ClearML is actually running a task. Waiting for the 2 hour timeout is quite wasteful when running on expensive GPU instances, for example, because the pipeline controller pod keeps the node online.

  				
Posted 2 years ago by DangerousDragonfly8
