Hello. I Have An Issue In Regards To A Task That I Run As A Service ( Should Always Run). I Run The Clearml Server And Agents In Kubernetes. I Think This Is A Design Problem With The Way Clearml Agents Run On Kubernetes. The K8S Glue Will Launch A Worker

Unanswered

Here is what I see as the ideal scenario:
If a worker pod running a task dies for any reason, clearml should mark the task as failed / aborted asap. Basically improve the feedback loop. Tasks running as services should be re-enqueued automatically if a the pod it runs on dies because of OOM, node eviction, node replacement, pod replacement because of autoscaling etc. You could argue the same for tasks which are not services. Restart them if their pod dies for the above reasons.

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					DangerousDragonfly8
				
					0
					 × 1

428 Views

0 Answers

2 years ago