Unanswered
Hello. I Have An Issue In Regards To A Task That I Run As A Service ( Should Always Run). I Run The Clearml Server And Agents In Kubernetes.
I Think This Is A Design Problem With The Way Clearml Agents Run On Kubernetes. The K8S Glue Will Launch A Worker
Here is what I see as the ideal scenario:
If a worker pod running a task dies for any reason, clearml should mark the task as failed / aborted asap. Basically improve the feedback loop. Tasks running as services should be re-enqueued automatically if a the pod it runs on dies because of OOM, node eviction, node replacement, pod replacement because of autoscaling etc. You could argue the same for tasks which are not services. Restart them if their pod dies for the above reasons.
192 Views
0
Answers
2 years ago
one year ago