CostlyOstrich36 it works like this
TimelyMouse69 The pipeline task(s) end up in a sub-project called ".pipelines", no matter how I configure the PipelineController project name and target project. This ".pipelines" project is not visible from the PROJECTS section of the UI; you can only get to it from the PIPELINES view by clicking "Full details" on a step.
Please see the attached images.
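For reference, this is roughly the configuration being described; a minimal sketch assuming a recent clearml SDK, with placeholder project and step names:

```python
from clearml import PipelineController

# Both the pipeline's own project and target_project are set explicitly,
# yet the step tasks still land in a hidden ".pipelines" sub-project.
pipe = PipelineController(
    name="my-pipeline",
    project="my-project",          # where the controller task should live
    version="1.0",
    target_project="my-project",   # where cloned step tasks should live
)
pipe.add_step(
    name="step_1",
    base_task_project="my-project",
    base_task_name="step 1 template",
)
pipe.start(queue="services")
```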
I will try to fix that. But what is the purpose of the 'k8s_scheduler' queue?
Now, for example, the pod was killed because I had to replace the node. The task is stuck in "Running". Aborting from the UI reports "experiment aborted successfully", but the state does not change.
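As a stopgap, a task stuck in "Running" can usually be forced out of that state through the API; a minimal sketch, assuming the clearml APIClient and a placeholder task id:

```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
# force=True pushes the status change through even though no agent is
# alive to acknowledge the abort (which is why the UI abort has no effect)
client.tasks.stopped(
    task="<task-id>",
    force=True,
    status_reason="worker pod was killed during node replacement",
)
```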
I am also seeing that if a worker pod running a task is terminated, ClearML does not fail or abort the task.
Actually it does not, because the pod's logs show …
With pipelines it is even more complicated. What I experienced is that the pod for step 2 was evicted because it was using too much memory. The pod was terminated, but the task was not marked as failed/aborted. Because of that, the pipeline controller pod kept running and the pipeline itself was also not marked as aborted/failed.
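Related: once a failed step actually gets detected, the controller can at least be told to stop instead of idling. A one-line sketch, assuming the abort_on_failure flag of recent clearml versions; note it only helps after the step task is really marked failed, which is exactly the missing piece here:

```python
pipe = PipelineController(
    name="my-pipeline",
    project="my-project",
    abort_on_failure=True,  # abort the whole pipeline if any step fails
)
```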
I am trying to run with scale-from-zero k8s nodes for maximum cost savings, so a node should only be online if ClearML is actually running a task. Waiting for the 2 hour timeout when running on expensive GPU instances, for example, is quite wasteful, because the pipeline controller pod will keep the node online.
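(Side note: if the 2 hour timeout here is the server's non-responsive task watchdog, it can supposedly be tightened via the apiserver setting services.tasks.non_responsive_tasks_watchdog.threshold_sec; that is an assumption about which mechanism is in play, so verify it against your server configuration.)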
Here is what I see as the ideal scenario:
If a worker pod running a task dies for any reason, ClearML should mark the task as failed/aborted as soon as possible; basically, improve the feedback loop. Tasks running as services should be re-enqueued automatically if the pod they run on dies because of OOM, node eviction, node replacement, pod replacement due to autoscaling, etc. You could argue the same for tasks which are not services: restart them if their pod dies for the above reasons.
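To make the ask concrete, here is a rough sketch of such a feedback loop. It assumes the agent's k8s glue names worker pods "clearml-id-<task_id>" (a convention that may differ per deployment) plus recent clearml and kubernetes Python clients, so treat it as a sketch rather than a drop-in tool:

```python
import time

from kubernetes import client, config
from clearml import Task

config.load_kube_config()  # use load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

POD_PREFIX = "clearml-id-"  # assumed k8s-glue pod naming convention
NAMESPACE = "clearml"       # placeholder namespace

def live_task_ids():
    """Task ids that still have a worker pod in the cluster."""
    pods = v1.list_namespaced_pod(namespace=NAMESPACE)
    return {
        p.metadata.name[len(POD_PREFIX):]
        for p in pods.items
        if p.metadata.name.startswith(POD_PREFIX)
    }

while True:
    alive = live_task_ids()
    # every task the server still believes is running...
    for task in Task.get_tasks(task_filter={"status": ["in_progress"]}):
        if task.id not in alive:
            # ...whose pod is gone gets failed immediately, instead of
            # waiting hours for a server-side timeout to kick in
            task.mark_failed(status_reason="worker pod disappeared")
    time.sleep(60)
```

Re-enqueueing service tasks could hang off the same loop (e.g. via Task.enqueue after the failure is recorded), but the fail-fast part above is the core of the feedback loop being asked for.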