With pipelines it is even more complicated. What I experienced is that the pod for step 2 was evicted because it was using too much memory. The pod was terminated, but the task was not marked as failed / aborted. Because of that, the pipeline controller pod kept running and the pipeline itself was also not marked as aborted / failed.
But the pre_execute_callback from pipe.add_function_step needs to be fixed: it does run before the task is executed, but the Node does not have any attributes set besides the name.
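For anyone hitting the same thing, here is a minimal sketch of a pre_execute_callback that copes with a mostly empty Node by only reporting the attributes that are actually set. The (pipeline, node, param_override) signature and the "return False to skip the step" behavior follow my reading of the ClearML docs; the attribute names in FIELDS and the stub node are assumptions for illustration, not a confirmed list of what the Node carries at that point.

```python
from types import SimpleNamespace

# Assumed attribute names; adjust to whatever your Node really exposes.
FIELDS = ("name", "base_task_id", "queue", "parents", "job")


def describe_node(node):
    """Return only the node attributes that exist and are not None."""
    found = {}
    for field in FIELDS:
        value = getattr(node, field, None)
        if value is not None:
            found[field] = value
    return found


def pre_execute_callback(pipeline, node, param_override):
    """Log what the node carries before the step is launched."""
    print(f"pre_execute: {describe_node(node)}")
    # Returning True lets the step run; False would skip it.
    return True


# Stand-in for a Node that only has its name set (hypothetical stub).
stub_node = SimpleNamespace(name="step_two", queue=None)
```

It would be passed as pipe.add_function_step(..., pre_execute_callback=pre_execute_callback). Printing the populated attributes this way makes it easy to show exactly what is missing when reporting the issue.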
SmugDolphin23 ok, so pipe.start with step_task_completed_callback does indeed work, because step_task_completed_callback runs before the task is executed. step_task_created_callback, however, seems to run after the task is executed, so the naming seems to be reversed.
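One way to double check the ordering is to pass both callbacks to pipe.start and record when each one fires. This is only a sketch: the CallbackOrderRecorder class is hypothetical, and the callback signatures (pipeline, node, param_override) and (pipeline, node) are assumed from the ClearML docs. The stub node at the bottom just exercises the recorder locally.

```python
import time
from types import SimpleNamespace


class CallbackOrderRecorder:
    """Collects callback invocations in the order they happen."""

    def __init__(self):
        self.events = []

    def _record(self, label, node):
        # Keep a timestamp so the log can be compared with task start/end times.
        self.events.append((time.monotonic(), label, getattr(node, "name", None)))

    def on_created(self, pipeline, node, param_override=None):
        self._record("created", node)
        return True  # do not skip the step

    def on_completed(self, pipeline, node):
        self._record("completed", node)

    def order(self):
        """Return (label, step name) pairs in the order they were recorded."""
        return [(label, name) for _, label, name in self.events]


# Local dry run with a stand-in node (hypothetical stub).
recorder = CallbackOrderRecorder()
node = SimpleNamespace(name="step_one")
recorder.on_created(None, node, {})
recorder.on_completed(None, node)
```

In a real run this would be wired up as pipe.start(step_task_created_callback=recorder.on_created, step_task_completed_callback=recorder.on_completed), and recorder.order() (or the timestamps) compared against when the step task actually started and finished.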
I have also seen that if a worker pod running a task is terminated, ClearML does not fail/abort the task.
Thank you SmugDolphin23 I'll try it out.
Here is what I see as the ideal scenario:
If a worker pod running a task dies for any reason, ClearML should mark the task as failed / aborted asap. Basically, improve the feedback loop. Tasks running as services should be re-enqueued automatically if the pod they run on dies because of OOM, node eviction, node replacement, pod replacement because of autoscaling, etc. You could argue the same for tasks which are not services: restart them if their pod dies for the above reasons.
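To make the scenario above concrete, here is a rough sketch of the decision logic such a watchdog could apply. Everything in it is hypothetical: the PodEvent type, the Action enum, and the set of "infrastructure" reasons are made up for illustration, and this is not how ClearML behaves today. Acting on the result would presumably map onto SDK calls such as Task.mark_failed() and Task.enqueue(), but that wiring is an assumption and is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    MARK_FAILED = "mark_failed"
    REENQUEUE = "reenqueue"


# Illustrative termination reasons treated as infrastructure failures.
INFRA_REASONS = {"OOMKilled", "Evicted"}


@dataclass(frozen=True)
class PodEvent:
    task_id: str
    task_status: str  # status the server still reports, e.g. "in_progress"
    pod_alive: bool
    reason: str       # why the pod went away, if it did
    is_service: bool


def decide(event, restart_regular_tasks=False):
    """Pick what to do with a task whose pod may have died."""
    # Nothing to do if the pod is alive or the task is no longer running.
    if event.pod_alive or event.task_status != "in_progress":
        return Action.NONE
    # Services (and optionally regular tasks) get restarted on infra failures.
    if event.reason in INFRA_REASONS and (event.is_service or restart_regular_tasks):
        return Action.REENQUEUE
    # Otherwise close the feedback loop by failing the task right away.
    return Action.MARK_FAILED
```

The restart_regular_tasks flag reflects the open question above of whether non-service tasks should be restarted too; with it off, they are simply marked as failed so the pipeline controller stops waiting.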
No problem SmugDolphin23 and thank you. I am really quite stuck with this 😄
Actually it does not, because the pod's logs show .
If I right-click on the initial pipeline Draft and hit "Run" from there, the new run wizard is populated with the default parameter values and uses "set_default_execution_queue" as the queue under "Advanced configuration".