SmugDolphin23 ok so pipe.start with step_task_completed_callback does indeed work, because step_task_completed_callback runs before the task is executed. step_task_created_callback, however, seems to run after the task is executed, so the naming seems to be reversed.
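For reference, this is roughly the pattern I mean (a minimal sketch; the project, queue and step names are made up, and the comments only record the observation above, not documented behaviour):
```python
from clearml import PipelineController

# Roughly the setup described above; project/queue/step names are illustrative.
pipe = PipelineController(name="example pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="step_1",
    base_task_project="examples",
    base_task_name="step 1 base task",
)


def on_step_created(pipeline, node, parameters):
    # Per the observation above, this seemed to fire only after the step had run.
    print(f"step_task_created_callback: node={node.name} parameters={parameters}")


def on_step_completed(pipeline, node):
    # ...while this one fired before the step was actually executed.
    print(f"step_task_completed_callback: node={node.name}")


pipe.start(
    queue="services",
    step_task_created_callback=on_step_created,
    step_task_completed_callback=on_step_completed,
)
```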
AgitatedDove14
I do believe triggers should be unique somehow, because I find them way too easy to mishandle, especially when used with a schedule_function that is defined in the same script. Updating that function requires deleting the existing trigger task first and recreating it. If you don't do this, you just end up with two trigger tasks with the same name, which I assume will respond to the same event(s) but do something slightly different in response. I assume it might work like this...
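Something along these lines (a minimal sketch; the trigger name, project and queue are placeholders, and I'm assuming the TriggerScheduler.add_task_trigger API from clearml.automation):
```python
from clearml.automation import TriggerScheduler


def on_trigger(task_id):
    # Defined in the same script as the trigger; editing this function only takes
    # effect once the existing trigger task is deleted and the script is re-run.
    print(f"triggered by task {task_id}")


trigger = TriggerScheduler(pooling_frequency_minutes=3.0)
trigger.add_task_trigger(
    name="my task trigger",
    schedule_function=on_trigger,
    trigger_project="examples",
    trigger_on_status=["completed"],
)
# Re-running this without deleting the previous trigger task first leaves two
# trigger tasks with the same name, as described above.
trigger.start_remotely(queue="services")
```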
Thank you SmugDolphin23 I'll try it out.
Here is what I see as the ideal scenario:
If a worker pod running a task dies for any reason, ClearML should mark the task as failed / aborted as soon as possible; basically, improve the feedback loop. Tasks running as services should be re-enqueued automatically if the pod they run on dies because of OOM, node eviction, node replacement, pod replacement due to autoscaling, etc. You could argue the same for tasks which are not services: restart them if their pod dies for the above reasons.
Right now, for example, the pod was killed because I had to replace the node, and the task is stuck in "Running". Aborting from the UI says "experiment aborted successfully", but the state does not change.
With pipelines it is even more complicated: what I experienced is that the pod for step 2 was evicted because it was eating too much memory. So the pod was terminated, but the task was not marked as failed / aborted. Because of that, the pipeline controller pod was still running, and the pipeline itself was also not marked as aborted / failed.
I am trying to run with scale-from-zero k8s nodes for maximum cost savings, so a node should only be online if ClearML is actually running a task. Waiting for the 2-hour timeout when running on expensive GPU instances, for example, is quite wasteful, because the pipeline controller pod will keep the node online.
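In the meantime, the closest workaround I can think of is forcing the state change from the SDK (a hedged sketch, not the proper feedback-loop fix; the project name and status message are made up, and you would still need to confirm the pod is really gone before forcing anything):
```python
from clearml import Task

# Find tasks the server still considers "in_progress" even though their pod is gone.
stale_tasks = Task.get_tasks(
    project_name="examples",
    task_filter={"status": ["in_progress"]},
)

for task in stale_tasks:
    # Only force-abort tasks whose pod is confirmed dead (that check is left out here).
    task.mark_stopped(force=True, status_message="pod terminated, aborted manually")
```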
Hi WackyRabbit7. Take a look at https://clear.ml/docs/latest/docs/references/sdk/task#taskget_task
I believe it describes your use case as an example.
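Something like this (a minimal sketch; the task ID, project and task name are placeholders):
```python
from clearml import Task

# Fetch an existing task either by its ID...
task = Task.get_task(task_id="<your_task_id>")

# ...or by project and task name.
task = Task.get_task(project_name="examples", task_name="my experiment")

print(task.get_parameters())
```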
CostlyOstrich36 it works like this