CostlyOstrich36 it works like this
Here is what I see as the ideal scenario:
If a worker pod running a task dies for any reason, clearml should mark the task as failed / aborted asap. Basically improve the feedback loop. Tasks running as services should be re-enqueued automatically if the pod they run on dies because of OOM, node eviction, node replacement, pod replacement because of autoscaling etc. You could argue the same for tasks which are not services: restart them if their pod dies for the above reasons.
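To make the wish above concrete, here is a minimal sketch of that policy as a plain decision function. It only illustrates the idea and is not something clearml does today. `PodDeath`, `decide_action`, the retry cap and the action names are all hypothetical; `OOMKilled` and `Evicted` are the reason strings Kubernetes reports for those two cases, and the set would need extending for node replacement or autoscaler scale-down on a given cluster.

```python
from dataclasses import dataclass

# Reason strings Kubernetes reports for infrastructure-level terminations.
# Assumption: extend this set to match what your cluster actually reports.
INFRA_REASONS = {"OOMKilled", "Evicted"}


@dataclass
class PodDeath:
    reason: str        # termination reason taken from the pod status
    is_service: bool   # True if the task runs as a service
    retries: int = 0   # how many times the task was already re-enqueued


def decide_action(death, max_retries=3, requeue_non_services=False):
    """Return 'requeue' or 'fail' for a task whose pod died (hypothetical policy)."""
    infra_failure = death.reason in INFRA_REASONS
    # Anything that is not an infrastructure failure, or that keeps dying,
    # is reported as failed so the feedback loop stays short.
    if not infra_failure or death.retries >= max_retries:
        return "fail"
    # Services are always restarted; regular tasks only if opted in.
    if death.is_service or requeue_non_services:
        return "requeue"
    return "fail"
```

The retry cap is there because a task that always runs out of memory would otherwise be re-enqueued forever.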
Ah I did not think to look for that option in the user's settings. That should do it. Thank you for the help 🙂
ok, i'll try to fix the connection issue. Thank you for the help 🙂
Now for example the pod was killed because I had to replace the node. The task is stuck in "Running". Aborting from the UI says "experiment aborted successfully" but the state does not change.
I will try to fix that. But what is the purpose of the 'k8s_scheduler' queue?
With pipelines it is even more complicated, because what I experienced is that the pod for step 2 was evicted because it was eating too much memory. So the pod was terminated but the task was not marked as failed / aborted. Because of that, the pipeline controller pod was still running and the pipeline itself was also not marked as aborted / failed.
SmugDolphin23 ok so step_task_completed_callback does indeed work because step_task_completed_callback runs before the task is executed. step_task_created_callback seems to run after the task is executed however so the naming seems to be reversed.
pre_execute_callback from the pipe.add_function_step needs to be fixed, it does run before the task is executed but the Node does not have any attributes set besides the name.
For a bit more context: let's say I have 2 experiments in "Project MLOps" called "Exp 1" and "Exp 2". When I publish "Exp 2" I want this trigger to pick up that event and start another task in some other project. But this task would need some information about "Exp 2" like its name, id or maybe config object etc.
Does the trigger pass any context to the task which will be executed?
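One way this could look, as a rough sketch rather than a confirmed answer: as far as I understand the documented signature, a function passed as schedule_function to TriggerScheduler.add_task_trigger is called with the ID of the task that fired the trigger, so the details of "Exp 2" can be looked up from there. `TEMPLATE_TASK_ID`, `build_followup_params`, `on_published`, the "General/..." parameter names and the "default" queue are all placeholders made up for this example.

```python
# Placeholder: ID of an existing task to clone as the follow-up job.
TEMPLATE_TASK_ID = "<template-task-id>"


def build_followup_params(task_id, task_name, project_name):
    """Parameters handed to the follow-up task (key names are hypothetical)."""
    return {
        "General/source_task_id": task_id,
        "General/source_task_name": task_name,
        "General/source_project": project_name,
    }


def on_published(task_id):
    """Called by the scheduler with the ID of the task that fired the trigger."""
    # Imported lazily so the helper above can be used without clearml installed.
    from clearml import Task

    source = Task.get_task(task_id=task_id)
    params = build_followup_params(source.id, source.name, source.get_project_name())
    followup = Task.clone(source_task=TEMPLATE_TASK_ID, name="follow-up for " + source.name)
    followup.update_parameters(params)
    Task.enqueue(followup, queue_name="default")


def main():
    from clearml.automation import TriggerScheduler

    scheduler = TriggerScheduler()
    scheduler.add_task_trigger(
        name="on publish in Project MLOps",
        schedule_function=on_published,
        trigger_project="Project MLOps",
        trigger_on_status=["published"],
    )
    scheduler.start()
```

Note that the function runs in the scheduler's own process, so anything heavy is better cloned and enqueued as above than executed inline.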
AgitatedDove14 Thank you for the info. I will try it out.
SuccessfulKoala55 So this is the intended behavior? To always have to select the queue from "Advanced configuration" on the pipeline run window even though the "set_default_execution_queue" is set to the "default" queue?
Besides the fact that tasks will always have "k8s_scheduler" as the queue in the info tab so looking back at a task you will not be able to tell to which queue it was assigned.
JuicyFox94 since I have you, the connection issue might be caused by the istio proxy. In order to disable the istio sidecar injection I must add an annotation to the pod.
Unfortunately there does not seem to be any field for that in the values file.
Hi WackyRabbit7. Take a look at https://clear.ml/docs/latest/docs/references/sdk/task#taskget_task
I believe it describes your use case as an example.
What I would like to be able to do is basically get rid of the ".pipelines" project that gets created automatically
Not sure, I have not tried it myself. Give it a go and see how it behaves.
I am trying to run with scale from zero k8s nodes for maximum cost savings. So a node should only be online if clearml actually runs a task. Waiting for the 2 hours timeout when running on expensive gpu instances for example is quite wasteful because the pipeline controller pod will keep the node online.
Alright. I will keep it in mind. Thank you for the confirmation 🙂
actually it does not because the pods logs show .
No problem SmugDolphin23 and thank you. I am really quite stuck with this 😄
I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.
Another parameter for when the task is deleted might also be useful
I do believe triggers should be unique somehow because I find them way too easy to mishandle. Especially if used with schedule_function which is defined in the same script. Updating that function requires deleting the existing trigger task first and recreating it. If not done like this you just end up with 2 trigger tasks with the same name which I assume will respond to the same event(s) but do something slightly different in response. I assume it might work like this...
Thank you for the reply SmugDolphin23
Is there any possible workaround at the moment?
Thank you SmugDolphin23 I'll try it out.
The alternative I can think of is to implement a clearml Monitor
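A rough sketch of what such a monitor could do, written as a plain polling helper rather than on top of the clearml Monitor class: find tasks that still claim to be running but have not reported for a while, and mark them failed instead of waiting for the 2 hours timeout. `is_stale`, `fail_stale_tasks` and the 10 minute threshold are made up for this example. Treating a naive `last_update` as UTC, and needing `force=True` to fail a task owned by another process, are assumptions I have not verified.

```python
from datetime import datetime, timedelta, timezone


def is_stale(status, last_update, now, timeout):
    """True if a task claims to be running but has not reported within `timeout`."""
    return status == "in_progress" and (now - last_update) > timeout


def fail_stale_tasks(project_name, timeout=timedelta(minutes=10)):
    """Mark stale running tasks in a project as failed; returns their IDs."""
    # Imported lazily so is_stale() can be used without clearml installed.
    from clearml import Task

    now = datetime.now(timezone.utc)
    failed = []
    running = Task.get_tasks(
        project_name=project_name,
        task_filter={"status": ["in_progress"]},
    )
    for task in running:
        last_update = task.data.last_update
        if last_update is None:
            continue
        # Assumption: a timestamp without tzinfo is in UTC.
        if last_update.tzinfo is None:
            last_update = last_update.replace(tzinfo=timezone.utc)
        if is_stale("in_progress", last_update, now, timeout):
            task.mark_failed(status_reason="no update from worker pod", force=True)
            failed.append(task.id)
    return failed
```

Run from a small always-on service or a cron job, this would shorten the feedback loop, but a long step that legitimately reports nothing for more than the threshold would be failed too, so the timeout needs some care.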