This means that if something happens to the k8s node the pod runs on,
Actually, if the pod crashed (the pod, not the Task), k8s should re-spin it, no?
I have also seen that if a worker pod running a task is terminated, ClearML does not fail/abort the task.
From the k8s perspective, when the Task ends (failed/completed) the pod always returns exit code 0, i.e. success, because the agent was able to spin the Task. We do not want Tasks that raised an exception to litter k8s with endless retries ...
Does that make sense ?
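In other words, the Task's failure is recorded in its status, but the wrapper process still exits 0 so k8s does not keep restarting the pod. A minimal sketch of that decoupling (hypothetical helper, not the actual agent code):

```python
def run_task(task_fn):
    """Hypothetical agent-style wrapper: run the Task, record its
    outcome in the task status, but always report exit code 0 so the
    orchestrator (k8s) treats the pod as successful and does not retry."""
    status = "completed"
    try:
        task_fn()
    except Exception as exc:
        # The Task failed -- record it in the status, but do NOT
        # propagate a non-zero exit code to the pod.
        status = f"failed: {exc}"
    return status, 0  # exit code is 0 either way


# A Task that raises is marked failed, yet the pod still "succeeds":
status, exit_code = run_task(lambda: 1 / 0)
```

So a failed Task and a completed Task look identical to k8s; the real outcome lives in the ClearML backend.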
Now, for example, the pod was killed because I had to replace the node. The task is stuck in "Running".
There is a "timer" in the backend (default: 2 hours): if a Task is marked running but does not "ping" (i.e. send a keep-alive), it will be set to aborted.
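The watchdog decision described above can be sketched generically (hypothetical logic, not the actual backend code; the 2-hour default is the one mentioned in this thread):

```python
from datetime import datetime, timedelta

# Default non-responsive timeout mentioned in the thread.
NON_RESPONSIVE_TIMEOUT = timedelta(hours=2)


def should_abort(status: str, last_ping: datetime, now: datetime) -> bool:
    """Hypothetical backend watchdog check: a Task that is still marked
    running but has not pinged within the timeout window gets aborted."""
    return status == "in_progress" and (now - last_ping) > NON_RESPONSIVE_TIMEOUT


# A pod killed 3 hours ago stops pinging, so its Task is aborted;
# a Task that pinged 1 hour ago is left alone.
now = datetime(2024, 1, 1, 12, 0)
stale = should_abort("in_progress", now - timedelta(hours=3), now)
fresh = should_abort("in_progress", now - timedelta(hours=1), now)
```

So after a node replacement kills a pod, the Task sits in "Running" until this timeout elapses, then flips to "Aborted".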