Hi @<1798887585121046528:profile|WobblyFrog79> , don't the logs in the task show some sort of error?
I guess when the pod simply crashes or disconnects, the clearml agent won't have a chance to report to the ClearML server: hey, the network is about to be cut ...
You would need some k8s logic to flow that information back to the DS: the node just died for xyz reason ...
@<1523701070390366208:profile|CostlyOstrich36> they don't, as the pod is killed as soon as the process inside exceeds the memory limit.
Logging the pod's exit code and status message before deleting the pod would be very useful. The data scientists would see that an OOM happened and wouldn't have to bother other teams to find out what happened.
I'm not talking about node failure, but rather pod failure, which is out-of-memory in 99% of cases.
@<1576381444509405184:profile|ManiacalLizard2> but the task controller does have access to that information. Before deleting the pod, it could retrieve the exit code and status message that every pod provides, and log them under the "Info" section in ClearML.
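Something along these lines would already be enough (just a rough sketch of the idea, not how the ClearML k8s glue actually does it; `pod_name`, `namespace`, and `task_id` are placeholders):
```python
# Sketch: read the container termination state before the pod is deleted
# and attach it to the ClearML task, e.g. as the task comment ("Info" tab).
from kubernetes import client, config
from clearml import Task

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod_status(name=pod_name, namespace=namespace)
for cs in pod.status.container_statuses or []:
    # termination info lives in state.terminated (or last_state.terminated after a restart)
    term = (cs.state and cs.state.terminated) or (cs.last_state and cs.last_state.terminated)
    if term:
        # an OOM kill typically shows up as exit_code=137, reason="OOMKilled"
        task = Task.get_task(task_id=task_id)
        task.set_comment(
            f"Pod {pod.metadata.name} container {cs.name} terminated: "
            f"exit_code={term.exit_code}, reason={term.reason}, message={term.message}"
        )
```
That way the DS sees "OOMKilled" directly on the task instead of a silent failure.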