Well, if you have any relevant debugging info I would appreciate it, or any hints on how to reproduce 🙂
Well, the agent is supposed to kill the task's process - didn't it?
Yeah, GPU utilization was 100%. I cleaned it up by finding the process with nvidia-smi and killing it. But I was expecting the cleanup to happen automatically since the process failed.
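For anyone hitting the same thing, the manual cleanup described above could be sketched roughly like this. It is only a sketch, not what the agent does internally: it assumes `nvidia-smi` is on the PATH, and the helper names (`parse_compute_apps`, `list_gpu_processes`, `terminate`) are made up for illustration. The `--query-compute-apps` and `--format` flags are standard `nvidia-smi` options.

```python
import os
import signal
import subprocess

# Standard nvidia-smi query flags; everything else here is a hypothetical helper.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' lines into (pid, name, mem) tuples.

    used_memory is kept as a string because nvidia-smi may report a
    non-numeric placeholder on some setups.
    """
    apps = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or unexpected lines
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        pid = pid.strip()
        if not pid.isdigit():
            continue
        apps.append((int(pid), name.strip(), mem.strip()))
    return apps


def list_gpu_processes():
    """Ask nvidia-smi which processes currently hold a compute context."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


def terminate(pid):
    """Send SIGTERM to a leftover process (escalate manually if it ignores it)."""
    os.kill(pid, signal.SIGTERM)


if __name__ == "__main__":
    # Only list the processes; call terminate(pid) yourself after checking
    # that the PID really belongs to the failed task.
    for pid, name, mem in list_gpu_processes():
        print(f"pid={pid} name={name} used_memory_mib={mem}")
```

The script deliberately only lists processes, so nothing gets killed until you have confirmed the PID belongs to the failed task and not to another job on the same worker.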
Hi ObedientToad56, I guess somehow the training code left the GPU resources in an unstable state? Is the worker currently running anything?
No, it didn't kill the process.
Sure, thanks SuccessfulKoala55. Not sure if it is a one-off event. I will try to reproduce it.