Unanswered
Hi,
I'm using ClearML's hosted free SaaS offering.
I'm running model training in PyTorch on a server and pushing metrics to CML. I've noticed that anytime my training job fails due to GPU OOM issues, CML marks the job as
AnxiousSeal95 I just checked, and Hydra returns an exit code of 1 to mark the failure, as does another toy program that just throws an exception. So my guess is that CML is not using the exit code to determine when a task has failed. Are you able to share how CML determines that a task failed? If you could point me to the relevant code files, I'm happy to dive in and figure it out.
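For anyone who wants to reproduce the exit-code check, here is a minimal sketch of one way to do it. It does not touch ClearML or Hydra at all: it only runs a toy program that raises an uncaught exception in a child Python interpreter and reads back the exit code. The names `TOY_PROGRAM` and `run_and_get_exit_code` are made up for illustration.

```python
import subprocess
import sys

# Hypothetical toy program: it raises an uncaught exception,
# similar to the toy program mentioned above.
TOY_PROGRAM = "raise RuntimeError('simulated failure')"


def run_and_get_exit_code(source: str) -> int:
    """Run Python source in a child interpreter and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        stdout=subprocess.DEVNULL,  # hide the child's output
        stderr=subprocess.DEVNULL,  # hide the traceback
    )
    return result.returncode


exit_code = run_and_get_exit_code(TOY_PROGRAM)
print(f"exit code: {exit_code}")
```

A non-zero value here only confirms that the process itself reports the failure; it says nothing about how CML decides on the task status, which is the open question.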
2 years ago
one year ago