Hi Itay! I know JitteryCoyote63 played with it a bit, but I'm not sure what his ultimate conclusion was 🙂
We are now working on adding such a feature to ClearML-Pro (which is soon to be released), so I suggest staying tuned 😄
Hi there, yes I was able to make it work with some glue code:
- Save your model, optimizer, and scheduler every epoch.
- Have a separate thread that periodically pulls the instance metadata and checks whether the instance is marked for stop; if it is, add a custom tag, e.g. TO_RESUME.
- Have a service that periodically pulls failed experiments with the tag TO_RESUME from the queue, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint passed as an extra parameter.
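For anyone wanting a starting point, the metadata-watching thread described above could look roughly like this. It is only a sketch, not the exact glue code: it assumes an AWS spot instance whose IMDSv1 metadata endpoint is reachable (the thread does not say which cloud was used), and `aws_spot_marked_for_stop` and `start_stop_watcher` are hypothetical helper names. The check and the reaction are passed in as callables so they can be swapped for another provider or for the ClearML tagging call.

```python
import threading
import urllib.error
import urllib.request

# Assumption: AWS spot instance with IMDSv1 enabled. This endpoint returns 404
# until the instance is scheduled for interruption.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def aws_spot_marked_for_stop(timeout=2.0):
    """Return True if the metadata endpoint reports a pending interruption."""
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # 404 (nothing scheduled) or endpoint unreachable: treat as "not marked".
        return False


def start_stop_watcher(is_marked_for_stop, on_stop, interval=5.0):
    """Poll is_marked_for_stop() in a daemon thread and call on_stop() once.

    Returns (thread, cancel_event); set the event to stop polling early.
    """
    cancel = threading.Event()

    def _watch():
        while not cancel.is_set():
            if is_marked_for_stop():
                on_stop()
                return
            # Sleep for `interval` seconds, waking up early if cancelled.
            cancel.wait(interval)

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return thread, cancel


# Hypothetical usage inside the training script (requires the clearml package):
#   from clearml import Task
#   task = Task.current_task()
#   start_stop_watcher(aws_spot_marked_for_stop,
#                      lambda: task.add_tags(["TO_RESUME"]))
```

The thread is a daemon so it never blocks the training process from exiting, and `on_stop` fires at most once, so the tag is not added repeatedly.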