Very Cool!
BTW guys, are you using the task.models[]
to continue from the last checkpoint? or is it task.artifacts[]
?
nice! exactly what I need, thank you!
JitteryCoyote63 how do you detect spot interruption is coming from within the http://clear.ml task in time to mark it as “resume”?
Hi DilapidatedDucks58 , I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:
Update the experiment status to stopped (if it is failed, you won’t be able to re-enqueue it) Set a parameter of that task to point to the latest checkpoint and load it (you can also infer it directy: I simply add a tag to the task resume
, and check at runtime if this tag exists, if yes, I fetch the latest checkpoint of the task) Use https://clear.ml/docs/latest/docs/references/sdk/task#set_initial_iteration to prevent the task to overwrite the already logged iterations (ClearML should detect and handle it automatically, but it wasn’t the case for me)