JitteryCoyote63 how do you detect that a spot interruption is coming from within the clear.ml task in time to mark it as “resume”?
Hi DilapidatedDucks58, I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:
- Update the experiment status to stopped (if it is failed, you won’t be able to re-enqueue it).
- Set a parameter of that task to point to the latest checkpoint and load it (you can also infer it directly: I simply add a tag `resume` to the task and check at runtime whether this tag exists; if it does, I fetch the latest checkpoint of the task).
- Use https://clear.ml/docs/latest/docs/references/sdk/task#set_initial_iteration to prevent the task from overwriting the already logged iterations (ClearML should detect and handle it automatically, but that wasn’t the case for me).
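A rough sketch of the runtime side of those steps, in case it helps. The helper name `prepare_resume` is made up for illustration. It assumes checkpoints were registered as output models of the task (so they show up under `task.models["output"]`, with the newest one last) and that the training loop restarts its own step counter from zero, so the offset is last iteration + 1. Neither assumption is confirmed in this thread, so check both against your setup.

```python
def prepare_resume(task, resume_tag="resume"):
    """Return (checkpoint_path, start_iteration) for a possibly resumed task.

    `task` is expected to behave like a clearml.Task. Only these members are
    used: get_tags(), models["output"], get_last_iteration(),
    set_initial_iteration().
    """
    # No "resume" tag: treat this as a fresh run.
    if resume_tag not in (task.get_tags() or []):
        return None, 0

    # Assumption: checkpoints are stored as output models, newest last.
    output_models = task.models["output"]
    if not output_models:
        return None, 0

    checkpoint_path = output_models[-1].get_local_copy()

    # Assumption: the loop counts from 0 again after resuming, so shift
    # reported iterations past the ones that are already logged.
    start_iteration = task.get_last_iteration() + 1
    task.set_initial_iteration(start_iteration)
    return checkpoint_path, start_iteration
```

In a training script this would be called right after `Task.init(...)`, and the returned path loaded with whatever your framework uses. If your loop restores its own step counter from the checkpoint, the offset should probably stay at 0 instead.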
Very Cool!
BTW guys, are you using `task.models[]` to continue from the last checkpoint, or is it `task.artifacts[]`?
nice! exactly what I need, thank you!