Hey guys, I am trying to plan what I need to do in order to efficiently use ClearML with spot instances:
1) Detecting when a spot instance is down and the experiment is aborted
2) Extracting the S3 address of the latest checkpoint from the ClearML API
3) Starting New E
Hi DilapidatedDucks58, I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:
a) Update the experiment status to stopped (if it is failed, you won't be able to re-enqueue it).
b) Set a parameter of that task to point to the latest checkpoint and load it. You can also infer it directly: I simply add a "resume" tag to the task and check at runtime whether this tag exists; if it does, I fetch the latest checkpoint of the task.
c) Use https://clear.ml/docs/latest/docs/references/sdk/task#set_initial_iteration to prevent the task from overwriting the already logged iterations (ClearML should detect and handle this automatically, but it wasn't the case for me).
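In case it helps, here is a rough sketch of how those steps could look with the ClearML SDK. It is not my exact code, and it rests on a few assumptions: the tag is literally named "resume", the last entry in task.models["output"] is the newest checkpoint, the training loop restarts counting iterations from 0 after a resume (hence the offset of last iteration + 1), and mark_stopped(force=True) is enough to move a failed task back to stopped. The function names and the queue name are made up for illustration.

```python
RESUME_TAG = "resume"  # assumed tag name, matching the one mentioned above


def latest_checkpoint_url(task):
    """Return the URL of the last registered output model, or None.

    Assumes the last item in task.models["output"] is the newest checkpoint.
    """
    output_models = task.models["output"]
    return output_models[-1].url if output_models else None


def prepare_resume(task):
    """Check the resume tag and return (checkpoint_url, start_iteration).

    Returns (None, 0) when the tag is absent, i.e. a fresh run.
    """
    if RESUME_TAG not in (task.get_tags() or []):
        return None, 0
    start_iteration = task.get_last_iteration() + 1
    # Offset new reports so they land after the iterations already logged.
    # Assumes the training loop counts from 0 again after resuming.
    task.set_initial_iteration(start_iteration)
    return latest_checkpoint_url(task), start_iteration


def requeue_for_resume(task_id, queue_name):
    """Launcher side: mark the task stopped, tag it, and enqueue it again."""
    # Imported here so the helpers above can be used without clearml installed.
    from clearml import Task

    task = Task.get_task(task_id=task_id)
    task.mark_stopped(force=True)  # assumption: also works on a failed task
    task.add_tags([RESUME_TAG])
    Task.enqueue(task, queue_name=queue_name)
```

Inside the training script you would call prepare_resume(Task.init(...)) right after initialising the task and load the returned checkpoint URL if it is not None. From whatever watches the spot instances you would call requeue_for_resume with the task ID and your own queue name.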