Hey Guys, I Am Trying To Plan What I Need To Do In Order To Efficiently Use Clearml With Spot Instances 1) Detecting When Spot Instance Is Down And Experiment Is Aborted 2) Extracting S3 Address Of The Latest Checkpoint From Clearml Api 3) Starting New E

Answered

Hey guys, I am trying to plan what I need to do in order to use ClearML efficiently with spot instances:

1) detecting when the spot instance goes down and the experiment is aborted
2) extracting the S3 address of the latest checkpoint from the ClearML API
3) starting a new experiment with this address as an argument
4) merging the aborted and the new experiment, so that we can see all graphs and metrics nicely on one page

Steps 1-3 seem more or less straightforward, but what about 4? Does anybody have example code showing how you would go about merging two experiments (aborted and restarted)?
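For steps 2 and 3 of the question, a minimal sketch with the ClearML Python SDK could look like the following. It assumes that checkpoints are registered as output models of the task and that the last entry of `task.models["output"]` is the newest one. The parameter name `Args/checkpoint` and the " (resumed)" name suffix are hypothetical and need to match whatever your training script actually reads.

```python
def was_aborted(task):
    """True if the task is in the 'stopped' state (shown as Aborted in the UI)."""
    return task.get_status() == "stopped"


def latest_checkpoint_url(task):
    """Return the URL of the most recently registered output model, or None.

    Assumption: the last entry in task.models["output"] is the newest checkpoint.
    """
    output_models = task.models["output"]
    return output_models[-1].url if output_models else None


def restart_from(aborted_task_id, queue_name, param_name="Args/checkpoint"):
    """Clone an aborted task and enqueue the clone with the checkpoint URL set.

    param_name is a hypothetical parameter; use the one your script reads.
    """
    # Imported lazily so the helpers above can be used without clearml installed.
    from clearml import Task

    old = Task.get_task(task_id=aborted_task_id)
    if not was_aborted(old):
        return None
    url = latest_checkpoint_url(old)
    new = Task.clone(source_task=old, name=old.name + " (resumed)")
    if url is not None:
        new.set_parameter(param_name, url)
    Task.enqueue(new, queue_name=queue_name)
    return new
```

Note that this creates a second experiment, so the scalars of the aborted and the restarted run end up on two separate pages; that is the gap step 4 asks about.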

  				
Posted 4 years ago by DilapidatedParrot58


Answers 5

Hi DilapidatedDucks58, I did that already, but I reuse the same experiment instead of merging two experiments. Step 4 then becomes:

1) Update the experiment status to stopped (if it is failed, you won’t be able to re-enqueue it).
2) Set a parameter of that task to point to the latest checkpoint and load it. You can also infer it directly: I simply add a "resume" tag to the task and check at runtime whether this tag exists; if it does, I fetch the latest checkpoint of the task.
3) Use https://clear.ml/docs/latest/docs/references/sdk/task#set_initial_iteration to prevent the task from overwriting the already logged iterations (ClearML should detect and handle this automatically, but that wasn’t the case for me).
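The flow described in this answer could be sketched roughly as below. This is not the poster's actual code: the helper names are made up, the `resume` tag name is taken from the answer, and the iteration offset logic reflects one reading of `set_initial_iteration` (the offset is added to every iteration the script reports), so verify it against your own logging before relying on it.

```python
RESUME_TAG = "resume"  # tag name mentioned in the answer


def requeue_for_resume(task_id, queue_name):
    """Controller side: mark the task stopped, tag it, and enqueue it again."""
    # Imported lazily so the runtime helpers below work without clearml installed.
    from clearml import Task

    task = Task.get_task(task_id=task_id)
    task.mark_stopped(force=True)  # a failed task cannot be re-enqueued
    task.add_tags([RESUME_TAG])
    Task.enqueue(task, queue_name=queue_name)
    return task


def resume_state(task):
    """Runtime side: return resume info if the task carries the tag, else None."""
    if RESUME_TAG not in (task.get_tags() or []):
        return None
    output_models = task.models["output"]
    url = output_models[-1].url if output_models else None
    return {"checkpoint_url": url, "last_iteration": task.get_last_iteration()}


def apply_resume(task, trainer_restores_step):
    """Set the iteration offset so already logged iterations are not overwritten.

    Assumption: if the training loop restores its own step counter from the
    checkpoint, no offset is needed; if it restarts counting at 0, new points
    are shifted past the last logged iteration.
    """
    state = resume_state(task)
    if state is None:
        return None
    offset = 0 if trainer_restores_step else state["last_iteration"] + 1
    task.set_initial_iteration(offset)
    return state
```

In the training script, `apply_resume(Task.init(...), trainer_restores_step=...)` would be called right after initialization, and the returned `checkpoint_url` passed to whatever loads the weights.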

  				
Posted 4 years ago by JitteryCoyote63

nice! exactly what I need, thank you!

  				
Posted 4 years ago by DilapidatedParrot58

we use task.models[] 🙂
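As an illustration of the difference, here is a small sketch that prefers `task.models` and only falls back to `task.artifacts`. It assumes checkpoints are either registered as output models or uploaded explicitly as an artifact; the artifact name `checkpoint` is hypothetical.

```python
def latest_checkpoint_path(task, artifact_name="checkpoint"):
    """Download the newest checkpoint and return its local path, or None.

    Output models are tried first (newest assumed to be last in the list);
    the artifact named `artifact_name` (hypothetical) is the fallback.
    """
    output_models = task.models["output"]
    if output_models:
        return output_models[-1].get_local_copy()
    if artifact_name in task.artifacts:
        return task.artifacts[artifact_name].get_local_copy()
    return None
```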

  				
Posted 4 years ago by JitteryCoyote63

Very cool!
BTW guys, are you using task.models[] to continue from the last checkpoint, or is it task.artifacts[]?

  				
Posted 4 years ago by AgitatedDove14

JitteryCoyote63 how do you detect, from within the ClearML task, that a spot interruption is coming, in time to mark the task as “resume”?
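The thread does not record an answer to this. One possible approach, assuming the instances are AWS EC2 spot instances, is to poll the instance metadata endpoint `spot/instance-action` from a background thread: it returns 404 until an interruption is scheduled and a small JSON document once it is, roughly two minutes before the instance is reclaimed. The sketch below uses IMDSv2 tokens; the function names and the polling interval are made up for illustration.

```python
import json
import threading
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"  # EC2 instance metadata service


def parse_instance_action(status, body):
    """Return the interruption notice as a dict, or None if none is scheduled."""
    if status != 200:
        return None
    return json.loads(body)


def poll_spot_notice(timeout=1.0):
    """Query the metadata service once; None if no notice or not on EC2."""
    token_req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        req = urllib.request.Request(
            IMDS + "/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        return parse_instance_action(err.code, "")  # 404: nothing scheduled
    except (urllib.error.URLError, OSError):
        return None  # metadata service unreachable


def watch_and_tag(task, interval=5.0, poll=poll_spot_notice, stop_event=None):
    """Poll until a notice appears, then add the 'resume' tag to the task."""
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        if poll() is not None:
            task.add_tags(["resume"])
            return True
        stop_event.wait(interval)
    return False
```

In a training script this could run as `threading.Thread(target=watch_and_tag, args=(task,), daemon=True).start()` right after the task is created, leaving the remaining notice period for saving a final checkpoint.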

  				
Posted 3 years ago by RoughTiger69
