Hi, Did Anyone Experiment With Running On The Aws Autoscaler On Spots And Knows Whether There Is Configuration For Retry Policy When Spot Get Evacuated Mid-Job?

Answered

Hi, did anyone experiment with running on the aws autoscaler on spots and knows whether there is configuration for retry policy when spot get evacuated mid-job?

  				
Posted 
	3 years ago

					More  		
  Report
		
					WittyCow34
				
					0
					 × 1

Votes Newest

Answers 3

Thank you very much! JitteryCoyote63

  				
Posted 
	3 years ago

					More  		
  Report
		
					WittyCow34
				
					0
					 × 1

Hi there, yes I was able to make it work with some glue code:
Save your model, optimizer, scheduler every epoch Have a separate thread that periodically pulls the instance metadata and check if the instance is marked for stop, in this case, add a custom tag eg. TO_RESUME Have a services that periodically pulls failed experiments from the queue with the tag TO_RESUME, force marking them as stopped instead of failed and reschedule them with as extra-param the last checkpoint

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Hi Itay! I know JitteryCoyote63 played with it a bit, I'm not sure what was his ultimate conclusion 🙂
We are now working adding such feature to ClearML-Pro (That is soon-to-be released), I suggest to stay tuned 😄

  				
Posted 
	3 years ago

					More  		
  Report
		
					AnxiousSeal95
				
					0
					 × 1

Write your answer

2K Views

3 Answers

3 years ago

2 years ago