Hey, I Want To Use The Aws Autoscaler With Spot Instances, And I Was Wondering How (Or If) You Handle Interruptions. What We Currently Implemented Is A Mechanism That On Spot Failure Reruns The Training With A Flag, And Our Code Knows To Search For The La

Answered

Hey, I want to use the AWS autoscaler with spot instances, and I was wondering how (or if) you handle interruptions. What we currently implemented is a mechanism that on spot failure reruns the training with a flag, and our code knows to search for the latest checkpoint and resume from it. But this, of course, is not on ClearML. Do you have any handling, or any way we can connect the two systems?
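For reference, the resume mechanism described above could look roughly like the sketch below. It is only an illustration, not the poster's actual code: the `--resume` flag name, the `checkpoints` directory, and the `ckpt_<step>.pt` file naming are all assumptions.

```python
import argparse
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: checkpoints are saved as "ckpt_<step>.pt".
CKPT_PATTERN = re.compile(r"ckpt_(\d+)\.pt$")


def find_latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if there is none."""
    if not ckpt_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for path in ckpt_dir.iterdir():
        match = CKPT_PATTERN.search(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Hypothetical flag set by whatever reruns the job after a spot failure.
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--ckpt-dir", type=Path, default=Path("checkpoints"))
    return parser.parse_args(argv)


def resolve_start_checkpoint(args: argparse.Namespace) -> Optional[Path]:
    """Only look for a checkpoint when the rerun flag is set."""
    return find_latest_checkpoint(args.ckpt_dir) if args.resume else None
```

The training script would call `resolve_start_checkpoint(parse_args())` at startup and load the returned file if it is not `None`.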

  				
Posted 3 years ago by CleanPigeon16


Answers 3

Yeah, totally. Are there any out-of-the-box (OOB) services like this?

  				
Posted 3 years ago by CleanPigeon16

Hi CleanPigeon16

I was wondering how (or if) you handle interruptions.

Good question. Basically (I might be missing a few details, but this is the general gist):
A new instance is spun up (spot or regular, based on your "compute budget") as long as there is a job in the "monitored" queue. That means that if a worker was kicked out by Amazon (i.e. it was a spot instance), another one will be spun up in its place as long as there is a job in the queue. So what is probably missing in your case is a service that checks whether a Task was aborted and then re-enqueues it to the same queue (which will trigger the autoscaler to spin up a new instance if needed).
Does that make sense?
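
A minimal sketch of such a re-enqueue service is shown below, under a few assumptions. The retry logic is kept separate from ClearML so it can be tested on its own. The ClearML wiring uses `Task.get_tasks` and `Task.enqueue` from the Python SDK and assumes aborted tasks are reported with the status `stopped`. The project name, queue name, polling interval, and retry limit are placeholders. Whether the server accepts an aborted task back into a queue without resetting it first may depend on your version, so check that on your setup.

```python
import time
from typing import Callable, Dict, Iterable, List


def requeue_aborted(
    aborted_task_ids: Iterable[str],
    enqueue: Callable[[str], None],
    retry_counts: Dict[str, int],
    max_retries: int = 3,
) -> List[str]:
    """Re-enqueue each aborted task that still has retries left; return the IDs re-enqueued."""
    requeued = []
    for task_id in aborted_task_ids:
        if retry_counts.get(task_id, 0) >= max_retries:
            continue  # give up so a task that keeps failing does not loop forever
        enqueue(task_id)
        retry_counts[task_id] = retry_counts.get(task_id, 0) + 1
        requeued.append(task_id)
    return requeued


def run_clearml_watcher(project_name: str, queue_name: str, poll_seconds: float = 60.0) -> None:
    """Poll ClearML for aborted tasks and push them back to the monitored queue."""
    from clearml import Task  # imported lazily so requeue_aborted has no dependencies

    retry_counts: Dict[str, int] = {}

    def list_aborted() -> List[str]:
        # Assumption: aborted tasks carry the status "stopped".
        tasks = Task.get_tasks(project_name=project_name, task_filter={"status": ["stopped"]})
        return [t.id for t in tasks]

    def enqueue(task_id: str) -> None:
        Task.enqueue(task_id, queue_name=queue_name)

    while True:
        requeue_aborted(list_aborted(), enqueue, retry_counts)
        time.sleep(poll_seconds)
```

Note that this sketch re-enqueues every aborted task in the project, including ones a user stopped on purpose. A real service would probably need a way to tell those apart, for example a tag on the tasks that should be retried.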

  				
Posted 3 years ago by AgitatedDove14

Are there any services OOB like this?

In the open-source version, I can't recall any, but it would probably be easy to write. The paid tier might have an offering though, not sure 🙂

  				
Posted 3 years ago by AgitatedDove14
