I encountered a weird edge case with the AWS Auto-scaler, wondering if there are any solutions or if this is a known issue. Something as follows happened:

I encountered a weird edge case with the AWS Auto-scaler, wondering if there are any solutions or if this is a known issue.
Something like the following happened:
1. The queue was empty, instance A was discovered as idle, and was spun down.
2. While it was spinning down, it was still marked as an idle worker by ClearML.
3. During this time, a task came up in the queue.
4. Since there was an idle worker, the autoscaler attempted to use it (?) and couldn't proceed.
5. After some minutes, instance A was finally terminated and removed from ClearML's "idle workers" list.
6. The autoscaler then spun up a new instance.
Seems like once the instruction to spin down an instance is given, the worker should no longer be discovered and/or interacted with.
Has anyone encountered this?
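
For illustration, here's a rough sketch of the kind of guard I'd expect (purely hypothetical names, not ClearML's actual autoscaler internals): once a worker is told to spin down, it is dropped from the idle pool and never handed new work, even if the cloud takes minutes to actually terminate it.

    # Hypothetical sketch, not ClearML's real autoscaler code: exclude workers
    # that were already told to spin down from the idle-worker pool.
    import time

    class IdleWorkerPool:
        def __init__(self):
            self._idle = {}            # worker_id -> last time it was seen idle
            self._terminating = set()  # workers already instructed to spin down

        def mark_idle(self, worker_id):
            # Ignore heartbeats from workers that are on their way out.
            if worker_id not in self._terminating:
                self._idle[worker_id] = time.time()

        def spin_down(self, worker_id):
            # From this point on the worker should not be discovered or used,
            # even if the cloud instance takes minutes to actually terminate.
            self._terminating.add(worker_id)
            self._idle.pop(worker_id, None)

        def available_idle_workers(self):
            return [w for w in self._idle if w not in self._terminating]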

  
  
Posted 2 years ago

Answers 6


CostlyOstrich36, I'm not sure what is holding it from spinning down. Unfortunately, I was not around when this happened. Maybe AWS was taking a while to terminate the instance, or maybe it just took a while to register in the autoscaler.

The logs looked like this:

1. Recognizing an idle worker and spinning down.
    2022-09-19 12:27:33,197 - clearml.auto_scaler - INFO - Spin down instance cloud id 'i-058730639c72f91e1'
2. Recognizing a new task is available, but the worker is still idle.
    2022-09-19 12:32:35,698 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
    2022-09-19 12:32:35,816 - clearml.auto_scaler - INFO - idle worker: {'dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1': (1663590436.5344, 'c5n_4xl', <Worker: id=dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1>)}
3. A few minutes later, the task is still queued, the idle worker is still active (we have a budget of 6 AWS instances on this aws queue):
    2022-09-19 12:36:37,860 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
    2022-09-19 12:36:37,973 - clearml.auto_scaler - INFO - idle worker: {'dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1': (1663590436.5344, 'c5n_4xl', <Worker: id=dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1>)}
4. A minute later, the idle worker finally shuts down and disappears from the idle worker list, and a new instance is spun up:
    2022-09-19 12:37:38,389 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
    2022-09-19 12:37:38,506 - clearml.auto_scaler - INFO - Spinning new instance resource='c5n_4xl', prefix='dynamic_worker', queue='aws'
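
For what it's worth, the gap is easy to read off those timestamps - a quick sketch, assuming the log format stays exactly as pasted above:

    # Measure how long the terminating instance kept the task waiting,
    # using the first and last timestamps from the log lines above.
    from datetime import datetime

    fmt = "%Y-%m-%d %H:%M:%S,%f"
    spin_down = datetime.strptime("2022-09-19 12:27:33,197", fmt)
    new_instance = datetime.strptime("2022-09-19 12:37:38,506", fmt)
    print(new_instance - spin_down)  # roughly 0:10:05

So roughly ten minutes passed between the spin-down instruction and the replacement instance.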
  
  
Posted 2 years ago

UnevenDolphin73, that's an interesting case. I'll see if I can reproduce it as well. Can you please clarify step 4 a bit? Also, on step 5 - what is "holding" it from spinning down?

  
  
Posted 2 years ago

The instance that took a while to terminate (or took a while to disappear from the idle workers list).

  
  
Posted 2 years ago

UnevenDolphin73, that seems to be an issue with the instance shutting down; the autoscaler's behaviour seems normal. Can you try to get the system log for the instance? Maybe there will be some clues there...
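
If it happens again while the instance is still around, something along these lines should grab it (a boto3 sketch; the region is a placeholder for whatever your autoscaler uses, and AWS only keeps console output for a short time after termination):

    # Sketch: pull the EC2 system log (console output) for the suspect instance.
    # The region is an assumption; the instance ID is the one from your autoscaler log.
    import base64

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.get_console_output(InstanceId="i-058730639c72f91e1", Latest=True)
    output = resp.get("Output")
    if output:
        # The API returns the log base64-encoded.
        print(base64.b64decode(output).decode("utf-8", errors="replace"))
    else:
        print("<no console output available>")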

  
  
Posted 2 years ago

I cannot, the instance is long gone... But it's no different from any other scaled instance; it seems it just took a while to register in ClearML.

  
  
Posted 2 years ago

You mean the new instance?

  
  
Posted 2 years ago