Hi We Just Got The Aws Autoscaler To Create A New Instance When You Enqueue A Task To The Relevant Queue. However, For Some Reason The Task Itself Is Never Run, It Stays In The Pending State. When Looking At The Worker Details, It Says "No Queues Curren

Answered

Hi, we just got the AWS Autoscaler to create a new instance when a task is enqueued to the relevant queue. However, for some reason the task itself never runs; it stays in the Pending state. When looking at the worker details, it says "No queues currently assigned to this worker" as below. What are we doing wrong? It seems that the worker is successfully created by the Autoscaler, but then it doesn't know which queue to listen to for new tasks.

  				
Posted 2 years ago by StraightSealion9


Answers 9

Sorry, I've been away and am only now able to come back to this. I'll run the scikit-learn job sample on my current setup and let you know if the error comes back.

  				
Posted 2 years ago by StraightSealion9

@StraightSealion9 where do you see this error? Something is off here - is this a task running on an instance spun up by an AutoScaler application?

  				
Posted 2 years ago by SuccessfulKoala55

Hi @SuccessfulKoala55 and @AgitatedDove14, sorry for the delay, I was travelling. Indeed, I see that the instance is not booting; it's sort of stuck there. So I'm going to first find an AMI / instance type that boots via the AWS console and then try again via the AutoScaler. Thanks so much for the tips!

  				
Posted 2 years ago by StraightSealion9

Hi @StraightSealion9

"AWS Autoscaler to create a new instance when you enqueue a task to the relevant queue."

Does that mean that you were able to enqueue a Task and have it launch on the remote EC2 machine?

  				
Posted 2 years ago by AgitatedDove14

@StraightSealion9 it seems to me the AWS instance simply fails to start (and thus never pulls from the queue). Can you share the instance log?

  				
Posted 2 years ago by SuccessfulKoala55

When looking at the worker details, it says "No queues currently assigned to this worker"

Yes, I think we should have better information there. The "AWS service" is not directly pulling jobs from any specific queue, hence nothing is listed there. It is "listening" to the queues and launching machines, and those machines are the ones that listen to the queue and pull tasks. I wonder if it would be easier to also list it as "assigned" to those queues. wdyt?
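
The split described above (the autoscaler watches queues and launches machines, while the launched machines do the actual pulling) can be illustrated with a toy model. This is only a conceptual sketch, not ClearML's implementation: `ToyAutoscaler`, `Worker`, `reconcile`, and the queue name `aws_cpu` are all hypothetical names made up for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Worker:
    """A machine launched by the scaler; it is the one assigned to a queue."""
    instance_id: str
    queue: str


@dataclass
class ToyAutoscaler:
    """Watches queue depth and launches workers; it never runs tasks itself."""
    max_instances: int
    workers: List[Worker] = field(default_factory=list)

    def reconcile(self, pending: Dict[str, int]) -> List[Worker]:
        """Launch one worker per unserved pending task, capped at max_instances."""
        launched: List[Worker] = []
        for queue, depth in pending.items():
            serving = sum(1 for w in self.workers if w.queue == queue)
            for _ in range(max(0, depth - serving)):
                if len(self.workers) >= self.max_instances:
                    return launched  # budget exhausted, tasks stay pending
                worker = Worker(f"i-{len(self.workers):04d}", queue)
                self.workers.append(worker)
                launched.append(worker)
        return launched


# Hypothetical queue name; three tasks are pending but only two machines fit.
scaler = ToyAutoscaler(max_instances=2)
first = scaler.reconcile({"aws_cpu": 3})
second = scaler.reconcile({"aws_cpu": 3})
```

In this model the scaler object itself has no queue assignment, which mirrors why the worker entry for the autoscaler shows none; only the `Worker` records carry a queue.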

  				
Posted 2 years ago by AgitatedDove14

Hi @StraightSealion9, this is really strange - I don't think we ever return 429 from the .clear.ml servers...
@JuicyFox94 do you have any idea how this could happen?
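
HTTP 429 means "Too Many Requests", and the usual client-side response is to retry with an increasing delay. The sketch below shows that general pattern only. It does not describe how the ClearML SDK or servers behave: `UploadError`, `upload_with_backoff`, and `flaky_upload` are hypothetical names, and the upload is simulated so the example runs without a server.

```python
import time
from typing import Callable


class UploadError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed upload."""

    def __init__(self, status: int):
        super().__init__(f"upload failed ({status})")
        self.status = status


def upload_with_backoff(upload: Callable[[], str],
                        retries: int = 4,
                        base_delay: float = 0.5,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Call upload(); on a 429, wait base_delay * 2**attempt and try again."""
    for attempt in range(retries + 1):
        try:
            return upload()
        except UploadError as err:
            # Only rate-limit errors are retried; give up after the last attempt.
            if err.status != 429 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Simulated upload that is rate-limited twice and then succeeds.
attempts = {"n": 0}


def flaky_upload() -> str:
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise UploadError(429)
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = upload_with_backoff(flaky_upload, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` would apply.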

  				
Posted 2 years ago by SuccessfulKoala55

BTW my first test worked great on a CPU instance. Might have been a temporary issue.

  				
Posted 2 years ago by StraightSealion9

So I've gotten a little further by tweaking my VPC setup. I see that the autoscaler spins up new instances without a public IP, so that means specifying a private subnet and having a NAT gateway on the subnet, right? I tried running the scikit-learn joblib sample on a CPU instance, and the experiment definitely ran, but I see another error. Could there be some permissions missing somewhere?

2023-04-03 15:01:00
2023-04-03 13:00:51,087 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/aws-autoscaler/ClearML Autoscaler.f302dff7cab7446f9bfeaf923622a2d0/artifacts/i-017eb35a20bd622b4/i-017eb35a20bd622b4.txt (429): 
2023-04-03 13:00:51,087 - clearml.metrics - WARNING - Failed uploading to (Failed uploading object /_ApplicationInstances/aws-autoscaler/ClearML Autoscaler.f302dff7cab7446f9bfeaf923622a2d0/artifacts/i-017eb35a20bd622b4/i-017eb35a20bd622b4.txt (429): )
2023-04-03 13:00:51,087 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
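
On the private-subnet question: in general, an instance without a public IP needs some outbound path, such as a default route through a NAT gateway, to reach external servers. One way to sanity-check this is to inspect the subnet's route table. The sketch below is a minimal helper under stated assumptions: it takes a dict shaped like one entry of `RouteTables` in boto3's `describe_route_tables` response, the function name `has_nat_default_route` is hypothetical, and the IDs and CIDR blocks are made up.

```python
from typing import Any, Dict


def has_nat_default_route(route_table: Dict[str, Any]) -> bool:
    """Return True if the table has an active 0.0.0.0/0 route via a NAT gateway."""
    for route in route_table.get("Routes", []):
        if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("NatGatewayId")
                and route.get("State", "active") == "active"):
            return True
    return False


# Made-up example tables; real ones would come from
# boto3.client("ec2").describe_route_tables(...)["RouteTables"].
private_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local",
     "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0",
     "NatGatewayId": "nat-0123456789abcdef0", "State": "active"},
]}
isolated_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local",
     "State": "active"},
]}

private_ok = has_nat_default_route(private_rt)
isolated_ok = has_nat_default_route(isolated_rt)
```

A table like `isolated_rt` would leave a private instance unable to reach the server, which would look like a worker that never picks up its task.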

  				
Posted 2 years ago by StraightSealion9


2K Views
