Answered

Hi, we just got the AWS Autoscaler to create a new instance when we enqueue a task to the relevant queue. However, for some reason the task itself is never run; it stays in the Pending state. When looking at the worker details, it says "No queues currently assigned to this worker", as below. What are we doing wrong? It seems that the worker is successfully created by the Autoscaler, but then it doesn't know which queue to listen to for new tasks.

  
  
Posted one year ago

Answers 9


When looking at the worker details, it says "No queues currently assigned to this worker"

Yes, I think we should have better information there. The "AWS service" is not directly pulling jobs from any specific queue, hence nothing is listed there. It is "listening" to queues and launching machines, and those machines are the ones that will be listening to the queue. I wonder if it would be easier to also make sure it is listed as "assigned" to those queues. wdyt?
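To make that split of responsibilities a bit more concrete, here is a toy sketch of the kind of decision a queue-watching scaler makes. This is not the ClearML autoscaler's actual code; the `QueueState` class, the `instances_to_launch` function, and the queue names are all hypothetical and only illustrate the idea that the scaler counts pending tasks and starts machines, while the machines themselves pull the tasks.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of one queue, as seen by a scaler."""
    name: str
    pending_tasks: int  # tasks waiting in the queue
    idle_workers: int   # running instances already listening to this queue


def instances_to_launch(queues, max_instances, running_instances):
    """Return {queue_name: count} of new instances a scaler might start.

    The scaler never executes tasks itself; it only decides how many
    machines to start. Each machine then pulls tasks from its queue.
    """
    # How many more instances we may start before hitting the cap
    budget = max(0, max_instances - running_instances)
    plan = {}
    for queue in queues:
        # Tasks that no idle worker can pick up right away
        shortfall = max(0, queue.pending_tasks - queue.idle_workers)
        launch = min(shortfall, budget)
        if launch:
            plan[queue.name] = launch
            budget -= launch
    return plan


if __name__ == "__main__":
    # Hypothetical queue names and numbers, for illustration only
    state = [QueueState("aws_cpu", 3, 1), QueueState("aws_gpu", 2, 0)]
    print(instances_to_launch(state, max_instances=5, running_instances=2))
```

In this picture the scaler itself has no queue "assigned" in the worker sense, which would be consistent with the empty list shown in the worker details.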

  
  
Posted one year ago

Hi @<1551376687504035840:profile|StraightSealion9>

AWS Autoscaler to create a new instance when you enqueue a task to the relevant queue.

Does that mean that you were able to enqueue a Task and have it launch on the remote EC2 machine?

  
  
Posted one year ago

Hi @<1551376687504035840:profile|StraightSealion9> , this is really strange - I don't think we ever return 429 from the .clear.ml servers...
@<1523701827080556544:profile|JuicyFox94> do you have any idea how this could happen?

  
  
Posted one year ago

@<1551376687504035840:profile|StraightSealion9> where do you see this error? Something is off here - is this a task running on an instance spun up by an AutoScaler application?

  
  
Posted one year ago

Sorry I've been away and just able to come back to this now. I'll run the scikit-learn job sample on my current setup and let you know if it comes back.

  
  
Posted one year ago

Hi @<1523701087100473344:profile|SuccessfulKoala55> and @<1523701205467926528:profile|AgitatedDove14> sorry for the delay, I was travelling. Indeed I see that the instance is not booting, it's sort of stuck there so I'm going to first try and find an AMI / instance type that boots via the AWS console and try again via the AutoScaler. Thanks so much for the tips!

  
  
Posted one year ago

@<1551376687504035840:profile|StraightSealion9> it seems to me the AWS instance simply fails to start (and thus never pulls from the queue). Can you share the instance log?

  
  
Posted one year ago

So I've gotten a little further by tweaking my VPC setup. I see that the autoscaler spins up new instances without a public IP, so that means specifying a private subnet and having a NAT gateway on the subnet, right? I tried running the scikit-learn joblib sample on a CPU instance, and the experiment definitely ran, but I see another error. Could there be some permissions missing somewhere?

2023-04-03 15:01:00
2023-04-03 13:00:51,087 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/aws-autoscaler/ClearML Autoscaler.f302dff7cab7446f9bfeaf923622a2d0/artifacts/i-017eb35a20bd622b4/i-017eb35a20bd622b4.txt (429): 
2023-04-03 13:00:51,087 - clearml.metrics - WARNING - Failed uploading to 
 (Failed uploading object /_ApplicationInstances/aws-autoscaler/ClearML Autoscaler.f302dff7cab7446f9bfeaf923622a2d0/artifacts/i-017eb35a20bd622b4/i-017eb35a20bd622b4.txt (429): )
2023-04-03 13:00:51,087 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
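
For the private subnet / NAT gateway question above, one way to sanity-check the VPC side is to look at the subnet's route table for a default route that goes through a NAT gateway. The sketch below assumes boto3 is installed and AWS credentials are configured; the function names, the subnet ID, and the region in the usage comment are hypothetical placeholders. The check itself only inspects the dictionaries returned by the EC2 `describe_route_tables` call.

```python
def has_nat_default_route(route_tables):
    """Return True if any given route table sends 0.0.0.0/0 via a NAT gateway.

    `route_tables` is the "RouteTables" list from EC2 describe_route_tables.
    """
    for table in route_tables:
        for route in table.get("Routes", []):
            if (
                route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("NatGatewayId")
                and route.get("State", "active") == "active"
            ):
                return True
    return False


def fetch_route_tables(subnet_id, region_name):
    """Fetch route tables explicitly associated with a subnet.

    Note: a subnet with no explicit association uses the VPC's main route
    table, which this filter does not return; an empty result here means
    the main route table has to be checked instead.
    """
    import boto3  # imported lazily so the pure check above works without it

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    return response["RouteTables"]


# Usage (placeholder subnet ID and region, replace with your own):
#   tables = fetch_route_tables("subnet-0123456789abcdef0", "us-east-1")
#   print(has_nat_default_route(tables))
```

If this returns False for the subnet the autoscaler launches into, the instances would have no outbound path to pull tasks or upload artifacts, which is worth ruling out before looking at permissions.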
  
  
Posted one year ago

BTW my first test worked great on a CPU instance. Might have been a temporary issue.

  
  
Posted one year ago