I encountered a weird edge case with the AWS Auto-scaler, wondering if there are any solutions or if this is a known issue. Something as follows happened:

I encountered a weird edge case with the AWS Auto-scaler, wondering if there are any solutions or if this is a known issue.
Something like the following happened:
1. The queue was empty, instance A was discovered as idle, and was spun down.
2. While it was spinning down, it was still marked as an idle worker by ClearML.
3. During this time, a task came up in the queue.
4. Since there was an idle worker, the autoscaler attempted to use it (?) and couldn't proceed.
5. After some minutes, instance A was finally terminated and removed from ClearML's "idle workers" list.
6. The autoscaler then spun up a new instance.
Seems like once the instruction to spin down an instance is given, the worker should no longer be discovered and/or interacted with.
Has anyone encountered this?
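
For illustration, here's a rough sketch of the kind of guard I'd expect (purely hypothetical names, not ClearML's actual autoscaler internals): once a worker is told to spin down, it is dropped from the idle pool and never handed new work, even if the cloud takes minutes to actually terminate it.

    # Hypothetical sketch, not ClearML's real autoscaler code: exclude workers
    # that were already told to spin down from the idle-worker pool.
    import time

    class IdleWorkerPool:
        def __init__(self):
            self._idle = {}            # worker_id -> last time it was seen idle
            self._terminating = set()  # workers already instructed to spin down

        def mark_idle(self, worker_id):
            # Ignore heartbeats from workers that are on their way out.
            if worker_id not in self._terminating:
                self._idle[worker_id] = time.time()

        def spin_down(self, worker_id):
            # From this point on the worker should not be discovered or used,
            # even if the cloud instance takes minutes to actually terminate.
            self._terminating.add(worker_id)
            self._idle.pop(worker_id, None)

        def available_idle_workers(self):
            return [w for w in self._idle if w not in self._terminating]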

  
  
Posted 2 years ago

Answers 6


CostlyOstrich36, I'm not sure what is holding it from spinning down. Unfortunately, I was not around when this happened. Maybe AWS was taking a while to terminate the instance, or maybe it just took a while to register in the autoscaler.

The logs looked like this:

1. Recognizing an idle worker and spinning down.
    2022-09-19 12:27:33,197 - clearml.auto_scaler - INFO - Spin down instance cloud id 'i-058730639c72f91e1'
2. Recognizing a new task is available, but the worker is still idle.
    2022-09-19 12:32:35,698 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
    2022-09-19 12:32:35,816 - clearml.auto_scaler - INFO - idle worker: {'dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1': (1663590436.5344, 'c5n_4xl', <Worker: id=dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1>)}
3. A few minutes later, the task is still queued, the idle worker is still active (we have a budget of 6 AWS instances on this aws queue):
    2022-09-19 12:36:37,860 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
    2022-09-19 12:36:37,973 - clearml.auto_scaler - INFO - idle worker: {'dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1': (1663590436.5344, 'c5n_4xl', <Worker: id=dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1>)}
4. A minute later, the idle worker finally shuts down and disappears from the idle worker list, and a new instance is spun up:
    2022-09-19 12:37:38,389 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
    2022-09-19 12:37:38,506 - clearml.auto_scaler - INFO - Spinning new instance resource='c5n_4xl', prefix='dynamic_worker', queue='aws'
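
For what it's worth, the gap is easy to read off those timestamps - a quick sketch, assuming the log format stays exactly as pasted above:

    # Measure how long the terminating instance kept the task waiting,
    # using the first and last timestamps from the log lines above.
    from datetime import datetime

    fmt = "%Y-%m-%d %H:%M:%S,%f"
    spin_down = datetime.strptime("2022-09-19 12:27:33,197", fmt)
    new_instance = datetime.strptime("2022-09-19 12:37:38,506", fmt)
    print(new_instance - spin_down)  # roughly 0:10:05

So roughly ten minutes passed between the spin-down instruction and the replacement instance.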
  
  
Posted 2 years ago

UnevenDolphin73, that's an interesting case. I'll see if I can reproduce it as well. Can you please clarify step 4 a bit? Also, on step 5 - what is "holding" it from spinning down?

  
  
Posted 2 years ago

The instance that took a while to terminate (or took a while to disappear from the idle workers list).

  
  
Posted 2 years ago

UnevenDolphin73, that seems to be an issue with the instance shutting down; the autoscaler's behaviour seems normal. Can you try to get the system log for the instance? Maybe there will be some clues there...
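
If it happens again while the instance is still around, something along these lines should grab it (a boto3 sketch; the region is a placeholder for whatever your autoscaler uses, and AWS only keeps console output for a short time after termination):

    # Sketch: pull the EC2 system log (console output) for the suspect instance.
    # The region is an assumption; the instance ID is the one from your autoscaler log.
    import base64

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.get_console_output(InstanceId="i-058730639c72f91e1", Latest=True)
    output = resp.get("Output")
    if output:
        # The API returns the log base64-encoded.
        print(base64.b64decode(output).decode("utf-8", errors="replace"))
    else:
        print("<no console output available>")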

  
  
Posted 2 years ago

I cannot, the instance is long gone... But it's no different from any other scaled instance; it seems it just took a while to register in ClearML.

  
  
Posted 2 years ago

You mean the new instance?

  
  
Posted 2 years ago