CostlyOstrich36 I'm not sure what is holding it from spinning down. Unfortunately I was not around when this happened. Maybe it was AWS taking a while to terminate, or maybe it was just taking a while to register in the autoscaler.
The logs looked like this:
1. Recognizing an idle worker and spinning down:
2022-09-19 12:27:33,197 - clearml.auto_scaler - INFO - Spin down instance cloud id 'i-058730639c72f91e1'
2. Recognizing a new task is available, but the worker is still idle:
2022-09-19 12:32:35,698 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
2022-09-19 12:32:35,816 - clearml.auto_scaler - INFO - idle worker: {'dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1': (1663590436.5344, 'c5n_4xl', <Worker: id=dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1>)}
3. A few minutes later, the task is still queued and the idle worker is still active (we have a budget of 6 AWS instances on this aws queue):
2022-09-19 12:36:37,860 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
2022-09-19 12:36:37,973 - clearml.auto_scaler - INFO - idle worker: {'dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1': (1663590436.5344, 'c5n_4xl', <Worker: id=dynamic_worker:c5n_4xl:c5n.4xlarge:i-058730639c72f91e1>)}
4. A minute later, the idle worker finally shuts down and disappears from the idle worker list, and a new instance is spun up:
2022-09-19 12:37:38,389 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'
2022-09-19 12:37:38,506 - clearml.auto_scaler - INFO - Spinning new instance resource='c5n_4xl', prefix='dynamic_worker', queue='aws'
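For anyone who wants to measure this delay from their own autoscaler logs, here is a rough sketch. It assumes the log lines follow the same `timestamp - clearml.auto_scaler - INFO - message` layout as the ones quoted above; the helper names (`parse`, `spin_down_to_spin_up_delay`) are made up for illustration and are not part of ClearML.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the log excerpts quoted in this thread.
LOG_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - clearml\.auto_scaler - INFO - (.*)"
)


def parse(line):
    """Return (timestamp, message) for an autoscaler log line, or None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return timestamp, match.group(2)


def spin_down_to_spin_up_delay(lines):
    """Time between the first 'Spin down instance' message and the next
    'Spinning new instance' message, or None if either is missing."""
    spin_down = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        timestamp, message = parsed
        if spin_down is None and message.startswith("Spin down instance"):
            spin_down = timestamp
        elif spin_down is not None and message.startswith("Spinning new instance"):
            return timestamp - spin_down
    return None


lines = [
    "2022-09-19 12:27:33,197 - clearml.auto_scaler - INFO - Spin down instance cloud id 'i-058730639c72f91e1'",
    "2022-09-19 12:32:35,698 - clearml.auto_scaler - INFO - Found 1 tasks in queue 'aws'",
    "2022-09-19 12:37:38,506 - clearml.auto_scaler - INFO - Spinning new instance resource='c5n_4xl', prefix='dynamic_worker', queue='aws'",
]
delay = spin_down_to_spin_up_delay(lines)
print(delay)
```

On the lines quoted above, this gives a gap of roughly ten minutes between the spin-down request and the replacement instance being started.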
UnevenDolphin73, that's an interesting case. I'll see if I can reproduce it as well. Can you please clarify step 4 a bit? Also, on step 5 - what is "holding" it from spinning down?
The instance that took a while to terminate (or has taken a while to disappear from the idle workers)
UnevenDolphin73 that seems to be an issue with the instance shutting down; the autoscaler's behaviour seems normal. Can you try to get the system log for the instance? Maybe there will be some clues there...
I cannot, the instance is long gone... But it's no different from any of the other scaled instances; it seems it just took a while to register in ClearML