Answered
Hi, I Have A Question Regarding The Aws_Autoscaler: It Usually Takes ~Hours To Get A Gpu Instance Nowadays. I Was Thinking, It Would Be Much More Interesting To Stop The Instances (Clearml-Agents) Instead Of Terminating Them Once They Are Inactive, So Tha

Hi, I have a question regarding the aws_autoscaler:
It usually takes ~hours to get a GPU instance nowadays. I was thinking, it would be much more interesting to stop the instances (clearml-agents) instead of terminating them once they are inactive, so that they could be available immediately when they are needed. I would just need to wait until the instance is restarted. I am fine with paying for the idle storage while the instance is down.
That would mean the following changes to the autoscaler:
- Provide an option MAX_UNACTIVE_WORKERS (integer): the number of instances to keep in an “unactive pool” (stopped instances). The default is 0, meaning all instances are terminated (current behavior).
- When starting a new instance, check whether this number is > 0. If yes, check whether a stopped instance is still available to be started (by matching instance names). If yes, start it instead of requesting a new one.
- Add to the user-data a command that starts clearml-agent on boot (so that if the instance is stopped, the agent starts automatically the next time the instance is started).
- When a clearml-agent is inactive, check whether MAX_UNACTIVE_WORKERS is > 0. If it is and the number of stopped instances is less than MAX_UNACTIVE_WORKERS, stop the instance instead of terminating it.

WDYT? Would that be possible?
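To make the proposed decision logic concrete, here is a minimal sketch of it in plain Python. It is not the actual aws_autoscaler code: `Pool`, `acquire_instance` and `release_idle_instance` are hypothetical names, the pool is an in-memory stand-in, and only `MAX_UNACTIVE_WORKERS` comes from the proposal above. In a real implementation the commented spots would presumably call the boto3 EC2 client (`start_instances`, `stop_instances`, `terminate_instances`).

```python
from dataclasses import dataclass, field
from typing import List

# Option name taken from the proposal; 0 keeps the current terminate-only behavior.
MAX_UNACTIVE_WORKERS = 2


@dataclass
class Pool:
    """Hypothetical in-memory view of the instances managed by the autoscaler."""
    running: List[str] = field(default_factory=list)
    stopped: List[str] = field(default_factory=list)
    _counter: int = 0

    def launch_new(self) -> str:
        # Assumption: real code would request a brand-new EC2 instance here.
        self._counter += 1
        instance_id = f"new-{self._counter}"
        self.running.append(instance_id)
        return instance_id


def acquire_instance(pool: Pool, max_unactive: int) -> str:
    """Start a stopped instance if the pool is enabled and non-empty, else launch a new one."""
    if max_unactive > 0 and pool.stopped:
        instance_id = pool.stopped.pop(0)
        # Assumption: real code would call ec2.start_instances(InstanceIds=[instance_id]).
        pool.running.append(instance_id)
        return instance_id
    return pool.launch_new()


def release_idle_instance(pool: Pool, instance_id: str, max_unactive: int) -> str:
    """Stop an idle instance while the stopped pool has room, otherwise terminate it."""
    pool.running.remove(instance_id)
    if max_unactive > 0 and len(pool.stopped) < max_unactive:
        # Assumption: real code would call ec2.stop_instances(InstanceIds=[instance_id]).
        pool.stopped.append(instance_id)
        return "stopped"
    # Assumption: real code would call ec2.terminate_instances(InstanceIds=[instance_id]).
    return "terminated"
```

With `max_unactive` set to 0, both functions fall through to the existing launch/terminate path, so the default would not change current behavior.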

  
  
Posted 2 years ago

Answers 7


Nevertheless, there might still be some value in that: it would reduce the startup time by skipping the initial setup of the agent and the download of the data to the instance. The gain would not be as large as I described initially, though, if stopped instances are bound to the same capacity limitations as newly launched instances.

  
  
Posted 2 years ago

Yes AnxiousSeal95, a stopped instance means you don’t pay for the instance itself, only for its storage, as described in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html. So AgitatedDove14, increasing the IDLE timeout would still make me pay for the instances while they are idle.

Do you get stopped instances instantly when you ask for them?

Well, that’s a good question. That’s what I observed some time ago, but according to their documentation (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html), we can get an InsufficientInstanceCapacity error regardless of whether we launch a new instance or restart a stopped one, so I am not so sure anymore.

  
  
Posted 2 years ago

Hmm, got it. I think if it's only about spin-up time, then it gets more complicated. The autoscaler needs to know that a stopped instance belongs to it, and these instances will need to be manually cleared by users (or they'll continue to pay for their storage, and it's not clear from the autoscaler app that this is happening). Do you still think that adding this complexity has merits? What I'm afraid of is the hidden cost of stopped instances. (Plus, the autoscaler will need internal bookkeeping to see which stopped machines are available to it, and to make sure they are aligned with its configuration.)

  
  
Posted 2 years ago

No, I agree, it’s probably not worth it.

  
  
Posted 2 years ago

Hadrien, just making sure I get the terminology: a stopped instance means you don't pay for it, only for its storage, right? Or is it up and idling (in which case Martin's suggestion is valid)? Do you get stopped instances instantly when you ask for them?

  
  
Posted 2 years ago

instead of terminating them once they are inactive, so that they could be available immediately when they are needed.

JitteryCoyote63 I think you can increase the IDLE timeout on the autoscaler and achieve the same behavior, no?

  
  
Posted 2 years ago

Sounds like a great feature! Maybe open a GitHub feature request to make it happen 🙂

  
  
Posted 2 years ago