autoscaler terminates the instance
This step should shut down the agent in the normal fashion, causing it to unregister from the server (and thus not remain there).
Additionally, the autoscaler running in clear.ml knows to match instances on the cloud with reports from the server, so it knows whether a specific worker (if it appears in the server report) is actually running or not.
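To illustrate the matching idea, here is a minimal sketch, not the actual autoscaler code. It assumes each worker report can be tied to a cloud instance through a hypothetical `instance_id` field, and that the set of running instance ids comes from the cloud provider.

```python
from typing import Dict, List, Set


def workers_actually_running(
    reported: List[Dict[str, str]], running_instances: Set[str]
) -> List[str]:
    """Return ids of reported workers whose instance is still running in the cloud."""
    return [w["worker_id"] for w in reported if w["instance_id"] in running_instances]


# Hypothetical data: the server still lists both workers,
# but the cloud reports only instance "i-bbb" as running.
reported = [
    {"worker_id": "worker-1", "instance_id": "i-aaa"},
    {"worker_id": "worker-2", "instance_id": "i-bbb"},
]
running = {"i-bbb"}

live = workers_actually_running(reported, running)
print(live)
```

With a check like this, a worker that is still listed by the server but whose instance was terminated is not counted as available.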
Hi @<1523701066867150848:profile|JitteryCoyote63> this can be set by the workers.default_timeout setting in the apiserver.conf file, the default is 600 (seconds)
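As a rough mental model of what such a timeout does (a sketch only, not the server's implementation): the server remembers when each worker last pinged it and stops listing a worker once that ping is older than the timeout. The class and method names below are hypothetical; only the 600 second default comes from the message above.

```python
from typing import Dict, List

DEFAULT_TIMEOUT_SEC = 600  # default quoted above for workers.default_timeout


class WorkerRegistry:
    """Toy model of a server-side worker list driven by pings."""

    def __init__(self, timeout_sec: float = DEFAULT_TIMEOUT_SEC) -> None:
        self.timeout_sec = timeout_sec
        self._last_seen: Dict[str, float] = {}

    def ping(self, worker_id: str, now: float) -> None:
        # A live worker periodically reports that it is still up.
        self._last_seen[worker_id] = now

    def unregister(self, worker_id: str) -> None:
        # A worker that shuts down normally removes itself right away.
        self._last_seen.pop(worker_id, None)

    def listed(self, now: float) -> List[str]:
        # Workers whose last ping is within the timeout are still listed.
        return sorted(
            w for w, seen in self._last_seen.items()
            if now - seen <= self.timeout_sec
        )


registry = WorkerRegistry()
registry.ping("worker-1", now=0)
print(registry.listed(now=599))  # worker-1 is still listed
print(registry.listed(now=601))  # worker-1 has timed out
```

A worker that is killed without unregistering stays listed until the timeout expires, which is the delay discussed in this thread.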
Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins, so when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine: the autoscaler thinks there are already enough agents available, while in reality the agent is down
It's part of the protocol that they ping the server and notify they are still up
Hi @<1571308079511769088:profile|GentleParrot65> , no, this is a server-side setting, so changing it would affect all users
Thank you for your answer.
aws_autoscaler.py works as follows (based on my experiments):
- let’s assume that the instance and the worker are started
- there are no tasks running on the worker for max_idle_time_min
- autoscaler terminates the instance
- worker stops sending updates to app.clear.ml
- worker is still shown on the ui with message “Update Time a few minutes ago”
- autoscaler thinks that this worker is still idle because it’s returned via workers.get_all
- when I enqueue a task in this state, the autoscaler doesn’t start a new instance until the 600 sec interval finishes
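The steps above can be sketched as a small scaling decision. This is a simplified model of the behaviour described, not the code of aws_autoscaler.py: the function name is hypothetical, and the list of reported workers stands in for whatever workers.get_all returns.

```python
from typing import List, Optional


def instances_to_start(queued_tasks: int, reported_workers: List[Optional[str]]) -> int:
    """Decide how many new instances to start.

    reported_workers holds one entry per worker the server reports:
    the id of the task it is running, or None when it looks idle.
    """
    idle = sum(1 for task in reported_workers if task is None)
    return max(0, queued_tasks - idle)


# The instance was terminated, but the server still reports its worker as idle.
stale_report: List[Optional[str]] = [None]
print(instances_to_start(queued_tasks=1, reported_workers=stale_report))

# Once the server has timed the worker out, the report is empty.
fresh_report: List[Optional[str]] = []
print(instances_to_start(queued_tasks=1, reported_workers=fresh_report))
```

As long as the terminated worker is still reported, the queued task appears to be covered by an idle worker, so no new instance is requested.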
Does the app.clear.ml autoscaler work the same way?
Is it possible to see the app.clear.ml autoscaler sources?
So does this mean that there is no workaround for the bug described by H4dr1en when using app.clear.ml?
Thanks @<1523701087100473344:profile|SuccessfulKoala55> ! Are alive workers sending pings to notify the server that they are alive, or does the server infer that they are alive based on the last communication?
I'm not sure it's a bug - the autoscaler running in app.clear.ml has a different implementation allowing you to specify how much time an instance can be idle, and this is unrelated to when the server will unregister a worker
, causing it to unregister from the server (and thus not remain there).
Do you mean that the agent actively notifies the server that it is going down? Or does the server infer that the agent is down after a timeout?
Hmm, you mean how long it takes for the server to time out a registered worker? I'm not sure this is easily configured
@<1523701087100473344:profile|SuccessfulKoala55> Is it possible to change this parameter on app.clear.ml?