Answered
Hi, where can I find the server parameter that controls when the server unregisters an agent after it stops receiving updates? Currently the timeout is quite long (30 minutes), and this prevents the autoscaler from launching a new agent.

  
  
Posted one year ago

Answers 12


@<1523701087100473344:profile|SuccessfulKoala55> Is it possible to change this parameter on app.clear.ml?

  
  
Posted one year ago

autoscaler terminates the instance

This step should shut down the agent in the normal fashion, causing it to unregister from the server (and thus not remain there).
Additionally, the autoscaler running in clear.ml knows to match instances on the cloud with reports from the server, so it knows whether a specific worker (if it appears in the server report) is actually running or not.

  
  
Posted one year ago

I'm not sure it's a bug - the autoscaler running in app.clear.ml has a different implementation allowing you to specify how much time an instance can be idle, and this is unrelated to when the server will unregister a worker

  
  
Posted one year ago

, causing it to unregister from the server (and thus not remain there).

Do you mean that the agent actively notifies the server that it is going down, or that the server infers that the agent is down after a timeout?

  
  
Posted one year ago

It's part of the protocol that they ping the server and notify they are still up

  
  
Posted one year ago

Hi @<1523701066867150848:profile|JitteryCoyote63> this can be set with the workers.default_timeout setting in the apiserver.conf file; the default is 600 (seconds)

  
  
Posted one year ago

Hmm, you mean how long it takes for the server to time out on a registered worker? I'm not sure this is easily configured

  
  
Posted one year ago

Hi @<1571308079511769088:profile|GentleParrot65> , no. This is a server-side setting, so changing it would affect all users

  
  
Posted one year ago

Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins: when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine, because the autoscaler thinks there are already enough agents available while in reality the agent is down

  
  
Posted one year ago

Thanks @<1523701087100473344:profile|SuccessfulKoala55> ! Are alive workers sending pings to notify the server that they are alive, or does the server infer that they are alive based on the last communication?

  
  
Posted one year ago

So does this mean that there is no workaround for the bug described by H4dr1en when using app.clear.ml?

  
  
Posted one year ago

Thank you for your answer.
aws_autoscaler.py works as follows (based on my experiments):

  • let’s assume that the instance and the worker are started
  • there are no tasks running on the worker for max_idle_time_min
  • autoscaler terminates the instance
  • worker stops sending updates to app.clear.ml
  • worker is still shown in the UI with the message “Update Time a few minutes ago”
  • autoscaler thinks that this worker is still idle because it’s returned via workers.get_all
  • when I enqueue a task in this state, autoscaler doesn’t start a new instance until the 600-second interval finishes
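
To make the sequence in the list above concrete, here is a rough Python sketch of the stale-worker window. It is not the ClearML or aws_autoscaler.py implementation: the `Worker` class, `listed_workers`, `should_launch_instance`, and the scale-up rule are hypothetical names and simplifications made up for illustration. The only value taken from this thread is the 600-second timeout.

```python
from dataclasses import dataclass
from typing import List

# Server-side worker timeout quoted in this thread (seconds).
WORKER_TIMEOUT_SEC = 600


@dataclass
class Worker:
    """Hypothetical record of a worker as the server might track it."""
    worker_id: str
    last_report_sec: float  # time of the last update the server received


def listed_workers(workers: List[Worker], now_sec: float,
                   timeout_sec: float = WORKER_TIMEOUT_SEC) -> List[Worker]:
    """Workers the server would still list: last report is within the timeout."""
    return [w for w in workers if now_sec - w.last_report_sec < timeout_sec]


def should_launch_instance(queued_tasks: int, workers: List[Worker],
                           now_sec: float) -> bool:
    """Assumed naive rule: launch only if queued tasks exceed listed workers."""
    return queued_tasks > len(listed_workers(workers, now_sec))


# The instance is terminated at t=0, which is also the worker's last report.
stale = [Worker("aws:worker-1", last_report_sec=0.0)]

print(should_launch_instance(1, stale, now_sec=120))  # False: stale worker still counted
print(should_launch_instance(1, stale, now_sec=601))  # True: timeout has expired
```

Under these assumptions, a task enqueued two minutes after termination waits until the dead worker drops out of the list, which matches what I see.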

Does the app.clear.ml autoscaler work the same way?
Is it possible to see the app.clear.ml autoscaler sources?

  
  
Posted one year ago