Hi, where can I find the server parameter to control when the server is unregistering an agent after not receiving updates? Currently it's quite long (30 mins) and this prevents the autoscaler from launching a new agent

Answered

Hi, where can I find the server parameter that controls when the server unregisters an agent after it stops receiving updates? The current timeout is quite long (30 mins), and this prevents the autoscaler from launching a new agent.

  				
Posted one year ago by JitteryCoyote63


Answers 12

Hi @JitteryCoyote63, this can be set with the workers.default_timeout setting in the apiserver.conf file. The default is 600 (seconds).

  				
Posted one year ago by SuccessfulKoala55

So does this mean that there is no workaround for the bug described by H4dr1en when using app.clear.ml?

  				
Posted one year ago by GentleParrot65

autoscaler terminates the instance

This step should shut down the agent in the normal fashion, causing it to unregister from the server (and thus not remain there).
Additionally, the autoscaler running in clear.ml knows how to match instances on the cloud with reports from the server, so it knows whether a specific worker that appears in the server report is actually running.
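
As a rough illustration of that matching idea (not the actual clear.ml autoscaler code, which is not shown in this thread), a minimal Python sketch could cross-check the workers reported by the server against the instances the cloud provider says are running. The worker dictionaries and their `id` and `instance_id` keys are hypothetical.

```python
def find_live_workers(cloud_instance_ids, reported_workers):
    """Keep only the reported workers whose instance still exists in the cloud.

    cloud_instance_ids: iterable of instance IDs the cloud provider lists as running.
    reported_workers: list of dicts with hypothetical keys "id" and "instance_id".
    """
    running = set(cloud_instance_ids)
    # A worker that the server still lists, but whose instance is gone,
    # is treated as dead even if the server has not timed it out yet.
    return [w for w in reported_workers if w["instance_id"] in running]


if __name__ == "__main__":
    workers = [
        {"id": "worker-1", "instance_id": "i-aaa"},  # instance already terminated
        {"id": "worker-2", "instance_id": "i-bbb"},  # instance still running
    ]
    live = find_live_workers(["i-bbb"], workers)
    print([w["id"] for w in live])
```

With this kind of check, a terminated instance stops counting as an available worker right away, regardless of the server-side timeout.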

  				
Posted one year ago by SuccessfulKoala55

I'm not sure it's a bug. The autoscaler running in app.clear.ml has a different implementation that lets you specify how long an instance can be idle, and this is unrelated to when the server unregisters a worker.

  				
Posted one year ago by SuccessfulKoala55

, causing it to unregister from the server (and thus not remain there).

Do you mean that the agent actively notifies the server that it is going down? Or does the server infer that the agent is down after a timeout?

  				
Posted one year ago by JitteryCoyote63

Thanks @SuccessfulKoala55! Do live workers send pings to notify the server that they are alive, or does the server infer that they are alive based on the last communication?

  				
Posted one year ago by JitteryCoyote63

Hi @GentleParrot65, no: this is a server-side setting, so changing it would affect all users.

  				
Posted one year ago by SuccessfulKoala55

@SuccessfulKoala55 Is it possible to change this parameter on app.clear.ml?

  				
Posted one year ago by GentleParrot65

Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins: when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine, because the autoscaler thinks there are already enough agents available while in reality the agent is down.

  				
Posted one year ago by JitteryCoyote63

Hmm, you mean how long it takes for the server to time out a registered worker? I'm not sure this is easily configured.

  				
Posted one year ago by AgitatedDove14

It's part of the protocol that they ping the server and notify they are still up
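
To make the ping-plus-timeout idea concrete, here is a minimal, self-contained Python sketch of a heartbeat registry. It is only an illustration, not the ClearML server implementation: the `WorkerRegistry` class and its method names are hypothetical, and the 600-second default simply mirrors the value mentioned in this thread.

```python
import time


class WorkerRegistry:
    """Hypothetical in-memory registry that drops workers after a ping timeout."""

    def __init__(self, timeout_sec=600, clock=time.monotonic):
        self.timeout_sec = timeout_sec
        self._clock = clock          # injectable clock, handy for testing
        self._last_seen = {}         # worker_id -> time of the last ping

    def ping(self, worker_id):
        """Called by a worker to report that it is still up."""
        self._last_seen[worker_id] = self._clock()

    def unregister(self, worker_id):
        """Called by a worker that shuts down cleanly."""
        self._last_seen.pop(worker_id, None)

    def active_workers(self):
        """Return the IDs of workers that pinged within the timeout window."""
        now = self._clock()
        expired = [w for w, t in self._last_seen.items()
                   if now - t > self.timeout_sec]
        for w in expired:
            del self._last_seen[w]   # the server "unregisters" silent workers
        return sorted(self._last_seen)
```

In this model, a worker that shuts down normally disappears immediately through `unregister`, while a worker that is killed abruptly stays listed until `timeout_sec` has passed since its last ping.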

  				
Posted one year ago by SuccessfulKoala55

Thank you for your answer.
aws_autoscaler.py works as follows (based on my experiments):

1. Let's assume that the instance and the worker have started.
2. No tasks run on the worker for max_idle_time_min.
3. The autoscaler terminates the instance.
4. The worker stops sending updates to app.clear.ml.
5. The worker is still shown in the UI with the message “Update Time a few minutes ago”.
6. The autoscaler thinks that this worker is still idle because it is returned by workers.get_all.
7. When I enqueue a task in this state, the autoscaler doesn't start a new instance until the 600-second interval has elapsed.

Does the app.clear.ml autoscaler work the same way?
Is it possible to see the app.clear.ml autoscaler sources?
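
One possible way around the behaviour described in these steps, for someone running their own copy of the autoscaler script, is to ignore workers whose last report is older than a short threshold instead of trusting the full worker list. The Python sketch below only illustrates that idea and is not what aws_autoscaler.py does: the worker dictionaries, the `last_report_time` key, and the 120-second threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone


def count_usable_workers(workers, now, stale_after_sec=120):
    """Count workers that reported recently enough to be considered alive."""
    cutoff = now - timedelta(seconds=stale_after_sec)
    return sum(1 for w in workers if w["last_report_time"] >= cutoff)


def should_launch_instance(queued_tasks, workers, now, stale_after_sec=120):
    """Launch a new instance when queued tasks outnumber the usable workers."""
    return queued_tasks > count_usable_workers(workers, now, stale_after_sec)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # The terminated worker is still listed by the server, but its last
    # report is five minutes old, so it no longer counts as available.
    workers = [{"id": "terminated", "last_report_time": now - timedelta(minutes=5)}]
    print(should_launch_instance(queued_tasks=1, workers=workers, now=now))
```

The threshold would need to be comfortably longer than the interval at which live workers report, otherwise healthy workers could be miscounted as dead.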

  				
Posted one year ago by GentleParrot65


929 Views
