Hi all, we have ClearML Server running on a Kube pod, and a GPU server running the clearml-agent, which we use to queue jobs.
For some reason, our Kube pod restarted (we're looking into why), but in the process all jobs on the worke
@<1724960464275771392:profile|DepravedBee82> the agent (and SDK) will wait for quite some time and retry even if the server is not available. The printout you've attached is not something the agent or SDK prints - is this something your code prints? In general, I can easily test that (and just did 🙂) by running an agent with a task and simply disconnecting the network cable - the agent will keep trying for a very long time before giving up (the backoff times keep increasing, and the maximum number of retries for network connectivity is 254 by default).
By the way, the `sdk.network.iteration.max_retries_on_server_error` setting is not actually used by the `clearml` Python package, only by the ClearML Enterprise Python package.
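For anyone curious what the retry behavior described above looks like in general terms, here is a minimal, hypothetical sketch of exponential backoff with a capped retry count (the `retry_with_backoff` helper and its parameters are illustrative, not ClearML's actual implementation):

```python
import time


def retry_with_backoff(func, max_retries=254, base_delay=1.0, max_delay=60.0):
    """Call func() until it succeeds or max_retries attempts are exhausted.

    The delay doubles after each failed attempt (exponential backoff),
    capped at max_delay seconds. Hypothetical sketch only - ClearML's
    real retry logic lives inside its HTTP session layer.
    """
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_retries:
                raise  # give up after the final attempt
            time.sleep(min(delay, max_delay))
            delay *= 2  # back off more aggressively each time


# Example: a call that fails twice before the "server" comes back.
calls = {"n": 0}

def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server unreachable")
    return "ok"

result = retry_with_backoff(flaky_request, base_delay=0)
```

With 254 retries and increasing backoff, an agent can ride out server outages lasting well over an hour before giving up, which matches the behavior described above.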