Hi JitteryCoyote63,
The clearml-server asked the clearml-agent to stop the task because it didn't receive anything from it for a long time?
Seems so - there's a "non-responsive tasks" watchdog on the server in charge of doing exactly that. I assume you're using a self-hosted server?
In any case, the watchdog can be controlled using the services.tasks.non_responsive_tasks_watchdog.threshold_sec server configuration setting (the default is 7200 seconds, i.e. two hours).
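For example, if you're running the standard docker-compose deployment, you could raise the threshold with an environment override on the apiserver container - something along these lines (the exact override mechanism may depend on your server version, so treat this as a sketch):
```
# Hypothetical apiserver environment override - verify against your server version.
# Double underscores map to the configuration path; the value is in seconds (10 hours here).
CLEARML__services__tasks__non_responsive_tasks_watchdog__threshold_sec=36000
```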
Thanks! I will investigate further; I am thinking the AWS instance might have been stuck for some unknown reason (becoming unhealthy)
I assume you’re using a self-hosted server?
Yes
Well, if the task was indeed running, it's strange that it was stopped: each task runs a thread in charge of pinging the server, to make sure the server knows it's still alive. So maybe there was some network issue?
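If it happens again, you can check when the server last heard from the task. A minimal sketch using the clearml Python SDK (the last_update field access is my assumption here, so verify it against your SDK version):
```python
from clearml import Task

# Fetch the task by ID (replace with the ID of the aborted task)
task = Task.get_task(task_id="<your-task-id>")

# Current status as recorded on the server (e.g. "stopped", "failed")
print("status:", task.get_status())

# Last time the server received an update/ping from the task;
# a large gap here before the abort would point at a network or instance issue
print("last update:", task.data.last_update)
```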