Hi guys, last night one of our agents (0.16.1) was disconnected from our trains-server while executing an experiment. I saw that because the experiment it was running had the status Aborted and I could not see the agent in the list of available workers.

Answered

Hi guys,
Last night one of our agents (0.16.1) was disconnected from our trains-server while executing an experiment. I noticed because the experiment it was running had the status Aborted and the agent was missing from the list of available workers. I re-established the connection, and the agent then sent the logs to the server but killed the task.
I can see in the logs, after reconnection of the agent to the server:
2020-11-12 09:00:33 User aborted: stopping task (3)
2020-11-12 09:00:33 020-11-12 09:00:11,203 - trains.Task - ERROR - Action failed <400/110: models.update_for_task/v1.0 (Invalid task status (model can only be updated for tasks in the ['created', 'in_progress'] states): id=..., company=..)> (task=..)
Shouldn't the trains-agent be able to detect that the server is not available, stack the logs locally and, as soon as the server is reachable again, send the logs of the running experiment to the server and continue the experiment instead of killing it?

  				
Posted 4 years ago by JitteryCoyote63


Answers 2

very cool, good to know, thanks SuccessfulKoala55 🙂

  				
Posted 4 years ago by JitteryCoyote63

Hi JitteryCoyote63,
This behavior is actually the result of a cleanup service running inside the Trains Server, called the non-responsive tasks watchdog. This service is meant to clean up any dangling tasks/experiments that were left in an invalid or running state and did not report for a long time (for example, when you run development code and simply abort it in your debugger).
The non-responsive timeout (after which such experiments are deemed non-responsive) is currently set to 2 hours and can easily be changed in the server's configuration. The setting is services.tasks.non_responsive_tasks_watchdog.threshold_sec, so you can add a services.conf configuration file and set the non_responsive_tasks_watchdog.threshold_sec value to any number you wish.
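As a rough illustration, a services.conf override could look like the sketch below. This assumes the usual convention that the leading "services" part of the setting name corresponds to the file name, so the keys inside the file start at "tasks". The value 14400 (4 hours) is only an example, and the location of the file depends on your deployment (typically the server's mounted config folder), so please check both against your own setup.

```
# services.conf (sketch, assuming the file name supplies the "services" prefix)
tasks {
    non_responsive_tasks_watchdog {
        # seconds without any report before a task is deemed non-responsive
        # default described above: 7200 (2 hours); 14400 is an example value
        threshold_sec: 14400
    }
}
```

The server most likely needs a restart before it picks up the new value.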

  				
Posted 4 years ago by SuccessfulKoala55
