Hi, I Started A Trains-Agent (0.15) In Services Mode (Full Command:

Answered

Hi,
I started a trains-agent (0.15) in services mode (full command: trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only ).
It picked a first task, seems to run it in the main process instead of in a docker container (output of ps -aux ) :
user 1612 0.3 2.3 149704 41196 ? Sl Jun09 32:46 /home/user/miniconda3/bin/python /home/user/miniconda3/bin/trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only user 23643 0.0 1.4 304240 25240 ? Sl 09:38 0:00 docker run -t -e TRAINS_WORKER_ID=allegro-trains-machine:cpu:1:service:a445e40b53c5417da1a6489aad616fee -v /tmp/.trains_agent.8p2_nfdr.cfg:/root/trains.conf -v /tmp/train root 28329 0.3 2.4 174204 43376 ? Sl+ 09:40 1:01 python3 -u -m trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee root 28337 0.4 3.0 835412 52808 ? Sl+ 09:40 1:05 /root/.trains/venvs-builds/3.6/bin/python -u controller.pyThe problem is that the agent is not listed anymore as available worker while executing the first task. The only available worker related to the services queue is the one executing the first task: allegro-trains-machine:cpu:1:service:a445e40b53c5417da1a6489aad616fee . Therefore, any other task in the services queue is stuck in pending state!
Also, controller.py is a Task that schedule other tasks (train/test) and wait for their execution (while loop). My guess is that the main process (the trains-agent) is executing the first task in its own process and therefore is stuck in the execution of the first task..

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Votes Newest

Answers 23

shows that the trains-agent is stuck running the first experiment, not

the trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee
is the second trains-agent instance running inside the docker, if the task is aborted, this process should have quit...

Any suggestions on how I can reproduce it?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hmmm that sounds like a good direction to follow, I'll see if I can come up with something as well. Let me know if you have a better handle on the issue...

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

AgitatedDove14 we have switched to a 8 core 16 gb ram machine and haven't faced the issue since. We'll let you know if it happens. But I'm pretty confident it was the size of the machine that caused it (as I mentioned it was a 1 cpu 1.5gb ram machine)

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					ElegantKangaroo44
				
					0
					 × 1

Alright, I had a look in the /tmp/.trains_agent_daemon_outabcdef.txt logs, not many insights from here. For the moment, I simply started a new trains-agent daemon in services mode and I will wait to see what happens.

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Hi JitteryCoyote63 a few implementation details on the services-mode, because I'm not certain I understand the issue.
The docker-agent (running in services mode) will pick a Task from the services queue, then it will setup the docker for it spin it and make sure the Task starts running inside the docker (once it is running inside the docker you will see the service Task registered as additional node in the system, until the Task ends) once that happens the trains-agent will try to fetch the next Task from the services queue (just to be clear there is a time when the service Task is running, and is registered under the main trains-agent node, then it should "pass" to the newly created node, meaning it is now a "standalone" Task)

What exactly is not working? Does the trains-agent run a single service Task only? Does it run a a single service Task and quits ?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I have two controller tasks running in parallel in the trains-agent services queue

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

The weird thing is that the second experiment started immediatly, correctly in a docker container, but failed with User aborted: stopping task (3) at some point (while installing the packages). The error message is suprizing since I did not do anything. And then all following experiments are queued to services queue and stuck there.

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Hmm ElegantKangaroo44 low memory that might explain the behavior
BTW: 1==stop request, 3=Task Aborted/Failed
Which makes sense if it crashed on low memory...

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

To clarify: trains-agent run a single service Task only

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

I will try to isolate the bug, if I can, I will open an issue in trains-agent 🙂

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

maybe it can help 🙂

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

So the controller task finished and now only the second trains-agent services mode process is showing up as registered. So this is definitly something linked to the switching back to the main process.

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

ElegantKangaroo44 I tried to reproduce the "services mode" issue with no success. If it happens again let me know maybe will better understand how it happened (i.e. the "master" trains-agent gets stuck for some reason)

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Probably 6. I think because of some reason, it did not go back to main trains-agent. Nevertheless I am not sure, because a second task could start. It could also be that the second was aborted for some reason while installing task requirements (not system requirements, so executing the trains-agent setup within the docker container) and therefore again it couldn't go back to main trains-agent. But ps -aux shows that the trains-agent is stuck running the first experiment, not the second one, which doesnt make really sense (although I am not sure how much we can rely on that)

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

but I'm pretty confident it was the size of the machine that caused it (as I mentioned it was a 1 cpu 1.5gb ram machine)

I have the feeling you are right 🙂

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

JitteryCoyote63 could you send the log maybe ?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

The task with id a445e40b53c5417da1a6489aad616fee is not aborted and is still running

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

I killed both trains-agent and restarted one to have a clean start. This way it correctly spin up docker containers for services tasks. So probably the bug comes when a bug occurs while setting up a task, it cannot go back to the main task. I would need to do some tests to validate that hypothesis though

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

From the top
trains-agent pulls a service Task Task marked as running- trains-agent worker points to the Task Docker is spinned up environment is installed inside docker (results are shown in the service Task Log) trains-agent inside the docker is launched and a new node appears in the system <host_agent_name>:service:<task_id> and the Task service is listed as running on it main trains-agent is back to idle and its worker now has no experiment listed as running
Where do you think it breaks?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

of which task/process?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

seems to run properly now

Are you saying the problem disappeared ?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

AgitatedDove14 Do we know what the User aborted: stopping task (3) means? It's different than when you actually abort a task yourself: User aborted: stopping task (1) .

I think that the problem happened because the VM we were using for the service queue workers was quite small (1 cpu, 1.5 gb ram), and the error message above might point to that.
We switched to a bigger one and will let you know if that was the problem.

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					ElegantKangaroo44
				
					0
					 × 1

-> seems to run properly now

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Write your answer

2K Views

23 Answers

5 years ago

2 years ago