Answered
Hi, I Started A Trains-Agent (0.15) In Services Mode

Hi,
I started a trains-agent (0.15) in services mode (full command: trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only).
It picked up a first task, but it seems to run it in the main process instead of in a docker container (output of ps -aux):
user  1612   0.3  2.3  149704  41196 ?  Sl   Jun09  32:46  /home/user/miniconda3/bin/python /home/user/miniconda3/bin/trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
user  23643  0.0  1.4  304240  25240 ?  Sl   09:38   0:00  docker run -t -e TRAINS_WORKER_ID=allegro-trains-machine:cpu:1:service:a445e40b53c5417da1a6489aad616fee -v /tmp/.trains_agent.8p2_nfdr.cfg:/root/trains.conf -v /tmp/train
root  28329  0.3  2.4  174204  43376 ?  Sl+  09:40   1:01  python3 -u -m trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee
root  28337  0.4  3.0  835412  52808 ?  Sl+  09:40   1:05  /root/.trains/venvs-builds/3.6/bin/python -u controller.py
The problem is that the agent is no longer listed as an available worker while executing the first task. The only available worker related to the services queue is the one executing the first task: allegro-trains-machine:cpu:1:service:a445e40b53c5417da1a6489aad616fee. Therefore, any other task in the services queue is stuck in a pending state!
Also, controller.py is a Task that schedules other tasks (train/test) and waits for their execution (while loop). My guess is that the main process (the trains-agent) is executing the first task in its own process and is therefore stuck in the execution of that first task.
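For context, a minimal sketch of what such a controller Task could look like (the project/task names and queue name below are hypothetical, and it assumes the trains SDK's Task.clone / Task.enqueue / get_status calls):

# controller.py - hedged sketch of a scheduling Task; names and queue are placeholders
import time
from trains import Task

controller = Task.init(project_name='examples', task_name='controller')

# Clone a template experiment and enqueue it for a regular (non-services) worker
template = Task.get_task(project_name='examples', task_name='train template')  # hypothetical template task
child = Task.clone(source_task=template, name='train run')
Task.enqueue(child, queue_name='default')

# Wait for the child to finish - this while loop is what keeps the controller Task running
while child.get_status() not in ('completed', 'failed', 'stopped'):
    time.sleep(30)
    child.reload()

print('child finished with status:', child.get_status())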

  
  
Posted 4 years ago

Answers 23


AgitatedDove14 Do we know what User aborted: stopping task (3) means? It's different from when you actually abort a task yourself: User aborted: stopping task (1).

I think the problem happened because the VM we were using for the services queue workers was quite small (1 CPU, 1.5 GB RAM), and the error message above might point to that.
We switched to a bigger one and will let you know if that was the problem.

  
  
Posted 4 years ago

I killed both trains-agent processes and restarted one to have a clean start. This way it correctly spins up docker containers for services tasks. So the bug probably appears when an error occurs while setting up a task: the agent cannot go back to the main trains-agent. I would need to do some tests to validate that hypothesis though.

  
  
Posted 4 years ago

The task with id a445e40b53c5417da1a6489aad616fee is not aborted and is still running.

  
  
Posted 4 years ago

Hi JitteryCoyote63, a few implementation details on services-mode, because I'm not certain I understand the issue.
The docker-agent (running in services mode) will pick a Task from the services queue, set up the docker for it, spin it up, and make sure the Task starts running inside the docker. Once it is running inside the docker, you will see the service Task registered as an additional node in the system, until the Task ends. Once that happens, the trains-agent will try to fetch the next Task from the services queue. (Just to be clear, there is a time when the service Task is running and is registered under the main trains-agent node; then it should "pass" to the newly created node, meaning it is now a "standalone" Task.)
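If it helps to see that hand-off, here is a small sketch for listing the currently registered workers through the server API (it assumes trains' APIClient; exact response field names may differ between versions):

# Hedged sketch: list registered workers to check whether the service Task shows up
# as its own <host>:service:<task_id> node while the main agent goes back to idle.
from trains.backend_api.session.client import APIClient

client = APIClient()
for worker in client.workers.get_all():
    # worker.id is e.g. "allegro-trains-machine:cpu:1" or "...:cpu:1:service:<task_id>"
    running = getattr(worker, 'task', None)  # field name assumed
    print(worker.id, '->', running.id if running else 'idle')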

What exactly is not working? Does the trains-agent run a single service Task only? Does it run a single service Task and quit?

  
  
Posted 4 years ago

of which task/process?

  
  
Posted 4 years ago

So the controller task finished and now only the second trains-agent services-mode process is showing up as registered. So this is definitely something linked to switching back to the main process.

  
  
Posted 4 years ago

I will try to isolate the bug, if I can, I will open an issue in trains-agent 🙂

  
  
Posted 4 years ago

seems to run properly now

Are you saying the problem disappeared?

  
  
Posted 4 years ago

-> seems to run properly now

  
  
Posted 4 years ago

The weird thing is that the second experiment started immediately, correctly in a docker container, but failed with User aborted: stopping task (3) at some point (while installing the packages). The error message is surprising since I did not do anything. And then all following experiments are queued to the services queue and stuck there.

  
  
Posted 4 years ago

Alright, I had a look at the /tmp/.trains_agent_daemon_outabcdef.txt logs, but there are not many insights there. For the moment, I simply started a new trains-agent daemon in services mode and I will wait to see what happens.

  
  
Posted 4 years ago

ElegantKangaroo44 I tried to reproduce the "services mode" issue with no success. If it happens again, let me know; maybe we will better understand how it happened (i.e. why the "master" trains-agent gets stuck).

  
  
Posted 4 years ago

From the top:
1. trains-agent pulls a service Task
2. The Task is marked as running, and the trains-agent worker points to the Task
3. Docker is spun up
4. The environment is installed inside the docker (results are shown in the service Task log)
5. trains-agent inside the docker is launched, a new node appears in the system as <host_agent_name>:service:<task_id>, and the service Task is listed as running on it
6. The main trains-agent is back to idle and its worker now has no experiment listed as running
Where do you think it breaks?

  
  
Posted 4 years ago

AgitatedDove14 we have switched to an 8-core, 16 GB RAM machine and haven't faced the issue since. We'll let you know if it happens again. But I'm pretty confident it was the size of the machine that caused it (as I mentioned, it was a 1 CPU, 1.5 GB RAM machine).

  
  
Posted 4 years ago

but I'm pretty confident it was the size of the machine that caused it (as I mentioned it was a 1 cpu 1.5gb ram machine)

I have the feeling you are right 🙂

  
  
Posted 4 years ago

To clarify: the trains-agent runs a single service Task only.

  
  
Posted 4 years ago

Probably step 6. I think that for some reason it did not go back to the main trains-agent. Nevertheless I am not sure, because a second task could start. It could also be that the second task was aborted for some reason while installing the task requirements (not the system requirements, i.e. during the trains-agent setup within the docker container) and therefore, again, it couldn't go back to the main trains-agent. But ps -aux shows that the trains-agent is stuck running the first experiment, not the second one, which doesn't really make sense (although I am not sure how much we can rely on that).

  
  
Posted 4 years ago

shows that the trains-agent is stuck running the first experiment, not

The trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee process is the second trains-agent instance running inside the docker; if the task is aborted, this process should have quit...

Any suggestions on how I can reproduce it?

  
  
Posted 4 years ago

I have two controller tasks running in parallel in the trains-agent services queue

  
  
Posted 4 years ago

Hmmm that sounds like a good direction to follow, I'll see if I can come up with something as well. Let me know if you have a better handle on the issue...

  
  
Posted 4 years ago

Hmm ElegantKangaroo44, low memory might explain the behavior.
BTW: 1 == stop request, 3 == Task aborted/failed.
Which makes sense if it crashed on low memory...
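If you want to check the stop reason recorded on a specific task, a quick sketch (the task ID below is just the one mentioned earlier in this thread; it assumes the trains SDK's Task.get_task and the backend status_message field):

# Hedged sketch: inspect why a task was stopped (task ID is an example from this thread).
from trains import Task

task = Task.get_task(task_id='a445e40b53c5417da1a6489aad616fee')
print('status:', task.get_status())
# The backend task object carries the last status message,
# e.g. "User aborted: stopping task (3)" (field name assumed).
print('status message:', task.data.status_message)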

  
  
Posted 4 years ago

JitteryCoyote63 could you maybe send the log?

  
  
Posted 4 years ago

maybe it can help 🙂

  
  
Posted 4 years ago