Alright, I had a look at the /tmp/.trains_agent_daemon_outabcdef.txt logs; not many insights there. For the moment, I simply started a new trains-agent daemon in services mode and will wait to see what happens.
I killed both trains-agents and restarted one to have a clean start. This way it correctly spins up docker containers for services tasks. So the bug probably occurs when an error happens while setting up a task and the agent cannot go back to the main process. I would need to run some tests to validate that hypothesis though.
AgitatedDove14 we have switched to an 8 core 16 GB RAM machine and haven't faced the issue since. We'll let you know if it happens. But I'm pretty confident it was the size of the machine that caused it (as I mentioned it was a 1 CPU 1.5 GB RAM machine)
Hmmm that sounds like a good direction to follow, I'll see if I can come up with something as well. Let me know if you have a better handle on the issue...
I will try to isolate the bug, if I can, I will open an issue in trains-agent 🙂
JitteryCoyote63 could you send the log maybe ?
From the top
1. trains-agent pulls a service Task
2. Task is marked as running; the trains-agent worker points to the Task
3. Docker is spun up
4. The environment is installed inside the docker (results are shown in the service Task log)
5. The trains-agent inside the docker is launched, a new node appears in the system as <host_agent_name>:service:<task_id>, and the service Task is listed as running on it
6. The main trains-agent is back to idle and its worker now has no experiment listed as running
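For reference, a services-mode daemon of this kind is typically started with something like the following (the exact flags and queue name are an assumption from memory of trains-agent's services mode; check `trains-agent daemon --help` for your version):

```shell
# Start a trains-agent in services mode: it pulls from the "services" queue
# and spins each service Task up inside its own docker container.
trains-agent daemon --services-mode --queue services --docker
```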
Where do you think it breaks?
seems to run properly now
Are you saying the problem disappeared ?
Probably 6. For some reason, it did not go back to the main trains-agent. I am not sure though, because a second task could still start. It could also be that the second task was aborted for some reason while installing the task requirements (not the system requirements, so during the trains-agent setup within the docker container), and therefore again it couldn't go back to the main trains-agent. But ps -aux
shows that the trains-agent is stuck running the first experiment, not the second one, which doesn't really make sense (although I am not sure how much we can rely on that)
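As a concrete sketch of that `ps` check (the PIDs and command layout below are made up for illustration; only the task id comes from this thread):

```shell
# Hypothetical, trimmed `ps aux` snapshot showing the two agent processes:
# the main daemon and the per-task "execute" process inside the docker.
snapshot='root   101  0.1  trains-agent daemon --queue services --services-mode
root   202  0.3  python -m trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee'

# Whichever --id shows up under "trains_agent execute" is the experiment
# the agent is (still) busy with:
echo "$snapshot" | grep 'trains_agent execute' | grep -o -- '--id [a-f0-9]*'
# -> --id a445e40b53c5417da1a6489aad616fee
```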
The task with id a445e40b53c5417da1a6489aad616fee
is not aborted and is still running
So the controller task finished and now only the second trains-agent services-mode process shows up as registered. So this is definitely linked to switching back to the main process.
To clarify: the trains-agent runs a single service Task only
Hmm ElegantKangaroo44 low memory that might explain the behavior
BTW: 1==stop request, 3=Task Aborted/Failed
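Those two codes can be wrapped in a tiny helper for readability (the mapping is exactly the one stated above; the function name is made up):

```shell
# Map the numeric status in "User aborted: stopping task (N)" to a label.
# 1 == stop request (user pressed abort), 3 == Task aborted/failed.
stop_reason() {
  case "$1" in
    1) echo "stop request (user abort)" ;;
    3) echo "task aborted/failed" ;;
    *) echo "unknown code: $1" ;;
  esac
}

stop_reason 3   # -> task aborted/failed
```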
Which makes sense if it crashed on low memory...
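One way to check the low-memory theory: when the kernel OOM killer terminates a process it logs which one it killed, so grepping the kernel log would show whether the agent (or the task) was the victim. A sketch, with the log line below being made up for the demo:

```shell
# Filter kernel-log lines that indicate the OOM killer fired.
oom_lines() {
  grep -i -E 'out of memory|killed process'
}

# On a live machine you would pipe the kernel log through it, e.g.:
#   dmesg -T | oom_lines      (or: journalctl -k | oom_lines)
# Demo against a captured (hypothetical) log line:
printf 'Out of memory: Killed process 1234 (trains-agent)\n' | oom_lines
# -> Out of memory: Killed process 1234 (trains-agent)
```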
shows that the trains-agent is stuck running the first experiment, not
the trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee
is the second trains-agent instance running inside the docker; if the task is aborted, this process should have quit...
Any suggestions on how I can reproduce it?
AgitatedDove14 Do we know what the User aborted: stopping task (3)
means? It's different from when you actually abort a task yourself: User aborted: stopping task (1).
I think that the problem happened because the VM we were using for the services queue workers was quite small (1 CPU, 1.5 GB RAM), and the error message above might point to that.
We switched to a bigger one and will let you know if that was the problem.
I have two controller tasks running in parallel in the trains-agent services queue
ElegantKangaroo44 I tried to reproduce the "services mode" issue with no success. If it happens again, let me know; maybe we will better understand how it happened (i.e. the "master" trains-agent gets stuck for some reason)
The weird thing is that the second experiment started immediately, correctly in a docker container, but failed with User aborted: stopping task (3)
at some point (while installing the packages). The error message is surprising since I did not do anything. And then all following experiments queued to the services queue get stuck there.
but I'm pretty confident it was the size of the machine that caused it (as I mentioned it was a 1 cpu 1.5gb ram machine)
I have the feeling you are right 🙂
Hi JitteryCoyote63 a few implementation details on the services-mode, because I'm not certain I understand the issue.
The docker-agent (running in services mode) will pick a Task from the services queue, then set up the docker for it, spin it up, and make sure the Task starts running inside the docker (once it is running inside the docker you will see the service Task registered as an additional node in the system, until the Task ends). Once that happens, the trains-agent will try to fetch the next Task from the services queue. (Just to be clear, there is a moment when the service Task is running and is registered under the main trains-agent node; then it should "pass" to the newly created node, meaning it is now a "standalone" Task.)
What exactly is not working? Does the trains-agent run a single service Task only? Does it run a single service Task and then quit?