Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Oh, I found it:
user@trains-agent-1: ps -ax
5199 ? Sl 29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ? Sl 30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
When an experiment on trains-agent-1 has finished, the UI randomly shows either no experiment or the long experiment; when two experiments are running, it randomly shows one of the two
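For anyone hitting the same thing, here is a minimal sketch of how that check could be automated. It only parses `ps -ax` text like the output above; the function name `find_agent_daemons` and the regular expression are illustrative assumptions, not part of trains-agent.

```python
import re

# Matches a `ps -ax` line whose command runs the trains_agent module as a daemon.
AGENT_PATTERN = re.compile(r"python3? -m trains_agent\b.*\bdaemon\b")


def find_agent_daemons(ps_output):
    """Return (pid, queue) for every trains_agent daemon line in `ps -ax` text."""
    daemons = []
    for line in ps_output.splitlines():
        if not AGENT_PATTERN.search(line):
            continue
        fields = line.split()
        pid = int(fields[0])  # the first column of `ps -ax` is the PID
        queue = None
        # The queue name is the token right after --queue, when there is one.
        if "--queue" in fields[:-1]:
            queue = fields[fields.index("--queue") + 1]
        daemons.append((pid, queue))
    return daemons


# Sample copied from the `ps -ax` output in this thread.
SAMPLE = """\
5199 ? Sl 29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ? Sl 30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
"""

daemons = find_agent_daemons(SAMPLE)
if len(daemons) > 1:
    print("more than one agent daemon on this machine:", daemons)
```

On a live machine the text could come from `subprocess.run(["ps", "-ax"], capture_output=True, text=True).stdout` instead of the hard-coded sample.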
Hi JitteryCoyote63
I think what happens is that the agents are registered under the same name (id). How many agents do you see in the "Workers" tab?
and what are their names ?
worker:0 worker:1 etc ?
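To illustrate why a shared name would look like this in the UI, here is a toy model. It is purely an assumption for explanation: `WorkerRegistry` is a hypothetical stand-in and not how the trains server is actually implemented. The idea is that if status reports are stored per worker ID, two agents using the same ID overwrite each other, so whichever reported last is what you see.

```python
class WorkerRegistry:
    """Toy stand-in for a worker table that keeps one row per worker ID."""

    def __init__(self):
        self.rows = {}

    def report(self, worker_id, experiment):
        # A status report overwrites whatever was stored under the same ID.
        self.rows[worker_id] = experiment

    def shown_in_ui(self, worker_id):
        return self.rows.get(worker_id)


registry = WorkerRegistry()

# Two agent processes on one machine, both registered as "trains-agent-1".
registry.report("trains-agent-1", "long-experiment")  # process A is busy
registry.report("trains-agent-1", None)               # process B is idle
print(registry.shown_in_ui("trains-agent-1"))  # None: B's idle report hides A's task

registry.report("trains-agent-1", "long-experiment")  # process A reports again
print(registry.shown_in_ui("trains-agent-1"))  # long-experiment
```

With distinct worker IDs, each process would get its own row and the display would stop flipping.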
I see what I described in https://allegroai-trains.slack.com/archives/CTK20V944/p1598522409118300?thread_ts=1598521225.117200&cid=CTK20V944 :
randomly, one of the two experiments is shown for that agent
So what you are saying is the workers randomly report on one another's experiments ?
This is consistent: Each time I send a new task on the default queue, if trains-agent-1 has only one task running (the long one), it will pick another one. If I add one more experiment in the queue at that point (trains-agent-1 running two experiments at the same time), that experiment will stay in queue (trains-agent-2 and trains-agent-3 will not pick it because they also are running experiments)
Is it because I did not specify --gpu 0
that the agent, by default pulls one experiment per available GPU?
Yes, that seems to be the case. That said they should have different worker IDs agent-0 and agent-1 ...
What's your trains-agent version ?
So two possible cases for trains-agent-1: either:
It picks a new experiment -> the "workers" tab randomly shows one of the two experiments
There is no new experiment in the default queue to start -> the tab randomly shows no experiment or the one that it is running
one of the two experiments for the worker that is running both experiments
So this is the actual bug ? I need some more info on that, what exactly are you seeing?
So it looks like the agent, from time to time thinks it is not running an experiment
(If you are running the trains-agent twice with the exact same command, I think both will get the same worker_id, in which case you will end up with something similar to what you describe)
To solve it add TRAINS_WORKER_NAME="new_unique_name" trains-agent ...
I think we resolve it automatically, but based on your description it looks like we use the same worker name/id multiple times ...
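A minimal sketch of that workaround, assuming you launch the agents from a script. `TRAINS_WORKER_NAME` and the command line are the ones from this thread; the helper `agent_launch_spec` and the `<base>:<instance>:<index>` naming scheme are hypothetical choices for illustration only.

```python
import os


def agent_launch_spec(instance_id, process_index, base_name="trains-agent", base_env=None):
    """Build the environment and command for one agent with its own worker name."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumed naming scheme: <base>:<instance id>:<index of the agent on that machine>.
    env["TRAINS_WORKER_NAME"] = f"{base_name}:{instance_id}:{process_index}"
    # Same command line as the one used in this thread.
    cmd = [
        "python3", "-m", "trains_agent",
        "--config-file", os.path.expanduser("~/trains.conf"),
        "daemon", "--queue", "default", "--log-level", "DEBUG", "--detached",
    ]
    return env, cmd


env, cmd = agent_launch_spec("instance-1", 0, base_env={})
print(env["TRAINS_WORKER_NAME"])
# To actually start the agent: subprocess.Popen(cmd, env=env)
```

Giving each process on a machine a different `process_index` keeps the names distinct even when the instance ID is the same.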
JitteryCoyote63 any chance the trains-agent-1
is running in services mode ?
Which means it will spin more than a single experiment at once
This is how I start the agent that is running the two experiments in parallel:
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached
Instead you can do: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Then the worker ID will have the running instance appended to the worker name. This means that even if you use the same $DYNAMIC_INSTANCE_ID twice, you will not have two agents registering under the same name.
JitteryCoyote63
Picks a new experiment on top of the long one running
This is very very strange. Is the long running experiment being logged (i.e. do you still see console output in the UI)?
no, one worker (trains-agent-1) "forgets from time to time" the experiment it is currently running and picks another experiment on top of it
Some more context: the second experiment finished and now, in the UI, in the workers&queues tab, I randomly see
trains-agent-1 | - | - | - |
... (refresh page)
trains-agent-1 | long-experiment | 12h | 72000 |
And so in the UI, in workers&queues tab, I see randomly one of the two experiments for the worker that is running both experiments
the latest version, but I think it's normal: I set TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID, where DYNAMIC_INSTANCE_ID is the ID of the machine
that is odd..
So if you have 3 agents, how many concurrent experiments are they running? (actually running, not registered as running)
trains-agent-1: runs an experiment for a long time (>12h). Picks a new experiment on top of the long one running
trains-agent-2: runs only one experiment at a time, normal
trains-agent-3: runs only one experiment at a time, normal
In total: 4 experiments running for 3 agents
by mistake I had two agents started on one machine