Hi, I Have An Agent That Is Running Two Experiments At The Same Time: One That Was Running For A Long Time (11H) And One That The Agent Picked Up Afterwards, While The First One Was Still Running. Context: I Have 3 Agents Up (Not In Docker Mode) And All O

Answered

Hi, I have an agent that is running two experiments at the same time: one that has been running for a long time (11h) and one that the agent picked up afterwards, while the first one was still running.
Context: I have 3 agents up (not in docker mode) and all of them were busy (each running an experiment). All 3 agents have 2 GPUs, but I only use one. That shouldn't happen, right?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1


Answers 27

yes

  				
JitteryCoyote63

So there are two possible cases for trains-agent-1:
It picks a new experiment -> the "workers" tab randomly shows one of the two experiments.
There is no new experiment in the default queue to start -> the tab randomly shows either no experiment or the one that it is running.

  				
JitteryCoyote63

So it looks like the agent, from time to time, thinks it is not running an experiment.

  				
JitteryCoyote63

(If you are running the trains-agent with the exact same command, I think you will get the same worker_id, in which case you will end up with something similar to what you describe.)
To solve it, add a unique worker name: TRAINS_WORKER_NAME="new_unique_name" trains-agent ...
I think we resolve it automatically, but based on your description it looks like the same worker name/id is being used multiple times ...

  				
AgitatedDove14

When an experiment on trains-agent-1 finishes, I randomly see either no experiment or the long experiment; when two experiments are running, I randomly see one of the two.

  				
JitteryCoyote63

By mistake, I have two agents started on one machine.

  				
JitteryCoyote63

The latest version, but I think it's normal: I set TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID, where DYNAMIC_INSTANCE_ID is the ID of the machine.

  				
JitteryCoyote63

Instead, you can do: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Then the worker ID will have the running instance appended to the worker name. This means that even if you use the same $DYNAMIC_INSTANCE_ID twice, you will not have two agents registering under the same name.
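
A minimal sketch of this suggestion, assuming a POSIX shell. The helper names `worker_name` and `start_agent` and the fallback instance ID are hypothetical; only the `TRAINS_WORKER_NAME` variable and the daemon command line are taken from this thread.

```shell
#!/bin/sh
# Hypothetical fallback value; in practice DYNAMIC_INSTANCE_ID is the machine's ID.
DYNAMIC_INSTANCE_ID="${DYNAMIC_INSTANCE_ID:-example-instance}"

# Build the worker name from a fixed prefix and the instance ID.
worker_name() {
  printf 'trains-agent:%s' "$1"
}

# Start one agent daemon under that worker name (defined here, not invoked).
start_agent() {
  TRAINS_WORKER_NAME="$(worker_name "$DYNAMIC_INSTANCE_ID")" \
    python3 -m trains_agent --config-file ~/trains.conf daemon \
    --queue default --log-level DEBUG --detached
}

echo "worker name: $(worker_name "$DYNAMIC_INSTANCE_ID")"
```

Calling `start_agent` would launch the daemon with the name set only for that process, so nothing is exported into the rest of the shell session.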

  				
AgitatedDove14

Oh, I found it:
user@trains-agent-1: ps -ax
5199 ? Sl 29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ? Sl 30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
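
A rough way to catch this situation earlier is to count the daemon lines in the process listing. The sketch below assumes a POSIX shell with `grep`; the function names `count_agent_daemons` and `check_single_agent` are hypothetical, and the pattern is based only on the command line shown in this thread.

```shell
#!/bin/sh
# Count trains_agent daemon lines in a `ps -ax` style listing read from stdin.
# The [t] bracket keeps a live grep process from matching its own command line.
count_agent_daemons() {
  grep -c '[t]rains_agent .*daemon' || true
}

# Warn when more than one daemon appears in the listing on stdin.
check_single_agent() {
  n="$(count_agent_daemons)"
  if [ "$n" -gt 1 ]; then
    echo "warning: $n trains_agent daemons found"
    return 1
  fi
  echo "ok: $n trains_agent daemon(s) found"
}

# Usage on a live machine: ps -ax | check_single_agent
```

This only reports what is running on the local machine; it does not tell you which worker name each daemon registered under.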

  				
JitteryCoyote63

one of the two experiments for the worker that is running both experiments

So this is the actual bug? I need some more info on that. What exactly are you seeing?

  				
AgitatedDove14

I see 3 agents in the "Workers" tab

  				
JitteryCoyote63

This is how I start the agent that is running the two experiments in parallel:
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached

  				
JitteryCoyote63

That is odd...
So if you have 3 agents, how many concurrent experiments are they running? (Actually running, not registered as running.)

  				
AgitatedDove14

Some more context: the second experiment finished and now, in the UI, in the workers&queues tab, I randomly see:
trains-agent-1 | - | - | - |
... (refresh page)
trains-agent-1 | long-experiment | 12h | 72000 |

  				
JitteryCoyote63

JitteryCoyote63

Picks a new experiment on top of the long one running

This is very very strange. Is the long running experiment being logged (i.e. do you still see console output in the UI)?

  				
AgitatedDove14

No, one worker (trains-agent-1) "forgets from time to time" the experiment it is currently running and picks up another experiment on top of it.

  				
JitteryCoyote63

Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID

  				
JitteryCoyote63

Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?

  				
JitteryCoyote63

trains-agent-1: runs an experiment for a long time (>12h). Picks a new experiment on top of the long one running
trains-agent-2: runs only one experiment at a time, normal
trains-agent-3: runs only one experiment at a time, normal
In total: 4 experiments running for 3 agents

  				
JitteryCoyote63

And so in the UI, in workers&queues tab, I see randomly one of the two experiments for the worker that is running both experiments

  				
JitteryCoyote63

This is consistent: Each time I send a new task on the default queue, if trains-agent-1 has only one task running (the long one), it will pick another one. If I add one more experiment in the queue at that point (trains-agent-1 running two experiments at the same time), that experiment will stay in queue (trains-agent-2 and trains-agent-3 will not pick it because they also are running experiments)

  				
JitteryCoyote63

Yes, that seems to be the case. That said, they should have different worker IDs (agent-0 and agent-1) ...
What's your trains-agent version?

  				
AgitatedDove14

And what are their names?
worker:0, worker:1, etc.?

  				
AgitatedDove14

So what you are saying is that the workers randomly report on one another's experiments?

  				
AgitatedDove14

Hi JitteryCoyote63
I think what happens is that the agents are registered under the same name (id). How many agents do you see in the "Workers" tab?

  				
AgitatedDove14

I see what I described in https://allegroai-trains.slack.com/archives/CTK20V944/p1598522409118300?thread_ts=1598521225.117200&cid=CTK20V944 :
randomly, one of the two experiments is shown for that agent

  				
JitteryCoyote63

JitteryCoyote63, any chance trains-agent-1 is running in services mode?
That would mean it will spin up more than a single experiment at once.

  				
AgitatedDove14
