Answered

Hi, I have an agent that is running two experiments at the same time: one that was running for a long time (11h) and one that the agent picked up afterwards, while the first one was still running.
Context: I have 3 agents up (not in docker mode) and all of them were busy (running an experiment). All 3 agents have 2 GPUs, but I only use one. That shouldn't happen, right?

  
  
Posted 4 years ago

Answers 27


Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID

  
  
Posted 4 years ago

I see 3 agents in the "Workers" tab

  
  
Posted 4 years ago

Oh, I found it:
user@trains-agent-1: ps -ax
5199 ? Sl 29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ? Sl 30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
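A rough way to spot this situation automatically is to scan `ps -ax` output for identical `trains_agent` daemon command lines. The sketch below is only an illustration: the function name `find_duplicate_agents` is hypothetical, and it assumes the default `ps -ax` column layout (PID, TTY, STAT, TIME, COMMAND) shown above.

```python
def find_duplicate_agents(ps_output: str) -> dict:
    """Map each trains_agent daemon command line seen more than once to its PIDs.

    Assumes the default `ps -ax` layout: PID TTY STAT TIME COMMAND.
    """
    pids_by_command = {}
    for line in ps_output.splitlines():
        # Split into at most 5 fields so the full command stays in one piece.
        fields = line.strip().split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip the header row and malformed lines
        pid, command = int(fields[0]), fields[4]
        if "trains_agent" in command and " daemon" in command:
            pids_by_command.setdefault(command, []).append(pid)
    # Keep only command lines that were started more than once.
    return {cmd: pids for cmd, pids in pids_by_command.items() if len(pids) > 1}
```

Feeding it the output of `subprocess.run(["ps", "-ax"], capture_output=True, text=True).stdout` on the agent machine would return a non-empty dict when the same daemon command was launched twice.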

  
  
Posted 4 years ago

When an experiment on trains-agent-1 is finished, I randomly see either no experiment or the long experiment. When two experiments are running, I randomly see one of the two.

  
  
Posted 4 years ago

Hi JitteryCoyote63
I think what is happening is that the agents are registered under the same name (ID). How many agents do you see in the "Workers" tab?

  
  
Posted 4 years ago

and what are their names ?
worker:0 worker:1 etc ?

  
  
Posted 4 years ago

yes

  
  
Posted 4 years ago

I see what I described in https://allegroai-trains.slack.com/archives/CTK20V944/p1598522409118300?thread_ts=1598521225.117200&cid=CTK20V944 :
randomly, one of the two experiments is shown for that agent

  
  
Posted 4 years ago

So what you are saying is the workers randomly report on one another's experiments ?

  
  
Posted 4 years ago

This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick up another one. If I add one more experiment to the queue at that point (with trains-agent-1 running two experiments at the same time), that experiment will stay in the queue (trains-agent-2 and trains-agent-3 will not pick it up because they are also running experiments).

  
  
Posted 4 years ago

Is it because I did not specify --gpu 0 that the agent, by default pulls one experiment per available GPU?

  
  
Posted 4 years ago

Yes, that seems to be the case. That said, they should have different worker IDs (agent-0 and agent-1) ...
What's your trains-agent version ?

  
  
Posted 4 years ago

So there are two possible cases for trains-agent-1:
It picks a new experiment -> the "Workers" tab randomly shows one of the two experiments
There is no new experiment in the default queue to start -> the tab randomly shows no experiment or the one that it is running

  
  
Posted 4 years ago

one of the two experiments for the worker that is running both experiments

So this is the actual bug ? I need some more info on that, what exactly are you seeing?

  
  
Posted 4 years ago

So it looks like the agent, from time to time, thinks it is not running an experiment

  
  
Posted 4 years ago

(If you are running the trains-agent with the exact same command, I think you will get the same worker_id, in which case you will end up with something similar to what you describe)
To solve it add TRAINS_WORKER_NAME="new_unique_name" trains-agent ...
I think we resolve it automatically, but based on your description it looks like we use the same worker name/id multiple times ...
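As an illustration of that workaround, the sketch below builds a distinct TRAINS_WORKER_NAME for each agent process before launching it. The helper name `agent_launch_spec`, the `process_index` suffix, and the exact name format are assumptions made for this example, not part of trains-agent; the command line is the one quoted elsewhere in this thread.

```python
import os


def agent_launch_spec(instance_id, process_index, base_env=None):
    """Return (env, cmd) for one agent process with a unique worker name.

    The "trains-agent:<instance>:<index>" format is an assumption chosen
    so that two agents on the same machine never share a name.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TRAINS_WORKER_NAME"] = f"trains-agent:{instance_id}:{process_index}"
    cmd = [
        "python3", "-m", "trains_agent",
        "--config-file", os.path.expanduser("~/trains.conf"),
        "daemon", "--queue", "default", "--detached",
    ]
    return env, cmd
```

The returned pair could then be passed to `subprocess.Popen(cmd, env=env)`; nothing is started by the helper itself.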

  
  
Posted 4 years ago

JitteryCoyote63 any chance the trains-agent-1 is running in services mode ?
Which means it will spin more than a single experiment at once

  
  
Posted 4 years ago

This is how I start the agent that is running the two experiments in parallel:
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached

  
  
Posted 4 years ago

Instead you can do: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Then the worker ID will have the running instance appended to the worker name. This means that even if you use the same $DYNAMIC_INSTANCE_ID twice, you will not have two agents registering under the same name.

  
  
Posted 4 years ago

JitteryCoyote63

Picks a new experiment on top of the long one running

This is very very strange. Is the long running experiment being logged (i.e. do you still see console output in the UI)?

  
  
Posted 4 years ago

No, one worker (trains-agent-1) "forgets from time to time" the experiment it is currently running and picks up another experiment on top of it

  
  
Posted 4 years ago

Some more context: the second experiment finished and now, in the UI, in the workers&queues tab, I randomly see either:
trains-agent-1 | - | - | - |
... (refresh page)
trains-agent-1 | long-experiment | 12h | 72000 |

  
  
Posted 4 years ago

And so in the UI, in workers&queues tab, I see randomly one of the two experiments for the worker that is running both experiments

  
  
Posted 4 years ago

The latest version, but I think it's normal: I set TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID, where DYNAMIC_INSTANCE_ID is the ID of the machine

  
  
Posted 4 years ago

that is odd..
So if you have 3 agents, how many concurrent experiments are they running? (actually running, not registered as running)

  
  
Posted 4 years ago

trains-agent-1: runs an experiment for a long time (>12h). Picks a new experiment on top of the long one running
trains-agent-2: runs only one experiment at a time, normal
trains-agent-3: runs only one experiment at a time, normal
In total: 4 experiments running for 3 agents

  
  
Posted 4 years ago

By mistake, I had two agents started on one machine

  
  
Posted 4 years ago