Can I assume that if we have two agents spinning up the same experiment, your code will take it from there?
Is this true?
Basically I think I'm asking, is your code multi-node enabled to begin with?
Not for now, I think it can only run on multiple GPUs on one node
Okay, the first step is to make sure your code is multi-node enabled; there is no magic for that 🙂
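For readers following along, a minimal sketch of what "multi-node enabled" usually means for a PyTorch script, using torch.distributed's env:// initialization; the backend choice and everything around it depend on your actual code:

```python
import torch.distributed as dist

def init_distributed():
    # With init_method="env://", torch.distributed reads MASTER_ADDR,
    # MASTER_PORT, RANK and WORLD_SIZE from environment variables,
    # so the same script can be launched on several machines at once.
    dist.init_process_group(backend="nccl", init_method="env://")
    return dist.get_rank(), dist.get_world_size()

if __name__ == "__main__":
    rank, world_size = init_distributed()
    print(f"process {rank} of {world_size} is up")
    # from here: wrap the model in torch.nn.parallel.DistributedDataParallel,
    # use a DistributedSampler for the data loader, etc.
```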
I'll get back to you after I get this done
I've added multi-node support to my code, and I found our lab seems to only have shared user files, because I installed trains on one node but it doesn't appear on the others
but the thing is that I can only use the master to log everything
PompousHawk82 what do you mean by this?
(sure, we can try; conda is sometimes flaky but it is supported)
1. specify conda as the package manager (see the config sketch after these steps): https://github.com/allegroai/trains-agent/blob/9a3f950ac689c50ba3415c42749a4bd8059e89a7/docs/trains.conf#L49
2. make sure trains-agent is installed on both nodes
3. assuming you already have an experiment in the system, right click on the experiment and clone it. Then press on the ID button next to the experiment name, and copy the task ID
4. ssh to each node and run: trains-agent execute --id <paste_task_id_here>
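For step 1, the change in trains.conf (the file linked above) would look roughly like this, with conda replacing the default pip value:

```
agent {
    package_manager: {
        # options: pip, conda
        type: conda,
    }
}
```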
Let's see how that goes 🙂
And do you need to run your code inside a docker, or is venv enough?
Do you mean there is no shared filesystem among the different machines?
Yes, let's assume we have a task with id aabbcc
On two different machines you can do the following: trains-agent execute --docker --id aabbcc
This means you manually spin up two simultaneous copies of the same experiment; once they are up and running, will your code be able to make the connection between them? (i.e. OpenMPI, torch.distributed, etc.?)
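To make the question concrete, a purely illustrative sketch of how the two copies could find each other with torch.distributed; the address, port and rank values are hypothetical, and how each copy receives them (environment variables, a task parameter you edit per node, etc.) is up to your setup:

```python
import os
import torch.distributed as dist

# Hypothetical rendezvous settings; node 0 acts as the "master" copy.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")  # reachable IP of the master node
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("WORLD_SIZE", "2")          # two copies of the experiment
os.environ.setdefault("RANK", "0")                # 0 on the first node, 1 on the second

dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} joined a group of {dist.get_world_size()}")
```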
never done this before, let me do a quick search
I'm just curious about how the trains server and the different nodes communicate about the task queue
Yeah, I'm done with the test, now I can run it as you said
We all use conda, I guess there's no need for docker
It's shared, but only user files, i.e. everything under the ~/ directory
This is assuming you can just run two copies of your code, and they will become aware of one another.
So that means your home folder is always mapped to ~/ on any machine you ssh to?
We start manually: we tell the agent to just execute the task (notice we never enqueued it); if all goes well, we will get to the multi-node part 🙂
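(For the queue-based multi-node part later on, the usual pattern would be to run an agent daemon on each node, e.g. trains-agent daemon --queue default, and then enqueue the cloned task into that queue; the queue name here is just an example.)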
I see, so now we are trying to let the agents spin up the experiment separately and see if they can communicate with each other, right?