and I found our lab seems to only have shared user files, because I installed trains on one node but it doesn’t appear on the others
Do you mean there is no shared filesystem among the different machines?
Yes, let's assume we have a task with id aabbcc
On two different machines you can do the following: trains-agent execute --docker --id aabbcc
This means you manually spin up two simultaneous copies of the same experiment. Once they are up and running, will your code be able to make the connection between them (i.e. OpenMPI, torch distributed, etc.)?
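For reference, a minimal sketch of what that connection usually looks like with torch.distributed, assuming the env:// rendezvous; the MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE values are placeholders you would export on each node yourself, trains-agent does not set them in this manual flow:
```python
import torch.distributed as dist

def init_multi_node():
    # Each copy of the experiment reads its coordinates from the environment,
    # e.g. node A: RANK=0, node B: RANK=1, WORLD_SIZE=2,
    # MASTER_ADDR=<ip of node A>, MASTER_PORT=29500 (all placeholder values).
    dist.init_process_group(backend="gloo", init_method="env://")
    return dist.get_rank(), dist.get_world_size()
```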
I’ve added multi-node support to my code, and I found our lab seems to only have shared user files, because I installed trains on one node but it doesn’t appear on the others
Yeah, I’m done with the test, now I can run it as you said
PompousHawk82 what do you mean by:
but the thing is that I can only use the master to log everything
I think it can only run on multiple GPUs on one node
Okay, the first step is to make sure your code is multi-node enabled; there is no magic for that 🙂
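(Just so we mean the same thing by "only master logs", here is a rough sketch of that pattern; the project/task names are made up and this is not necessarily how your code is wired:)
```python
import torch.distributed as dist
from trains import Task

def setup_reporting():
    # Assumes the process group was already initialized (see the earlier sketch).
    # Only the master (rank 0) talks to the trains server; the other ranks
    # train silently. "multi-node-test" / "worker-0" are placeholder names.
    if dist.get_rank() == 0:
        task = Task.init(project_name="multi-node-test", task_name="worker-0")
        return task.get_logger()
    return None

# inside the training loop, something like:
#   if logger is not None:
#       logger.report_scalar("loss", "train", value=loss.item(), iteration=step)
```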
I’m just curious about how the trains servers on the different nodes communicate about the task queue
We start manually: we tell the agent to just execute the task (notice we never enqueued it). If all goes well, we will get to the multi-node part 🙂
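To answer the queue question directly: the queue is stored on the trains-server and every agent just polls that same server, so the nodes never need to talk to each other about it. A rough sketch of the queued flow we are skipping today:
```python
from trains import Task

# The queued alternative to the manual `trains-agent execute --id aabbcc` above:
task = Task.get_task(task_id="aabbcc")      # the example task id from above
Task.enqueue(task, queue_name="default")    # the queue is stored server-side

# An agent started on any node with `trains-agent daemon --queue default`
# will pull the task from that queue and execute it.
```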
This is assuming you can just run two copies of your code, and they will become aware of one another.
And do you need to run your code inside docker, or is a venv enough?
I’ll get back to you after I get this done
Not for now, I think it can only run on multiple GPUs on one node
never done this before, let me do a quick search
So that means your home folder is always mapped to ~/ on any machine you ssh to?
I see, so now we are trying to let the agents spin up the experiment separately and see if the copies can communicate with each other, right?
Basically I think I'm asking, is your code multi-node enabled to begin with?
it’s shared, but only the user files, i.e. everything under the ~/ directory
(sure, we can try, conda is sometimes flaky but it is supported)
1. specify conda as the package manager: https://github.com/allegroai/trains-agent/blob/9a3f950ac689c50ba3415c42749a4bd8059e89a7/docs/trains.conf#L49
2. make sure trains-agent is installed on both nodes
3. assuming you already have an experiment in the system, right-click on the experiment and clone it. Then press the ID button next to the experiment name and copy the task ID (or see the sketch after these steps for doing this from code)
4. ssh to each node and run: trains-agent execute --id <paste_task_id_here>
Let's see how that goes 🙂
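(If clicking through the UI is a pain, step 3 can also be done from Python; the project and experiment names below are placeholders:)
```python
from trains import Task

# Step 3, programmatically: clone an existing experiment and grab the new task ID.
source = Task.get_task(project_name="my_project", task_name="my_experiment")
cloned = Task.clone(source_task=source, name=source.name + " (multi-node test)")
print("task id:", cloned.id)

# Step 4 stays the same on each node:
#   trains-agent execute --id <the id printed above>
```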
we all use conda, so I guess there’s no need for docker
Can I assume that if we have two agents spinning the same experiment, your code will take it from there?
Is this true?