and i found that our lab seems to only share user files, because i installed trains on one node but it doesn’t appear on the others
Do you mean there is no shared filesystem among the different machines ?
And do you need to run your code inside a docker, or is venv enough ?
we all use conda, so i guess there's no need for docker
i think it can only run on multiple GPUs on one node
Okay, the first step is to make sure your code is multi-node enabled, there is no magic for that 🙂
PompousHawk82 what do you mean by ?
but the thing is that i can only use master to log everything
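(A minimal sketch of the "only master logs" pattern mentioned here, not anyone's actual code. It assumes each process gets its rank from a `RANK` environment variable, which is the convention used by torch.distributed launchers; the helper names `get_rank` and `make_logger` are hypothetical.)

```python
import logging
import os


def get_rank(environ=os.environ):
    """Return this process's global rank; assume rank 0 when RANK is unset."""
    return int(environ.get("RANK", "0"))


def make_logger(rank, name="train"):
    """Build a logger that only emits records on rank 0 (the master)."""
    logger = logging.getLogger(f"{name}.rank{rank}")
    logger.handlers.clear()
    logger.propagate = False  # keep records away from the root logger
    if rank == 0:
        logger.addHandler(logging.StreamHandler())
        logger.setLevel(logging.INFO)
    else:
        # Non-master ranks: raise the threshold above CRITICAL so nothing passes.
        logger.addHandler(logging.NullHandler())
        logger.setLevel(logging.CRITICAL + 1)
    return logger


if __name__ == "__main__":
    log = make_logger(get_rank())
    log.info("epoch finished")  # emitted only by the master process
```

With this gating every node runs the same script, and only the process with rank 0 produces log output.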
This is assuming you can just run two copies of your code, and they will become aware of one another.
I’ll get back to you after i get this done
i’m just curious how the trains server on different nodes communicates about the task queue
We start manually: we tell the agent to just execute the task (notice we never enqueued it). If all goes well, we will get to the multi-node part 🙂
Basically I think I'm asking, is your code multi-node enabled to begin with ?
i see, now we are trying to let the agent pop up the experiment separately and see if they can communicate with each other, right?
never done this before, let me do a quick search
So that means your home folder is always mapped to ~/ on any machine you ssh to ?
Yes, let's assume we have a task with id aabbcc
On two different machines you can do the following: trains-agent execute --docker --id aabbcc
This means you manually spin up two simultaneous copies of the same experiment. Once they are up and running, will your code be able to make the connection between them (e.g. openmpi, torch distributed, etc.)?
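(To make "your code makes the connection" concrete, here is a minimal sketch assuming the torch.distributed environment-variable convention: `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`. The helper `read_dist_config` and the fallback defaults are assumptions for illustration. Each copy of the experiment would need these variables set differently per node before it starts.)

```python
import os
from typing import Mapping, NamedTuple


class DistConfig(NamedTuple):
    rank: int          # index of this process among all copies
    world_size: int    # total number of copies expected
    init_method: str   # rendezvous URL every copy connects to


def read_dist_config(environ: Mapping[str, str] = os.environ) -> DistConfig:
    """Read rendezvous settings from the environment (single-process defaults)."""
    rank = int(environ.get("RANK", "0"))
    world_size = int(environ.get("WORLD_SIZE", "1"))
    addr = environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(environ.get("MASTER_PORT", "29500"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return DistConfig(rank, world_size, f"tcp://{addr}:{port}")


if __name__ == "__main__":
    cfg = read_dist_config()
    print(cfg)
    # With PyTorch, these values would typically be passed on as:
    #   torch.distributed.init_process_group(
    #       backend="nccl", init_method=cfg.init_method,
    #       rank=cfg.rank, world_size=cfg.world_size)
```

If the two copies started by `trains-agent execute` see the same `MASTER_ADDR`/`MASTER_PORT` and distinct `RANK` values, they have what they need to find each other.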
(sure, we can try, conda is sometimes flaky but is supported)
1. specify conda as the package manager: https://github.com/allegroai/trains-agent/blob/9a3f950ac689c50ba3415c42749a4bd8059e89a7/docs/trains.conf#L49
2. make sure trains-agent is installed on both nodes
3. assuming you already have an experiment in the system, right click on the experiment and clone it. Then press on the ID button next to the experiment name, and copy the task ID
4. ssh to each node and run: trains-agent execute --id <paste_task_id_here>
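(Step 4 can also be scripted. This is a rough sketch, assuming passwordless ssh and that `trains-agent` is on the PATH in a non-interactive ssh session, which may not hold if it lives inside a conda env. The node names `node1`/`node2` and the helper names are hypothetical.)

```python
import shlex
import subprocess


def build_execute_commands(nodes, task_id):
    """Return one ssh command (as an argv list) per node for the same task ID."""
    remote = f"trains-agent execute --id {shlex.quote(task_id)}"
    return [["ssh", node, remote] for node in nodes]


def launch(nodes, task_id):
    """Start the task on all nodes at once and wait; returns the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in build_execute_commands(nodes, task_id)]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Dry run: only print the commands. Call launch(...) to actually run them.
    for cmd in build_execute_commands(["node1", "node2"], "aabbcc"):
        print(" ".join(cmd))
```

Starting both processes before waiting on either matters here, since the two copies need to be running simultaneously to connect to each other.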
Let's see how that goes 🙂
Yeah, i’m done with the test, now i can run it as you said
Not for now, i think it can only run on multiple GPUs on one node
I’ve added multi-node support to my code, and i found that our lab seems to only share user files, because i installed trains on one node but it doesn’t appear on the others
Can I assume that if we have two agents spinning the same experiment, your code will take it from there?
Is this true ?
it’s shared, but only user files: everything under the ~/ directory