Actually this is the only solution unless preemption is supported, i.e. aborting a running Task to free up an agent...
There is no "magic" solution for complex multi-node scheduling; even SLURM will essentially do the same...
Hi ExcitedFish86
Good question, how do you "connect" the 3 nodes? (i.e. what framework are you using?)
Let's start with a simple setup: multi-node DDP in PyTorch.
not really... what do you mean by "free" agent?
available agent, i.e. not running anything else.
I mean how long would instance 1 wait until instance 2 of the experiment is up and running?
In other words, what happens if all the nodes/agents are busy and we still "need" an additional instance?
This is basically like "pre-allocating" the nodes, only they wait in real time until the additional node joins them.
Agent A pulls the 3-node Task; the Task clones itself (Task B) and enqueues the clone on a "very high priority" queue. Task A waits until Task B is running. Agent B picks up Task B and starts running it, and Task A "talks" to Task B. This is the equivalent of "allocating 2 agents" (basically you have to reserve one and wait for the other to be available).
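If this is ClearML (the Task/agent/queue vocabulary suggests it), the flow might look roughly like this sketch; the queue name, project/task names and polling interval are my assumptions, not anything official:
```python
# Sketch of the "clone yourself and wait" flow, assuming the ClearML Task API
# (Task.clone / Task.enqueue). Queue name and poll interval are illustrative.
import time
from clearml import Task

task = Task.init(project_name="multi-node", task_name="ddp-3-nodes")

# Task A clones itself twice (the two extra nodes) and enqueues the clones
# on a dedicated high-priority queue so idle agents pick them up first.
clones = []
for i in range(2):
    clone = Task.clone(source_task=task, name=f"{task.name} (node {i + 2})")
    Task.enqueue(clone, queue_name="high_priority_queue")  # assumed queue name
    clones.append(clone)

# Task A blocks until every clone has been picked up by an agent and is
# actually running (the "wait for the other agent to be available" part).
while any(c.get_status() != "in_progress" for c in clones):
    time.sleep(15)
```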
BTW: is nvcc multi-node or multi-GPU? (I thought it was single-node multi-GPU)
So in theory you can clone yourself 2 extra times and push the clones into an execution queue, but the issue might actually be making sure the resources are available. What did you have in mind?
I thought some sort of gang-scheduling scheme should be implemented on top of the job.
Maybe the agents should somehow go through a barrier with a counter and wait there until enough agents have arrived.
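Something like that barrier could be sketched with a shared counter; here is an illustrative version using PyTorch's TCPStore as the meeting point (host, port, WORLD_SIZE and the NODE_RANK env var are all assumptions):
```python
# Illustrative counter barrier for gang scheduling, using PyTorch's TCPStore
# as the shared counter. Host, port, WORLD_SIZE and NODE_RANK are assumptions.
import os
import time
from datetime import timedelta
from torch.distributed import TCPStore

WORLD_SIZE = 3  # number of agents that must arrive before anyone proceeds
is_master = os.environ.get("NODE_RANK", "0") == "0"  # one node hosts the store

store = TCPStore("node0.example.com", 29500, WORLD_SIZE,
                 is_master, timeout=timedelta(minutes=30))

# Each agent increments the shared counter once on arrival...
store.add("arrived", 1)

# ...then waits until all WORLD_SIZE agents have checked in (add(key, 0)
# reads the current value without changing it).
while store.add("arrived", 0) < WORLD_SIZE:
    time.sleep(5)
# Every agent past this point is running at the same time, i.e. gang-scheduled.
```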
PyTorch DDP
With what backend? gloo? nvcc? openmpi?
The problem is not really for the agents to wait (this is easily solved by an additional high-priority queue); the problem is whether you will have a "free" agent at all... you see my point?
I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single-node tasks (high priority + wait) or the multi-node tasks (wait for a moment when there are enough available agents and only then allocate the resources).
I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter- and intra-node.
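Which also relates to the "how long does instance 1 wait" question above: with NCCL (or gloo), the rendezvous inside init_process_group already acts as that barrier, blocking each process until all WORLD_SIZE peers have joined. A minimal sketch (master address/port are assumptions; RANK, WORLD_SIZE and LOCAL_RANK are the standard torch launcher env vars):
```python
# Minimal multi-node DDP init with the NCCL backend. The rendezvous inside
# init_process_group blocks each process until all WORLD_SIZE peers join,
# i.e. "instance 1" simply waits here until the other nodes come up.
# The master address/port are illustrative assumptions.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(
    backend="nccl",  # NCCL handles both inter-node and intra-node GPU comms
    init_method="tcp://node0.example.com:29500",
    world_size=int(os.environ["WORLD_SIZE"]),  # e.g. 3 (one process per node)
    rank=int(os.environ["RANK"]),
)

local_gpu = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_gpu)
model = DDP(torch.nn.Linear(10, 10).cuda(local_gpu), device_ids=[local_gpu])
```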