So in a simple "all-or-nothing"
Actually this is the only solution unless preemption is supported, i.e. abort running Task to free-up an agent...
There is no "magic" solution for complex multi-node scheduling, even SLURM will essentially do the same ...
I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single node tasks (high priority + wait) or multi-node tasks (wait for a time when there are enough available agents and only then allocate the resource).
I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter- and intra- node
available agent, i.e. not running anything else.
I mean how long would instance 1 wait until instance 2 of the experiment is up and running?
In other words what happens of all the nodes/agents are working and we still "need" additional instance.
This is basically like "pre-allocating" the nodes, only they wait in real-time until the additional node joins them.
Agent A pulls the 3 node Task, the Task clones itself (Task B) and enqueues on "very high priory queue" Task A wait until Task B is running. Agent B picks Task B and starts running Task A "talks" to Task BThis is the equivalent of "allocating 2 agents" (basically you have to preserve one and wait for the other to be available).
BTW: Is nvcc multi Node or multi GPU ? (I thought it is a single node multi-gpu)