Unanswered
Hi All,
I Have A Question Regarding Multi-Node Training Using The Clearml-Agent. What Is The Recommended Setup In This Case? Say I Have 3 Nodes With 3 Agents Running On Them. How Do I Make Sure They All Run The Same Job?
I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single node tasks (high priority + wait) or multi-node tasks (wait for a time when there are enough available agents and only then allocate the resource).
I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter- and intra- node
169 Views
0
Answers
3 years ago
one year ago