Hi all, I have a question regarding multi-node training using the clearml-agent. What is the recommended setup in this case? Say I have 3 nodes with 3 agents running on them. How do I make sure they all run the same job?


I see what you mean. So in a simple "all-or-nothing" solution I have to choose which kind of task can starve: either the single-node tasks (give the multi-node task high priority and hold agents idle until enough are free) or the multi-node tasks (wait until enough agents happen to be available and only then allocate the resources).
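To make that trade-off concrete, here is a minimal sketch of an all-or-nothing allocator. It is not ClearML's scheduler; the `schedule` function, the `reserve_for_head` flag, and the task names are hypothetical and only illustrate the two starvation modes.

```python
def schedule(queue, free_agents, reserve_for_head=False):
    """Toy all-or-nothing allocator (illustrative only, not a ClearML API).

    queue: list of (task_name, agents_needed) in priority order.
    free_agents: number of idle agents right now.
    reserve_for_head: if True, stop at the first task that does not fit,
        keeping the idle agents reserved for it (lower-priority tasks wait).
        If False, skip tasks that do not fit (large tasks may wait forever).
    """
    started = []
    for name, needed in queue:
        if needed <= free_agents:
            # The task gets all of its agents at once, or none at all.
            free_agents -= needed
            started.append(name)
        elif reserve_for_head:
            # Hold the remaining agents idle for the task that did not fit.
            break
    return started


# Hypothetical queue: one 3-node job ahead of two single-node jobs, 2 idle agents.
queue = [("multi-node", 3), ("single-a", 1), ("single-b", 1)]
print(schedule(queue, free_agents=2))                         # single-node jobs run
print(schedule(queue, free_agents=2, reserve_for_head=True))  # nothing runs yet
```

With skipping, the single-node jobs keep taking agents and the 3-node job may never see three idle at once. With reservation, the two idle agents sit unused until a third frees up.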

I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter-node and intra-node.
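As a rough illustration of the inter- vs. intra-node distinction, the sketch below assumes the common contiguous rank layout, where each node owns a consecutive block of global ranks. The helper names and the 2-GPUs-per-node figure are assumptions for the example, not something from this thread.

```python
def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank of a process, assuming contiguous ranks per node."""
    return node_rank * gpus_per_node + local_rank


def link_type(rank_a, rank_b, gpus_per_node):
    """Classify the path between two ranks as intra-node or inter-node."""
    same_node = rank_a // gpus_per_node == rank_b // gpus_per_node
    return "intra-node" if same_node else "inter-node"


# Hypothetical cluster: 3 nodes, 2 GPUs each, so 6 ranks in total.
GPUS_PER_NODE = 2
r0 = global_rank(node_rank=0, local_rank=0, gpus_per_node=GPUS_PER_NODE)
r1 = global_rank(node_rank=0, local_rank=1, gpus_per_node=GPUS_PER_NODE)
r4 = global_rank(node_rank=2, local_rank=0, gpus_per_node=GPUS_PER_NODE)
print(link_type(r0, r1, GPUS_PER_NODE))  # same machine
print(link_type(r0, r4, GPUS_PER_NODE))  # across the network
```

Ranks on the same machine can talk over local interconnects, while ranks on different machines go over the network, which is why all three agents need to be running the same job at the same time.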

  				
Posted 3 years ago
ExcitedFish86
