Hi SubstantialElk6, maybe SuccessfulKoala55 might have more input on this 🙂
[Distributed Training] Hi, I have a ClearML setup with K8s Glue that spins up pods of 4 GPUs when picking tasks off the ClearML queue. We now want to proceed with multi-node training, and some of the examples we are trying are here:
https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide
We have yet to try this, but I understand that all logs and scalars are consolidated at RANK0 and that ClearML simply pulls them from RANK0. My questions are as follows:
1. For the torch.distributed.launch and torchrun examples above, how should we launch the master and each worker via ClearML queues?
2. If we managed to launch them that way, how would we know the IP addresses? This information is required a priori, and since ClearML is launching K8s pods, I won't have publicly addressable IP addresses.
3. Same question for the mpirun example: how do we do the above with ClearML queues and without knowledge of the IP addresses?
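For context on the RANK0 point, here is a minimal sketch of the common pattern (an assumption about how the training script is written, not a statement of ClearML's internal behaviour): only the rank-0 process creates the ClearML task, so all scalars and console output land in one place. The project/task names are placeholders.

```python
import os
from clearml import Task

def init_clearml_if_rank0():
    # torchrun / torch.distributed.launch export RANK for every process it spawns
    rank = int(os.environ.get("RANK", "0"))
    task = None
    if rank == 0:
        # Only rank 0 reports to ClearML, so logs and scalars from the whole
        # job are consolidated under a single task.
        task = Task.init(project_name="multi-node", task_name="ddp-train")
    return task
```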
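And for reference on why the address is needed a priori: with the static rendezvous used in the guide above, every process on every node must see the same MASTER_ADDR/MASTER_PORT before init_process_group is called. A rough sketch (the env-var names are the ones torchrun / torch.distributed.launch set; the launch command in the comment is illustrative):

```python
import os
import torch.distributed as dist

def setup_ddp():
    # A launch such as:
    #   torchrun --nnodes=N --nproc_per_node=4 --node_rank=R \
    #            --master_addr=<rank-0 address> --master_port=29500 train.py
    # exports these variables into every worker process:
    rank = int(os.environ["RANK"])              # global rank of this process
    world_size = int(os.environ["WORLD_SIZE"])  # total number of processes
    # MASTER_ADDR / MASTER_PORT must point at the rank-0 node, which is why
    # the address (or at least a resolvable name, e.g. a K8s service DNS name
    # instead of a public IP) has to be known before the workers start.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```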