Hi SubstantialElk6, maybe SuccessfulKoala55 might have more input on this 🙂
[Distributed Training] Hi, I have a ClearML setup with K8s Glue that spins up pods of 4 GPUs when picking tasks off the ClearML queue. We now want to proceed with multi-node training, and some of the examples we are trying are here:
https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide
We have yet to try this, but I understand that all logs and scalars are consolidated at RANK0 and that ClearML simply pulls them from RANK0. My questions are as follows:
1. For the torch.distributed.launch and torchrun examples above, how should we launch the master and each worker via ClearML queues?
2. If we managed to launch them that way, how would we know the IP addresses? This information is required a priori, and since ClearML is launching K8s pods, I won't have publicly addressable IP addresses.
3. Same question for the mpirun example: how do we do the above with ClearML queues and without knowledge of the IP addresses?
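For context on the RANK0 point, here is a minimal sketch of the common pattern (an assumption about how the training script is written, not a statement of ClearML's internal behaviour): only the rank-0 process creates the ClearML task, so all scalars and console output land in one place. The project/task names are placeholders.

```python
import os
from clearml import Task

def init_clearml_if_rank0():
    # torchrun / torch.distributed.launch export RANK for every process it spawns
    rank = int(os.environ.get("RANK", "0"))
    task = None
    if rank == 0:
        # Only rank 0 reports to ClearML, so logs and scalars from the whole
        # job are consolidated under a single task.
        task = Task.init(project_name="multi-node", task_name="ddp-train")
    return task
```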
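And for reference on why the address is needed a priori: with the static rendezvous used in the guide above, every process on every node must see the same MASTER_ADDR/MASTER_PORT before init_process_group is called. A rough sketch (the env-var names are the ones torchrun / torch.distributed.launch set; the launch command in the comment is illustrative):

```python
import os
import torch.distributed as dist

def setup_ddp():
    # A launch such as:
    #   torchrun --nnodes=N --nproc_per_node=4 --node_rank=R \
    #            --master_addr=<rank-0 address> --master_port=29500 train.py
    # exports these variables into every worker process:
    rank = int(os.environ["RANK"])              # global rank of this process
    world_size = int(os.environ["WORLD_SIZE"])  # total number of processes
    # MASTER_ADDR / MASTER_PORT must point at the rank-0 node, which is why
    # the address (or at least a resolvable name, e.g. a K8s service DNS name
    # instead of a public IP) has to be known before the workers start.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```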