Hi,
I am trying to set up multi-node training with PyTorch DistributedDataParallel. DDP requires a launch script with a set of parameters to be run on each node. One of these parameters is the master node address. I am currently using the following scheme:
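(The actual scheme referenced above isn't shown in this thread. For context, here is just a generic sketch of the per-node parameters a DDP launch script typically passes in, using PyTorch's usual environment-variable convention for the master address/port, rank, and world size; it is not the poster's setup.)

```python
import os
import torch
import torch.distributed as dist

def init_ddp():
    # Every node needs the master (rank 0) address/port, plus its own rank
    # and the total world size; these are what the launch script must supply.
    master_addr = os.environ["MASTER_ADDR"]
    master_port = os.environ.get("MASTER_PORT", "29500")
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group(
        backend="nccl",  # use "gloo" on CPU-only nodes
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )

if __name__ == "__main__":
    init_ddp()
    model = torch.nn.Linear(10, 1).cuda()
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
```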
Thanks for the answer!
the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct?
Yes, I basically plan to use ClearML as a user-friendly cluster manager
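(A rough sketch of that idea, assuming the master Task clones itself once per extra node and enqueues the clones for ClearML agents to pick up. The project name, queue name, parameter keys, and the placeholder address below are all assumptions for illustration, not part of the original setup.)

```python
from clearml import Task

task = Task.init(project_name="ddp-demo", task_name="ddp-master")

world_size = 4               # total number of nodes (assumed for the example)
worker_queue = "ddp-workers" # hypothetical agent queue

for node_rank in range(1, world_size):
    # Clone the master Task for each worker node and pass its DDP parameters
    # through the cloned Task's hyperparameters.
    worker = Task.clone(source_task=task, name=f"ddp-worker-{node_rank}")
    worker.set_parameter("Args/node_rank", node_rank)
    worker.set_parameter("Args/master_addr", "10.0.0.1")  # placeholder address
    Task.enqueue(worker, queue_name=worker_queue)
```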
Regarding Task pollution, when the master node is done, it can delete all child/cloned Tasks so it is easier on the eyes.
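(A minimal sketch of that cleanup step, assuming the cloned worker Tasks record the master Task as their parent; the "parent" filter key is an assumption about the backend filter fields.)

```python
from clearml import Task

master = Task.current_task()

# Find the Tasks that were cloned from this master and remove them once done.
children = Task.get_tasks(task_filter={"parent": master.id})
for child in children:
    child.delete(delete_artifacts_and_models=True)
```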
Nice idea, I will try it out!