Unanswered
Hi,
I am trying to set up multi-node training with PyTorch DistributedDataParallel. DDP requires a launch script with a set of parameters to be run on each node. One of these parameters is the master node address. I am currently using the following scheme:
Looks like service-writing time to me!
Nice!
Persist/restore state so that tasks are restartable?
You mean if you write preemption-ready training code?
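For what it's worth, here is a minimal sketch of what "preemption-ready" could look like. It is not the original poster's setup: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON file format, and the `stop_after` knob used to simulate a preemption are all made up for illustration. The idea is to write the checkpoint to a temporary file and rename it into place, so a node killed mid-write never leaves a half-written checkpoint, and to resume from the last saved step on restart.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a kill mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss_history": []}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every=10, stop_after=None):
    """Toy training loop that resumes from the last checkpoint.

    stop_after is a hypothetical knob that simulates a preemption after
    that many steps in the current run.
    """
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return state  # simulated preemption: unsaved progress is lost
        state["step"] += 1
        state["loss_history"].append(1.0 / state["step"])  # stand-in for a real loss
        steps_this_run += 1
        if state["step"] % save_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state
```

In real PyTorch code the state would presumably hold `model.state_dict()` and `optimizer.state_dict()` and be written with `torch.save` instead of JSON, and with DDP you would typically save from a single rank only. The atomic-rename and resume-from-step logic stays the same.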
2 years ago
one year ago