Hi, I am trying to set up multi-node training with PyTorch DistributedDataParallel. DDP requires a launch script with a set of parameters to be run on each node. One of these parameters is the master node address. I am currently using the following scheme:

Unanswered

AgitatedDove14 looks like service-writing-time for me!
PS: can you point me to an official example or doc on how to persist and restore state so that tasks are restartable?
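
For readers following along, the per-node launch parameters mentioned in the question can be sketched roughly as below. This is not the scheme from the original post (which is not shown here), only a minimal illustration assuming the standard `torchrun` launcher. The helper name `build_torchrun_command`, the script name `train.py`, the port `29500`, and the address `10.0.0.1` are all placeholders chosen for the example.

```python
from typing import List


def build_torchrun_command(
    master_addr: str,
    node_rank: int,
    nnodes: int,
    nproc_per_node: int,
    script: str = "train.py",   # hypothetical training script name
    master_port: int = 29500,   # assumed to be a free port on the master node
) -> List[str]:
    """Return the torchrun argv that a single node would execute."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",  # identical on every node
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Example: 2 nodes with 4 processes each; 10.0.0.1 is a placeholder address.
    for rank in range(2):
        print(" ".join(build_torchrun_command("10.0.0.1", rank, nnodes=2, nproc_per_node=4)))
```

Every node runs the same command except for `--node_rank`, which is why the master address has to be known before any node starts.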

  				
Posted 3 years ago
RoughTiger69
