Unanswered

Hi, I am trying to set up multi-node training with PyTorch DistributedDataParallel. DDP requires a launch script with a set of parameters to be run on each node. One of these parameters is the master node address. I am currently using the following scheme:

` task = Task.init(...)
# assume model checkpoint
if task.models['output']:
    # get the latest checkpoint
    model_file_or_path = task.models['output'][-1].get_local_copy()
    # load the model checkpoint
# run training code `

RoughTiger69 Would the above work for you?
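For reference on the master node address mentioned in the question: PyTorch's `env://` initialization reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` from the environment, and launchers typically also set `LOCAL_RANK`. Below is a minimal sketch of how those values could be derived on each node. The helper name `ddp_env`, the address `10.0.0.1` and the port `29500` are hypothetical placeholders, not part of the original post or of any library.

```python
def ddp_env(master_addr, master_port, node_rank, nnodes, nproc_per_node):
    """Build one environment dict per local process on this node.

    Hypothetical helper: it only computes the variables that
    torch.distributed's env:// initialization reads.
    """
    world_size = nnodes * nproc_per_node
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            "MASTER_ADDR": master_addr,        # same value on every node
            "MASTER_PORT": str(master_port),   # same value on every node
            "WORLD_SIZE": str(world_size),     # total processes across all nodes
            "RANK": str(node_rank * nproc_per_node + local_rank),  # global rank
            "LOCAL_RANK": str(local_rank),     # process index within this node
        })
    return envs


# Example: the second of two nodes (node_rank=1), four processes per node.
# Address and port are placeholder values.
envs = ddp_env("10.0.0.1", 29500, node_rank=1, nnodes=2, nproc_per_node=4)
for env in envs:
    print(env["RANK"], env["LOCAL_RANK"], env["WORLD_SIZE"])
```

Each dict would be merged into the environment of one training process before it calls `torch.distributed.init_process_group`. Only the node rank differs between nodes; the master address and port must be identical everywhere.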

  				
Posted 3 years ago

					AgitatedDove14
				

302 Views

0 Answers
