Hi,
I am trying to set up multi-node training with PyTorch DistributedDataParallel. DDP requires a launch script with a set of parameters to be run on each node. One of these parameters is the master node address. I am currently using the following scheme:
` from clearml import Task
task = Task.init(...)
# assume model checkpoint
if task.models['output']:
    # get the latest checkpoint
    model_file_or_path = task.models['output'][-1].get_local_copy()
    # load the model checkpoint
# run training code `
RoughTiger69 Would the above work for you?
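For reference, here is a rough sketch of how the per-node DDP launch parameters mentioned in the question could be assembled into a `torchrun` command. It is only an illustration, not the scheme used in this thread: the helper name `build_torchrun_command`, the script name `train.py`, the address `10.0.0.1`, and port `29500` are all placeholder assumptions.

```python
import shlex


def build_torchrun_command(master_addr, master_port, nnodes, node_rank,
                           nproc_per_node, script="train.py"):
    """Return the torchrun argument list for a single node.

    Hypothetical helper: every node gets the same master address and port,
    and only node_rank differs between nodes.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,  # placeholder training script name
    ]


if __name__ == "__main__":
    # Print the command each of two nodes would run (placeholder values).
    for rank in range(2):
        cmd = build_torchrun_command("10.0.0.1", 29500, 2, rank, 4)
        print(shlex.join(cmd))
```

The master address is the one value that has to be known before any node starts, which is why it usually needs to be discovered or passed in from outside the training script.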
2 years ago
one year ago