Answered


Hi,
I am trying to set up multi-node training with PyTorch DistributedDataParallel. DDP requires a launch script with a set of parameters to be run on each node. One of these parameters is the master node address. I am currently using the following scheme:
1. Run a script on the local computer that creates the main task and sends it for remote execution with task.execute_remotely().
2. This task is picked up by the first agent; it runs the DDP launch script for itself, then creates clones of itself with task.create_function_task() and passes its address as an argument to the function.
3. The cloned tasks are enqueued and picked up by other agents.

Is this the correct way to do it? It surely has some drawbacks: the queue is polluted with lots of cloned tasks that have to be aborted manually, and the whole job requires only one idle agent to start, which can lead to several jobs blocking each other.
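
For readers following along, the scheme described in this question could be sketched roughly as below. This is only a sketch under assumptions, not a confirmed recipe: `torchrun` is assumed to be the DDP launch script, and the queue name, project and task names, port 29500, node count, `train.py`, and the helpers `build_launch_cmd`, `run_node`, and `main` are all hypothetical.

```python
import socket
import subprocess


def build_launch_cmd(node_rank, nnodes, master_addr, master_port=29500,
                     nproc_per_node=1, script="train.py"):
    """Build the torchrun command line for one node (all values are illustrative)."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


def run_node(node_rank, nnodes, master_addr, master_port=29500):
    """Hypothetical entry point executed on every node, master included."""
    cmd = build_launch_cmd(node_rank, nnodes, master_addr, master_port)
    subprocess.run(cmd, check=True)


def main(nnodes=2, queue="default"):
    # Imported lazily so the helpers above can be used without ClearML installed
    from clearml import Task

    task = Task.init(project_name="ddp-demo", task_name="ddp master")
    # Stop the local run and re-launch this script on whichever agent picks it up
    task.execute_remotely(queue_name=queue)

    # From here on the code runs on the first agent, which acts as the master node.
    # Note: on some hosts this lookup returns a loopback address; adjust as needed.
    master_addr = socket.gethostbyname(socket.gethostname())

    workers = []
    for rank in range(1, nnodes):
        # Clone this task so that the clone calls run_node(...) with these arguments
        worker = task.create_function_task(
            run_node, func_name=f"node_{rank}", task_name=f"ddp worker {rank}",
            node_rank=rank, nnodes=nnodes, master_addr=master_addr,
        )
        Task.enqueue(worker, queue_name=queue)
        workers.append(worker)

    run_node(0, nnodes, master_addr)  # the master itself is rank 0
    return workers


# main()  # call this at the entry point of the actual script
```

Keeping the list of created worker tasks around makes it possible to clean them up once the master finishes.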

  				
Posted 3 years ago by UnevenHorse85


Answers 11

@ https://app.slack.com/team/UT8T0V3NE is there support, in a non-free version, for preempting lower-priority tasks so that a higher-priority task can come in?

  				
Posted 2 years ago by RoughTiger69

This task is picked up by first agent; it runs DDP launch script for itself and then creates clones of itself with task.create_function_task() and passes its address as argument to the function

Hi UnevenHorse85
Interesting use case. Just for my understanding: the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct?

passes its address as argument to the function

This seems like a great solution.

the queue is polluted with lots of cloned tasks that have to be aborted manually, and the whole job only requires only ...

I wouldn't say the queue pollution is the issue (or the multiple copies of the cloned Tasks), I think the main issue here is that the allocated nodes have to wait until all nodes are allocated, no?
Regarding Task pollution: when the master node is done, it can delete all child/cloned Tasks so it is easier on the eyes. This way, if something goes wrong in one of the nodes, you have full visibility, but when everything works, you end up with a clean single copy.
wdyt?
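
A minimal sketch of that cleanup step, assuming the master kept the list of worker Task objects it created. The helper name `cleanup_children` and the `keep_failed` flag are hypothetical; the only ClearML calls relied on are `Task.get_status()` and `Task.delete()`.

```python
def cleanup_children(children, keep_failed=True):
    """Delete worker tasks once the master is done.

    `children` is the list of Task objects the master created. Each item
    only needs `get_status()` and `delete()`, which ClearML Tasks provide.
    Returns the tasks that were kept for inspection.
    """
    kept = []
    for child in children:
        if keep_failed and child.get_status() != "completed":
            kept.append(child)  # leave it in place so the failure stays visible
            continue
        child.delete()
    return kept
```

Called at the end of the master's run, this leaves a single clean copy when everything worked and keeps any worker that did not complete.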

  				
Posted 3 years ago by AgitatedDove14

preempting lower priority tasks to allow a higher priority task to come in

Well, this is usually outside the scope of a "single researcher" / "tiny team"...
This is typically a large-scale problem.
That said, it would be fairly easy to write a service that aborts Tasks, tags them to be "continued", and then later (at night?!) pushes them back into a queue... wdyt?
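
A rough sketch of what such a service could look like. Everything here is an assumption rather than an official recipe: the helper names, the choice of which running tasks count as lower priority (left to the caller), and in particular the assumption that marking a Task as stopped from the service makes the agent-side process abort, which should be verified on a real setup.

```python
CONTINUE_TAG = "continued"  # tag name taken from the suggestion above


def preempt(running_tasks, slots_needed, tag=CONTINUE_TAG):
    """Abort up to `slots_needed` tasks and tag them so they can be resumed later.

    `running_tasks` should be ordered with the lowest-priority task first.
    """
    preempted = []
    for task in running_tasks[:slots_needed]:
        task.add_tags([tag])
        task.mark_stopped(force=True)
        preempted.append(task)
    return preempted


def requeue(tasks, enqueue, queue_name):
    """Push tagged tasks that are still stopped back into `queue_name`."""
    requeued = []
    for task in tasks:
        if task.get_status() != "stopped":
            continue  # already running again or otherwise handled
        enqueue(task, queue_name=queue_name)
        requeued.append(task)
    return requeued


def requeue_tagged(queue_name, tag=CONTINUE_TAG):
    """ClearML-backed variant: look up tagged Tasks and re-enqueue them."""
    from clearml import Task  # imported lazily; only needed for the real service

    return requeue(Task.get_tasks(tags=[tag]), Task.enqueue, queue_name)
```

The re-enqueued Tasks only make progress if the training code can resume from its last checkpoint.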

  				
Posted 2 years ago by AgitatedDove14

AgitatedDove14 looks like service-writing-time for me!
PS: can you point me to an official example/doc on how to persist/restore state so that tasks are restartable?

  				
Posted 2 years ago by RoughTiger69

looks like service-writing-time for me!

Nice!

persist/restore state so that tasks are restartable?

You mean, if you write preemption-ready training code?

  				
Posted 2 years ago by AgitatedDove14

AgitatedDove14 yes

  				
Posted 2 years ago by RoughTiger69

Yes, I basically plan to use ClearML as a user-friendly cluster manager

and it is 🙂
I think the main "drawback" is that you cannot "reserve" nodes for the multi-node training. The easiest solution is to have a high-priority queue that is never used otherwise, and then have the DDP master process push into that high-priority queue, which will ensure these are the next Tasks to be executed. (Now the only thing missing is preemption of running Tasks, but this automation policy is unfortunately not part of the open-source version.)
wdyt?
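
To illustrate why the otherwise unused high-priority queue helps, here is a simplified model of an agent that pulls from its queues in priority order, as it would when started with something like `clearml-agent daemon --queue ddp_high default`. The queue names, task names, and the `next_task` helper are hypothetical, and this is a toy model, not the agent's actual code.

```python
from collections import deque


def next_task(queues, priority_order):
    """Return (queue_name, task) from the first non-empty queue, or None."""
    for name in priority_order:
        if queues[name]:
            return name, queues[name].popleft()
    return None


# Regular jobs wait in the default queue; the high-priority queue starts empty
queues = {"ddp_high": deque(), "default": deque(["job_a", "job_b"])}

# The DDP master pushes its worker tasks into the high-priority queue
queues["ddp_high"].extend(["ddp_worker_1", "ddp_worker_2"])

# Agents that become idle serve the DDP workers before any regular job
order = ["ddp_high", "default"]
picked = [next_task(queues, order) for _ in range(3)]
```

The model also shows the limitation mentioned above: tasks that are already running are not interrupted, so the DDP workers still wait for agents to become idle.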

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 let me reach out to my pocket there 😉

  				
Posted 2 years ago by RoughTiger69

RoughTiger69 yes I think "Scale" tier covers it 😉

  				
Posted 2 years ago by AgitatedDove14

Thanks for the answer!

the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct?

Yes, I basically plan to use ClearML as a user-friendly cluster manager

Regarding Task pollution: when the master node is done, it can delete all child/cloned Tasks so it is easier on the eyes.

Nice idea, I will try it out!

  				
Posted 3 years ago by UnevenHorse85

```
from clearml import Task

task = Task.init(...)

# assume the task stores model checkpoints as output models
if task.models['output']:
    # get the latest checkpoint
    model_file_or_path = task.models['output'][-1].get_local_copy()
    # load the model checkpoint

# run training code
```
RoughTiger69 Would the above work for you?
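
For the saving side, a preemption-ready loop could look roughly like the sketch below. It is only an illustration: a JSON file stands in for a real model checkpoint (in PyTorch this would be `torch.save` / `torch.load`), and the `train` function, its parameters, and the placeholder "loss" are hypothetical. In a ClearML Task, the checkpoint path would come from the latest output model as in the snippet above.

```python
import json
import os


def train(checkpoint_path, total_steps, save_every=2, stop_after=None):
    """Resume from `checkpoint_path` if it exists, then train up to `total_steps`."""
    state = {"step": 0, "loss": None}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # restore the persisted state

    steps_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_run >= stop_after:
            break  # simulates the task being aborted mid-run
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real training step
        steps_run += 1
        if state["step"] % save_every == 0 or state["step"] == total_steps:
            # Write to a temporary file first so an abort never leaves a partial checkpoint
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state
```

Work done after the last saved checkpoint is repeated on resume, so `save_every` trades checkpoint overhead against lost progress.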

  				
Posted 2 years ago by AgitatedDove14
