Hi, I am trying to set up multi-node training with PyTorch DistributedDataParallel. DDP requires a launch script with a set of parameters to be run on each node. One of these parameters is the master node address. I am currently using the following scheme:

This task is picked up by the first agent; it runs the DDP launch script for itself, then creates clones of itself with task.create_function_task() and passes its address as an argument to the function.
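For readers following along, the scheme above could be sketched roughly as below. This is only an illustration under assumptions, not the poster's actual code: the helper names `build_torchrun_cmd` and `spawn_worker_tasks`, the script name `train.py`, and the port 29500 are hypothetical choices, and `task` is assumed to be the master's ClearML Task object. The extra keyword arguments given to `create_function_task()` are forwarded to the worker function.

```python
def build_torchrun_cmd(master_addr, node_rank, nnodes,
                       nproc_per_node=1, master_port=29500,
                       script="train.py"):
    """Build the torchrun argument list for one node.

    `script`, `master_port` and `nproc_per_node` are illustrative defaults.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


def spawn_worker_tasks(task, worker_fn, master_addr, nnodes):
    """Create one function task per non-master node (ranks 1..nnodes-1).

    `task` is assumed to be the master's ClearML Task; the keyword
    arguments after `task_name` are passed on to `worker_fn`.
    """
    clones = []
    for rank in range(1, nnodes):
        clone = task.create_function_task(
            func=worker_fn,
            func_name=f"ddp_node_{rank}",
            task_name=f"ddp node {rank}",
            master_addr=master_addr,
            node_rank=rank,
            nnodes=nnodes,
        )
        clones.append(clone)
    return clones
```

Each returned clone would presumably still need to be enqueued (for example with `Task.enqueue(clone, queue_name=...)`) so that another agent picks it up, and the worker function would run the command from `build_torchrun_cmd` with its own `node_rank`.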

Hi UnevenHorse85
Interesting use case. Just for my understanding: the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct?

passes its address as an argument to the function

This seems like a great solution.

the queue is polluted with lots of cloned tasks that have to be aborted manually, and the whole job requires only ...

I wouldn't say the queue pollution (or the multiple copies of the cloned Tasks) is the issue. I think the main issue here is that the allocated nodes have to wait until all nodes are allocated, no?
Regarding Task pollution: when the master node is done, it can delete all child/cloned Tasks so it is easier on the eyes. This way, if something goes wrong in one of the nodes you have full visibility, but when everything works you end up with a clean single copy.
wdyt?
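A rough sketch of that cleanup idea, assuming the master keeps the list of clones it created and that each clone exposes a `delete()` method that returns a truthy value on success (as ClearML Task objects are expected to). The helper name `cleanup_clones` and the `all_succeeded` flag are hypothetical:

```python
def cleanup_clones(clones, all_succeeded):
    """Delete the cloned tasks only when the whole job worked.

    On failure nothing is deleted, so every node's task stays
    visible for debugging. Returns the number of deleted clones.
    """
    if not all_succeeded:
        return 0
    deleted = 0
    for clone in clones:
        # delete() is assumed to return True when the task was removed
        if clone.delete():
            deleted += 1
    return deleted
```

The master would call this once its own training process has exited, passing whatever success signal it has (for example, the exit code of the launch script).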

  				
Posted 3 years ago · AgitatedDove14
