Actually this is the only solution unless preemption is supported, i.e. aborting a running Task to free up an agent...
There is no "magic" solution for complex multi-node scheduling; even SLURM will essentially do the same...
Hi ExcitedFish86
Good question, how do you "connect" the 3 nodes? (i.e. what framework are you using?)
Let's start with a simple setup: multi-node DDP in PyTorch.
not really... what do you mean by "free" agent?
available agent, i.e. not running anything else.
I mean how long would instance 1 wait until instance 2 of the experiment is up and running?
In other words, what happens if all the nodes/agents are busy and we still "need" an additional instance?
This is basically like "pre-allocating" the nodes, only they wait in real time until the additional node joins them.
Agent A pulls the 3-node Task; the Task clones itself (Task B) and enqueues the clone on a "very high priority" queue. Task A waits until Task B is running. Agent B picks up Task B and starts running it, and Task A "talks" to Task B. This is the equivalent of "allocating 2 agents" (basically you have to reserve one and wait for the other to be available).
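If this is ClearML (the Task/agent/queue vocabulary suggests it), the flow might look roughly like this sketch; the queue name, project/task names and polling interval are my assumptions, not anything official:
```python
# Sketch of the "clone yourself and wait" flow, assuming the ClearML Task API
# (Task.clone / Task.enqueue). Queue name and poll interval are illustrative.
import time
from clearml import Task

task = Task.init(project_name="multi-node", task_name="ddp-3-nodes")

# Task A clones itself twice (the two extra nodes) and enqueues the clones
# on a dedicated high-priority queue so idle agents pick them up first.
clones = []
for i in range(2):
    clone = Task.clone(source_task=task, name=f"{task.name} (node {i + 2})")
    Task.enqueue(clone, queue_name="high_priority_queue")  # assumed queue name
    clones.append(clone)

# Task A blocks until every clone has been picked up by an agent and is
# actually running (the "wait for the other agent to be available" part).
while any(c.get_status() != "in_progress" for c in clones):
    time.sleep(15)
```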
BTW: is nvcc multi-node or multi-GPU? (I thought it was single-node multi-GPU)
So in theory you can clone yourself 2 extra times and push the clones into an execution queue, but the issue might actually be making sure the resources are available. What did you have in mind?
I thought some sort of gang-scheduling scheme should be implemented on top of the job.
Maybe the agents should somehow go through a barrier with a counter and wait there until enough agents have arrived.
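Something like that barrier could be sketched with a shared counter; here is an illustrative version using PyTorch's TCPStore as the meeting point (host, port, WORLD_SIZE and the NODE_RANK env var are all assumptions):
```python
# Illustrative counter barrier for gang scheduling, using PyTorch's TCPStore
# as the shared counter. Host, port, WORLD_SIZE and NODE_RANK are assumptions.
import os
import time
from datetime import timedelta
from torch.distributed import TCPStore

WORLD_SIZE = 3  # number of agents that must arrive before anyone proceeds
is_master = os.environ.get("NODE_RANK", "0") == "0"  # one node hosts the store

store = TCPStore("node0.example.com", 29500, WORLD_SIZE,
                 is_master, timeout=timedelta(minutes=30))

# Each agent increments the shared counter once on arrival...
store.add("arrived", 1)

# ...then waits until all WORLD_SIZE agents have checked in (add(key, 0)
# reads the current value without changing it).
while store.add("arrived", 0) < WORLD_SIZE:
    time.sleep(5)
# Every agent past this point is running at the same time, i.e. gang-scheduled.
```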
PyTorch DDP
With what backend? gloo? nvcc? openmpi?
The problem is not really for the agents to wait (this is easily solved by an additional high-priority queue); the problem is whether you will have a "free" agent at all... you see my point?
I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single-node tasks (high priority + wait) or the multi-node tasks (wait for a moment when there are enough available agents and only then allocate the resources).
I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter- and intra-node.
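Which also relates to the "how long does instance 1 wait" question above: with NCCL (or gloo), the rendezvous inside init_process_group already acts as that barrier, blocking each process until all WORLD_SIZE peers have joined. A minimal sketch (master address/port are assumptions; RANK, WORLD_SIZE and LOCAL_RANK are the standard torch launcher env vars):
```python
# Minimal multi-node DDP init with the NCCL backend. The rendezvous inside
# init_process_group blocks each process until all WORLD_SIZE peers join,
# i.e. "instance 1" simply waits here until the other nodes come up.
# The master address/port are illustrative assumptions.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(
    backend="nccl",  # NCCL handles both inter-node and intra-node GPU comms
    init_method="tcp://node0.example.com:29500",
    world_size=int(os.environ["WORLD_SIZE"]),  # e.g. 3 (one process per node)
    rank=int(os.environ["RANK"]),
)

local_gpu = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_gpu)
model = DDP(torch.nn.Linear(10, 10).cuda(local_gpu), device_ids=[local_gpu])
```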