If I have ClearML agents installed on several servers, each with a single GPU, how can I train a GPT-2 model that requires multiple GPUs?

Answered

Hi, if I have ClearML agents installed on several servers, each with a single GPU, how can I train a GPT-2 model that requires multiple GPUs?

  				
Posted 2 years ago by SubstantialElk6

Answers 8

ClearML usually just moves the execution down to the nodes. I'm unsure what role ClearML is playing in your issue.

  				
Posted 2 years ago by ScaryJellyfish75

I would recommend you start by getting familiar with the distributed training modes (for example, DDP in PyTorch). There are some important concepts you need to understand to train across multiple GPUs and multiple machines.

Before you start with a sophisticated model, I'd recommend trying this training setup with a baseline model and checking that data, gradients, weights, metrics, etc. are synced correctly.
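To make those concepts a bit more concrete: with PyTorch's `env://` initialization, each process learns its place in the job from four environment variables, `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`. The sketch below only builds those variables for one process per single-GPU server. The helper name `ddp_env`, the address `10.0.0.5`, and the port `29500` are illustrative assumptions, not something from this thread.

```python
def ddp_env(master_addr: str, master_port: int, rank: int, world_size: int) -> dict:
    """Build the environment variables read by torch.distributed's env:// init method."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} must be in [0, {world_size})")
    return {
        "MASTER_ADDR": master_addr,       # address of the rank-0 node, reachable by all others
        "MASTER_PORT": str(master_port),  # free TCP port on the rank-0 node
        "RANK": str(rank),                # unique index of this process in the job
        "WORLD_SIZE": str(world_size),    # total number of processes in the job
    }


if __name__ == "__main__":
    # Hypothetical four-node job: one process (and one GPU) per server.
    for rank in range(4):
        print(ddp_env("10.0.0.5", 29500, rank, 4))
```

Each node would export its own variables and then call `torch.distributed.init_process_group(backend="nccl", init_method="env://")`. Every node needs the same `MASTER_ADDR`, and that address has to be routable from the other hosts.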

  				
Posted 2 years ago by ScaryJellyfish75

@SubstantialElk6 you can always have your code get the IP, save it in the task metadata (user properties, for example), and query all other tasks that share some identical tag for their IPs.
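A rough sketch of that idea, assuming the ClearML SDK calls `Task.set_user_properties`, `Task.get_user_properties`, and `Task.get_tasks`. The property names `node_ip` and `node_role` and all helper names are made up for illustration. The `clearml` import is done lazily so that the pure helpers can be exercised without a configured server.

```python
def publish_ip(task, ip: str, role: str) -> None:
    """Store this node's IP and role as user properties on its ClearML task."""
    task.set_user_properties(node_ip=ip, node_role=role)


def _value(prop):
    # User properties may come back as plain values or as dicts with a "value" key.
    return prop.get("value") if isinstance(prop, dict) else prop


def find_master_ip(tasks):
    """Return the IP published by the first task whose role is 'master', else None."""
    for t in tasks:
        props = t.get_user_properties()
        if _value(props.get("node_role")) == "master":
            return _value(props.get("node_ip"))
    return None


def lookup_master_ip(project_name: str, tag: str):
    """Query ClearML for tasks sharing a tag and read the master's IP from them."""
    from clearml import Task  # lazy import; needs a configured ClearML environment
    return find_master_ip(Task.get_tasks(project_name=project_name, tags=[tag]))
```

The master would call `publish_ip` before it blocks waiting for its peers, and the workers would poll `lookup_master_ip` until it returns a value, so nobody has to wait for the master task to finish. Using a tag that is unique per run avoids picking up the IP of a master from an earlier job.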

  				
Posted 2 years ago by SuccessfulKoala55

Thanks. The challenge we encountered is that we only expose our Devs to the ClearML queues, so users have no idea what's beyond the queue except that it will offer them the resources associated with the queue. In the backend, each queue is associated with more than one host.

So what we tried is as follows.
We created a train.py script much like what Tobias shared above. In this script, we use the socket library to get the IP address.

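import socket
# Resolve this machine's hostname to an IP address.
# Inside a Docker container this yields the container's IP, not the host's.
hostname = socket.gethostname()
ipaddr = socket.gethostbyname(hostname)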

The above script is then used to create a ClearML Task.

Then we create a ClearML pipeline that looks as follows, with every step created from the same task.

            |-- taskslave1
taskmaster--|-- taskslave2
            |-- taskslave3

The ipaddr from the master task is expected to be retrieved and passed to the slave tasks as an argument.

Two problems come up when running the pipeline:

1. Taskmaster is actually waiting to sync with the configured number of nodes, so it's not returning, and in turn the IP address cannot be passed on to the slave nodes.
2. The IP address pulled is actually the Docker container's IP, which cannot be pinged from another host.
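One way to confirm the second problem from inside a worker task is to test whether the master's rendezvous port accepts TCP connections before starting training. This is a minimal standard-library sketch; the function name `can_reach` is an assumption made for illustration.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

A worker could call, for example, `can_reach(master_ip, 29500)` (the port is hypothetical) and fail fast with a clear message instead of hanging in the rendezvous. A Docker bridge address such as the one described above would typically make this return False from another host.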

  				
Posted 2 years ago by SubstantialElk6

IMHO ClearML would just start the execution on multiple hosts. Keep in mind that the hosts need to be on the same LAN and have very high bandwidth between them.

What you are looking for is called "DistributedDataParallel". Maybe this tutorial gives you a starting point: [link missing]

  				
Posted 2 years ago by ScaryJellyfish75

From ClearML's perspective, how would we enable this, considering we don't have direct control over the agents or even know their IPs?

  				
Posted 2 years ago by SubstantialElk6

Well, if you need an external IP, you'll probably want to configure the Docker params to use the host network.
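With host networking (Docker's `--network host`), the container shares the host's network stack, so the training code can ask the OS which local address it would use for outbound traffic. The sketch below uses the common UDP "connect" trick; the function name `outbound_ip` and the default probe address are assumptions, and on an isolated cluster the probe address should be one that is routable on your LAN.

```python
import socket


def outbound_ip(probe_addr: str = "8.8.8.8") -> str:
    """Return the local IP the OS would use to reach probe_addr.

    Connecting a UDP socket sends no packets; it only makes the kernel pick a route.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((probe_addr, 80))
        return s.getsockname()[0]
    except OSError:
        # No route available (for example, an isolated sandbox): fall back to loopback.
        return "127.0.0.1"
    finally:
        s.close()
```

Inside a container on the default bridge network this still returns the container's address, so it only helps in combination with host networking.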

  				
Posted 2 years ago by SuccessfulKoala55

Yeah, the issue is that ClearML is unable to talk to the nodes, because PyTorch distributed needs to know their IPs. There is some sort of integration missing that would enable this.

  				
Posted 2 years ago by SubstantialElk6
