Does ClearML Have The Ability To Run A Single Experiment Across Multiple Nodes/GPUs In A K8s Cluster?

Answered

Does ClearML have the ability to run a single experiment across multiple nodes/GPUs in a k8s cluster?

  				
Posted 3 years ago by BoredHedgehog47


Answers 14

Okay, so basically the DL framework manages the master/worker relationship. I just need to use pod replicas for my k8s agents.

  				
Posted 3 years ago by BoredHedgehog47

In addition to an EFS mount

  				
Posted 3 years ago by BoredHedgehog47

Is your K8s on-prem or over cloud?

  				
Posted 3 years ago by CostlyOstrich36

Exactly!

  				
Posted 3 years ago by AgitatedDove14

AWS. I've set up shared memory between the k8s nodes.

  				
Posted 3 years ago by BoredHedgehog47

Actually, this is the default behavior for any multi-node training framework (torch DDP, OpenMPI, etc.).

  				
Posted 3 years ago by AgitatedDove14
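
As an illustration of the reply above, here is a minimal sketch of how a training script might discover its place in a multi-node job. It assumes the job is started by torchrun (which sets RANK and WORLD_SIZE) or by Open MPI's mpirun (which sets OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE). The helper name detect_rank is hypothetical and is not part of ClearML or either framework.

```python
import os
from typing import Mapping, Optional, Tuple

# (rank variable, world-size variable) pairs set by common launchers.
_LAUNCHER_VARS = (
    ("RANK", "WORLD_SIZE"),                            # torchrun / torch.distributed
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # Open MPI (mpirun)
)


def detect_rank(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Return (rank, world_size); fall back to (0, 1) for a single process."""
    env = os.environ if env is None else env
    for rank_var, size_var in _LAUNCHER_VARS:
        if rank_var in env and size_var in env:
            return int(env[rank_var]), int(env[size_var])
    return 0, 1


if __name__ == "__main__":
    rank, world_size = detect_rank()
    print(f"process {rank} of {world_size}")
```

Every replica runs this same code. The launcher, not the script, decides which process gets which rank.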

SuccessfulKoala55 might have more insight on setting up the K8s integration 🙂

  				
Posted 3 years ago by CostlyOstrich36

As they are singular not plural

  				
Posted 3 years ago by BoredHedgehog47

"it seems like each task is set up to run on a single pod/node based on attributes like gpu memory, os, num of cores, worker"

BoredHedgehog47, of course you can scale across multiple nodes.
The way to do that is to create a k8s YAML with replicas. Each pod runs the exact same code with the exact same setup. Note that inside the code itself the DL framework processes need to be able to communicate with one another, and by definition only the "master" one does all the reporting.
That said, from the ClearML perspective you will see a single Task.
I'm not sure you will be able to see the WORLD_SIZE value in the Info tab, but at least in theory you should.

  				
Posted 3 years ago by AgitatedDove14
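
To make the "only the master does the reporting" idea from the reply above concrete, here is a rough sketch. It assumes the global rank is exposed through the RANK environment variable, as torchrun does. The names is_master and report_scalar, and the sink callback, are hypothetical stand-ins and not ClearML APIs.

```python
import os
from typing import Callable, Mapping, Optional


def is_master(env: Optional[Mapping[str, str]] = None) -> bool:
    """True for the process with global rank 0 (or when RANK is unset)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def report_scalar(
    name: str,
    value: float,
    step: int,
    sink: Callable[[str, float, int], None],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Forward a metric to `sink` only on the master; return whether it was sent."""
    if not is_master(env):
        return False  # worker replicas stay silent
    sink(name, value, step)
    return True


if __name__ == "__main__":
    # Stand-in sink that just prints; a real script would forward to its logger.
    report_scalar("loss", 0.5, 1, lambda n, v, s: print(f"{n}={v} @ step {s}"))
```

In a real script the sink could wrap ClearML's Logger.report_scalar. Whether only rank 0 should call Task.init in a given setup is an assumption worth checking against the ClearML documentation.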

When I click on a task's details -> Info tab, it seems like each task is set up to run on a single pod/node, based on attributes like gpu memory, os, num of cores, worker

  				
Posted 3 years ago by BoredHedgehog47

Hi BoredHedgehog47 , I think there was some confusion here - you can run on a pod with multiple GPUs, but you can't run a single task on multiple nodes

  				
Posted 3 years ago by SuccessfulKoala55

AgitatedDove14 How do I set up a master task to do all the reporting?

  				
Posted 3 years ago by BoredHedgehog47

Hi BoredHedgehog47 , yes it can. You would obviously need to set it up first 🙂

  				
Posted 3 years ago by CostlyOstrich36

SuccessfulKoala55 Darn, so I can only scale vertically?

  				
Posted 3 years ago by BoredHedgehog47
