Heya, Is There Any Plan For Clearml To Leverage The New

Answered

Heya, is there any plan for ClearML to leverage the new https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/ tech introduced with the A100 TPU, so we can config a ClearML agent deployed on a machine equipped with an A100 to count as an arbitrary number of workers and dispatch tasks in the queue to multiple GPU instances of the same machine ?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					FierceHamster54
				
					0
					 × 1

Votes Newest

Answers 7

I think it's supposed to be out early Nov 🙂

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I could improve the cost-efficiency of my provisionned GCP A100 instances

But their pricing is linear, if you do not need a100 get a cheaper instance ?! no?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hi FierceHamster54
This is already supported, unfortunately the open-source version only supports static allocation (i.e you can spin multiple agents and connect each one to specific number of GPUs), the dynamic option (where you have single agent allocating jobs to multiple GPUs / Slices is only part of the enterprise edition
(there is the hidden assumption there that if you spent so much on a DGX you are probably not a small team 🙂 )

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hey, I'm a SaaS user in PRO tier and I was wondering if it was a feature available on the auto-scaler apps so I could improve the cost-efficiency of my provisionned GCP A100 instances

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					FierceHamster54
				
					0
					 × 1

There is a gap in the GPU offer on GCP and there is no modern middle-ground for a TPU with more than 16GB GRAM and less than 40GB, so sometime we need to provision a A100 to get the training speed we want but we don't use all the GRAM so I figured out if we could batch 2 training tasks on the same A100 instance we would still be on the winning side in term of CUDA cores and getting the most of the GPU-time we're paying.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					FierceHamster54
				
					0
					 × 1

Oh wow, would definitely try it out if there were an Autoscaler App integrating it with ClearML

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					FierceHamster54
				
					0
					 × 1

for a TPU with more than 16GB GRAM and less than 40GB, so sometime we need to provision a A100 to get the training speed we want but we don't use all the GRAM

Oh that makes sense...
Just saw this one, this might help?
https://www.globenewswire.com/news-release/2022/10/24/2539924/0/en/ClearML-and-Genesis-Cloud-Announce-New-MLOps-Partnership-Delivering-100-Green-Energy-Compute-Solution-for-Machine-Learning.html

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Write your answer

2K Views

7 Answers

3 years ago

2 years ago