Answered
Hi ClearML, does ClearML orchestration have the ability to break GPU devices into virtual ones?


  
  
Posted 2 years ago

Answers 9


Hi BattyLizard6

does clearml orchestration have the ability to break gpu devices into virtual ones?

So this is fully supported on the A100 with MIG slices. That said, dynamic multi-tenant GPU sharing on Kubernetes is a Kubernetes issue... We do support multiple agents on the same GPU on bare metal (see the sketch after these links), or shared GPU instances on k8s with:
https://github.com/nano-gpu/nano-gpu-agent
https://github.com/intel/intel-device-plugins-for-kubernetes/tree/main/cmd/gpu_plugin#fractional-resources
https://github.com/NTHU-LSALAB/KubeShare
https://github.com/AliyunContainerService/gpushare-scheduler-extender
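
For the MIG route, here is a minimal sketch (assuming nvidia-ml-py is installed and MIG mode is already enabled on the card, e.g. via nvidia-smi) that lists the MIG slice UUIDs; each UUID can be used as a CUDA_VISIBLE_DEVICES value to pin one agent per slice:

```python
# Sketch: enumerate MIG slices with nvidia-ml-py (pynvml), assuming
# MIG mode was already enabled on an A100-class GPU.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            current, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
        except pynvml.NVMLError:
            continue  # GPU does not support MIG (pre-Ampere)
        if current != pynvml.NVML_DEVICE_MIG_ENABLE:
            continue
        for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
            except pynvml.NVMLError:
                continue  # this MIG slot is not instantiated
            # "MIG-..." UUIDs are valid CUDA_VISIBLE_DEVICES values
            print(f"GPU {i} slice {j}: {pynvml.nvmlDeviceGetUUID(mig)}")
finally:
    pynvml.nvmlShutdown()
```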

  
  
Posted 2 years ago

Hi, do you mean out-of-the-box virtualization of your GPU, or using virtual GPUs on the machine?

  
  
Posted 2 years ago

So basically development on a "shared" GPU?

  
  
Posted 2 years ago

We want to have many people working on a cluster of machines, and we want to be able to allocate a fraction of a GPU to specific jobs, to avoid starvation

  
  
Posted 2 years ago

Sure thing, any specific reason for asking about multiple pods per GPU?
Is this for a remote development process?
BTW: the funny thing is, on bare metal machines running multiple agents on one GPU works out of the box, and deploying it with bare metal clearml-agents is very simple
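To make the bare metal point concrete, a minimal sketch (the queue names are hypothetical placeholders, and it assumes clearml-agent is installed and configured) that starts two agents pinned to the same physical GPU:

```python
# Sketch: two clearml-agent daemons sharing physical GPU 0.
# The queue names "dev_a" / "dev_b" are hypothetical placeholders.
import subprocess

for queue in ("dev_a", "dev_b"):
    subprocess.Popen(
        ["clearml-agent", "daemon",
         "--queue", queue,  # each developer enqueues to their own queue
         "--gpus", "0"])    # both agents share the same physical card
```

Both workers will pull jobs concurrently onto the same card; note that nothing enforces a memory split between them (see the memory caveat below).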

  
  
Posted 2 years ago

What is your use case?

  
  
Posted 2 years ago

BattyLizard6 to my knowledge the main issue with fractional GPUs is that there is no real restriction on GPU memory allocation (with the exception of MIG slices, which are limited in other ways).
Basically, one process/container can consume the maximum GPU RAM on the allocated card (this also includes the http://run.ai fractional solution, at least from what I understand).
This means that developer A can allocate memory so that developer B on the same GPU will start getting out-of-memory errors
(Notice that in a few k8s solutions you can ask for a specific amount of GPU RAM, but at runtime there are no actual restrictions)
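
To illustrate the "no actual restrictions" point: the caps that do exist are cooperative, i.e. each process has to opt in itself. A sketch of what that looks like in PyTorch, assuming two jobs agree to split one card:

```python
# Sketch: a cooperative (opt-in) GPU memory cap in PyTorch.
# Nothing at the driver or k8s level enforces this; each job must set it.
import torch

if torch.cuda.is_available():
    # Allow this process to allocate at most ~50% of device 0's memory.
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)
    # Allocations beyond the cap raise an OOM error in *this* process,
    # but a process that never sets the cap can still grab the whole card.
```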

  
  
Posted 2 years ago

Ok - thanks AgitatedDove14

  
  
Posted 2 years ago

Hi, I mean something like what run.ai is doing, or how would you work together with http://run.ai ?

  
  
Posted 2 years ago