This obviously doesn't stop bad actors from assigning (say) an alice job over to bobs-queue, but I am not sure of any way to solve that.
I would think that a combination of Kubernetes (I believe the preferred way to support multiple users at once, but open to being wrong) and individual queues is probably the solution here.
For example, in Kubernetes you could set up an agent to listen to bob-queue and another agent to listen to alice-queue. In the Kubernetes dashboard you could assign a certain amount of CPU/memory to each agent and, if using taints, a GPU or not.
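A rough sketch of the queue side, assuming ClearML's APIClient (queue names follow the example above). Each queue would then get its own agent Deployment in Kubernetes, running `clearml-agent daemon --queue alice-queue` (or bob-queue), with the Deployment's resource limits and tolerations handling the CPU/memory/GPU part:

```python
from clearml.backend_api.session.client import APIClient

# One queue per user; each Kubernetes agent deployment listens to exactly one.
client = APIClient()
for queue_name in ("alice-queue", "bob-queue"):
    client.queues.create(name=queue_name)
```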
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space) for the task when queuing the task, and have the agents pick up these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation?
Hey AgitatedDove14,
The way clearml is designed is to have queues abstract different types of resources.
Configuring multiple queues and multiple agents based on the resources can be a solution for many use-cases. But when the instances are non-homogeneous, there can be too many combinations of resources based on no. of GPUs, no. of cores, disk space, etc. that work for various workloads. I’m thinking that creating that many agents and queues can get messy to manage, as well as for the users who have to choose the right queue.
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space) for the task when queuing the task, and have the agents pick up these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Hi CharmingPuppy6
Basically yes, there is.
The way clearml is designed is to have queues abstract different types of resources. For example, a queue for single GPU jobs (let's name it "single_gpu") and a queue for dual GPU jobs (let's name it "dual_gpu").
Then you spin up agents on machines and have the agents pull jobs from specific queues based on the hardware they have. For example, we can have a 4-GPU machine with 3 agents: one agent connected to 2xGPUs and pulling Tasks from the "dual_gpu" queue, and two more agents, each connected to a single GPU, pulling Tasks from the "single_gpu" queue. The same idea also scales if you connect it to a Kubernetes cluster instead of bare-metal / cloud.
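To make that concrete, a minimal sketch (the task ID is a hypothetical placeholder): the three agents on the 4-GPU machine would be started with `clearml-agent daemon --queue dual_gpu --gpus 0,1`, `clearml-agent daemon --queue single_gpu --gpus 2` and `clearml-agent daemon --queue single_gpu --gpus 3`, and on the user side enqueuing is just:

```python
from clearml import Task

# The user only picks the resource-named queue; GPU pinning is the agents' job.
template = Task.get_task(task_id="<template-task-id>")  # hypothetical ID
job = Task.clone(source_task=template, name="two-gpu training")
Task.enqueue(job, queue_name="dual_gpu")
```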
Does that answer the question?
The idea of queues is, on the one hand, not to let the users have too much freedom, and on the other, to allow for maximum flexibility & control.
The granularity offered by K8s (and as you specified) is sometimes way too detailed for a user. For example, I know I want 4 GPUs, but 100GB disk-space? No idea, just give me 3 levels to choose from (if any; actually I would prefer a default that is large enough, since this is by definition for temp cache only). And the same argument goes for the number of CPUs.
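Something like this sketch of the "few levels" idea (the tier names, queue names and values are all made-up assumptions):

```python
# Each preset tier maps to a queue with fixed resources, so the user never
# sees raw K8s resource requests. All names and numbers here are illustrative.
RESOURCE_TIERS = {
    "small":  {"queue": "cpu_only",   "gpus": 0, "cpus": 8,  "disk_gb": 100},
    "medium": {"queue": "single_gpu", "gpus": 1, "cpus": 16, "disk_gb": 250},
    "large":  {"queue": "quad_gpu",   "gpus": 4, "cpus": 32, "disk_gb": 500},
}
```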
CharmingPuppy6 wdyt?
CharmingPuppy6 These threads may also be interesting for you: https://clearml.slack.com/archives/CTK20V944/p1614867532303700 https://clearml.slack.com/archives/CTK20V944/p1617963053397600
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space)
This will be quite easy to implement using the clearml k8s glue: just use user-properties and change the template based on them. I can point you to where you need to modify the code
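On the task side that could look something like this (a sketch only; the property names and queue name are arbitrary choices, not a ClearML convention, and the glue side would read them back with `task.get_user_properties()` to fill in the pod template):

```python
from clearml import Task

# Attach the resource request to the task as user properties; the k8s glue
# can read them back and render the pod template accordingly.
task = Task.get_task(task_id="<your-task-id>")  # hypothetical ID
task.set_user_properties(gpus="4", cpu_cores="12", disk_gb="100")
Task.enqueue(task, queue_name="k8s_glue_queue")
```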
Can I assume we are talking Kubernetes under the hood for the resource allocation?
yes
The granularity offered by K8s (and as you specified) is sometimes way too detailed for a user. For example, I know I want 4 GPUs, but 100GB disk-space? No idea, just give me 3 levels to choose from (if any; actually I would prefer a default that is large enough, since this is by definition for temp cache only). And the same argument goes for the number of CPUs.
While I agree that over-detailing makes the user experience more cumbersome, I’d want to leave it to the user to choose the level of granularity, and set the rest of the parameters to default values or leave them out of the scheduling considerations. For example, a user might have a task with some CPU-intensive parts and want to be able to choose the right set of resources.
I can definitely see your point from the "DevOps" perspective, but from the user perspective it puts the "liability" on me to "optimize" the resources, which to me sounds a bit much to put on my tiny shoulders; I just have a general idea of what I need. For example, lots of CPUs (because I know my process scales well with more CPUs), or large memory (because I have an entire dataset in memory). Personally (and really only my personal perspective), I'd rather have the option to select from a limited list of options (which can amount to a list of queues).
Now from the DevOps perspective, that means spinning up multiple k8s glues (each with a different template YAML) and connecting each one to a different queue, while naming the queues after the actual resources.
For example, the following queues are quite self-explanatory:
24cpu_128gb_hd1tb, 48cpu_128gb_hd1tb, 4v100_32cpu_256gb_hd2tb
This means that as DevOps I can put a max limit per queue/resource and always have the ability to add more (just my 2 cents).
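Roughly, the mapping a DevOps admin would maintain could look like this (a sketch; the template file paths are made up, and each entry would get its own k8s glue instance):

```python
# One k8s glue instance per resource-named queue, each rendering pods from its
# own template. Queue names follow the examples above; paths are illustrative.
QUEUE_TO_POD_TEMPLATE = {
    "24cpu_128gb_hd1tb":       "templates/cpu24_mem128_hd1tb.yaml",
    "48cpu_128gb_hd1tb":       "templates/cpu48_mem128_hd1tb.yaml",
    "4v100_32cpu_256gb_hd2tb": "templates/gpu4v100_cpu32_mem256_hd2tb.yaml",
}
```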
This will be quite easy to implement using the clearml k8s glue: just use user-properties and change the template based on them. I can point you to where you need to modify the code
I’m pretty new to this, so it’d be great if you could do that.
ReassuredTiger98 thanks for sharing those threads. I found them very insightful.