On A Related Line But More Complicated: How Can We Ask The Autoscaler To Queue, Say, N Jobs On An N-Gpu Machine, Please? For Example, On Aws, Nvidia A100 Gpus Are Only Available On Instances With 8X A100, Which Is Overkill For A Single-Gpu Job, So Might A

Answered

On a related line but more complicated: how can we ask the Autoscaler to queue, say, N jobs on an N-GPU machine, please? For example, on AWS, NVIDIA A100 GPUs are only available on instances with 8x A100, which is overkill for a single-GPU job, so we might as well use that instance for other jobs too.
(And yes, it does raise the question of optimal packing/scheduling, so it might be a complicated can of worms.)
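
To make the packing question concrete, here is a minimal sketch of a first-fit-decreasing heuristic that groups jobs onto 8-GPU instances. It is only an illustration of the scheduling problem, not something the ClearML Autoscaler is known to do; the function name `pack_jobs`, the job names, and the GPU counts are all hypothetical, and first-fit decreasing is a heuristic rather than an optimal packer.

```python
from typing import Dict, List


def pack_jobs(job_gpus: Dict[str, int], gpus_per_instance: int = 8) -> List[List[str]]:
    """Group jobs onto instances with a first-fit-decreasing heuristic.

    job_gpus maps a (hypothetical) job name to the number of GPUs it needs.
    Returns one list of job names per instance.
    """
    free: List[int] = []        # free GPU slots left on each instance
    bins: List[List[str]] = []  # job names assigned to each instance

    # Place the largest jobs first; break ties by name for a stable result.
    for name, need in sorted(job_gpus.items(), key=lambda kv: (-kv[1], kv[0])):
        if need < 1 or need > gpus_per_instance:
            raise ValueError(f"job {name!r} needs {need} GPUs, which cannot fit")
        for i, slots in enumerate(free):
            if slots >= need:
                free[i] -= need
                bins[i].append(name)
                break
        else:
            # No existing instance has room, so a new one is needed.
            free.append(gpus_per_instance - need)
            bins.append([name])
    return bins
```

With eight single-GPU jobs this returns a single instance holding all eight; a ninth job would spill onto a second instance.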

  				
Posted 2 years ago by SolidGoose91


Answers 7

"Is it possible for two agents to be utilizing the same GPU?"

It is, as long as they do not limit one another memory-wise.
(If you are using k8s and ClearML Enterprise, it supports GPU slicing and dynamic memory allocation.)
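
As a rough illustration of the "memory-wise" condition, the sketch below checks whether the peak memory of several tasks fits on one GPU. The function name, the headroom value, and the numbers in the usage note are assumptions for illustration only; real peak usage has to be measured per task.

```python
from typing import Sequence


def fits_on_one_gpu(task_mem_gib: Sequence[float],
                    gpu_mem_gib: float,
                    headroom_gib: float = 2.0) -> bool:
    """Return True if the tasks' combined peak memory fits on the GPU.

    headroom_gib is an assumed safety margin for driver/context overhead.
    """
    return sum(task_mem_gib) + headroom_gib <= gpu_mem_gib
```

For example, two tasks peaking at 10 GiB and 12 GiB would pass this check on a 40 GiB card, while 30 GiB and 12 GiB would not.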

  				
Posted 2 years ago by AgitatedDove14

I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)

  				
Posted 2 years ago by BattyCrocodile47

@BattyCrocodile47

"Is that instance only able to handle one task at a time?"

You could have multiple agents on the same machine, each one with its own dedicated GPU, but you will not be able to change the allocation (e.g. "now I want 2 GPUs on one agent") without restarting the agents on the instance. In either case, this is for a "bare-metal" machine, and in the AWS autoscaler case, this goes under "dynamic" GPUs (see above).
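
A minimal sketch of the "one agent per GPU" setup described here, assuming a bare-metal machine with `clearml-agent` installed and configured. It builds one `clearml-agent daemon` command per GPU index using the CLI's `--queue`, `--gpus`, and `--detached` options; the helper names `build_agent_commands` and `launch_agents` are hypothetical, and the queue name "default" is taken from the thread.

```python
import subprocess
from typing import List


def build_agent_commands(num_gpus: int, queue: str = "default") -> List[List[str]]:
    """Build one clearml-agent command per GPU, all pulling from the same queue."""
    return [
        ["clearml-agent", "daemon", "--queue", queue, "--gpus", str(gpu), "--detached"]
        for gpu in range(num_gpus)
    ]


def launch_agents(commands: List[List[str]]) -> None:
    """Start each agent in turn; --detached lets each call return immediately."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Example (not run here): eight agents on an 8-GPU machine, one GPU each.
# launch_agents(build_agent_commands(8, "default"))
```

The GPU assignment is fixed when each agent starts, which is why changing the allocation means restarting the agents.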

  				
Posted 2 years ago by AgitatedDove14

"We could use our 8xA100 as 8 workers, for 8 single-GPU jobs running faster than on a single 1xV100 each."

@SolidGoose91 I think that in order to have that flexibility you need the "dynamic" GPU allocation, which is only part of the "enterprise" offering 😞
That said, why not allocate a single A100 machine, no?

  				
Posted 2 years ago by AgitatedDove14

My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?

Or can I start multiple instances of the clearml-agent process on it and then have one task per agent?

And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. "default"? Or would this only work if they were listening to different queues?

  				
Posted 2 years ago by BattyCrocodile47

Yes, it's pretty lame that a clearml-agent can only process one task at a time if it's not listening to a services queue 🤔

  				
Posted 2 years ago by BattyCrocodile47

@CostlyOstrich36 Any idea, please? We could use our 8xA100 as 8 workers, for 8 single-GPU jobs running faster than on a single 1xV100 each.

  				
Posted 2 years ago by SolidGoose91
