Answered
Is there an autoscaler equivalent for K8s? That is, a service that will launch pods based on incoming requests?

Is there an autoscaler equivalent for K8s? That is, a service that will launch pods based on incoming requests?

Posted one year ago

Answers 27


I think this is what you're looking for - the agent integration

Posted one year ago

Hi @<1523701083040387072:profile|UnevenDolphin73> , this is the K8s integration. You can find more in the repository.

Posted one year ago

Hey @<1523701070390366208:profile|CostlyOstrich36> , thanks for the reply!
I’m familiar with the above repo, we have the ClearML Server and such deployed on K8s.
What’s lacking is documentation regarding the clearml-agent helm chart. What exactly does it offer, etc.
We’re interested in e.g. using karpenter to scale our deployments per demand, effectively replacing the AWS autoscaler.

Posted one year ago

Anything else you’d recommend paying attention to when setting up the clearml-agent helm chart?

Posted one year ago

It’s usually needed for the autoscaler to decide when and how to scale up and down.

Posted one year ago

About clearml-agent: just set resources in basePodTemplate (CPU, GPU, RAM) so you will have a specific definition.
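A sketch of what that could look like in a values override for the clearml-agent chart — `agentk8sglue.basePodTemplate` is the key discussed in this thread; the queue name and resource figures are placeholders:

```yaml
# values.override.yaml -- illustration only; numbers are placeholders
agentk8sglue:
  queue: default             # queue this agent listens to (placeholder name)
  basePodTemplate:
    resources:               # explicit requests/limits for every task pod
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "8"
        memory: 32Gi
        nvidia.com/gpu: 1
```

With explicit requests like these, every task pod has a known resource footprint, which is exactly the information the cluster autoscaler uses to decide on scaling.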

Posted one year ago

In Enterprise we support multi-queueing, but that’s a different story.

Posted one year ago

Does it make sense to you to run several such glue instances, to manage multiple resource requirements?

Posted one year ago

Yes, I’ve found that too (as mentioned, I’m familiar with the repository). My issue is still that there is no documentation as to what this actually offers.
Is this simply a helm chart to run an agent on a single pod? Does it scale in any way? Basically: is it a simple agent (similar to on-premise agents, running in the background, but here on K8s), or is it a more advanced one that offers scaling features? What is it intended for, and how does it work?

The official documentation is very sparse about all of this, and only lists the variables one can tweak rather than explaining what it actually offers.

Posted one year ago

Maybe @<1523701827080556544:profile|JuicyFox94> can answer some questions then…

For example, what’s the difference between agentk8sglue.nodeSelector and agentk8sglue.basePodTemplate.nodeSelector ?
Am I correct in understanding that the former decides the node type that runs the “scaler” (listening to the given agentk8sglue.queue ), and the latter for any new booted instance/pod, that will actually run the agent and the task?
Read: The former can be kept lightweight, as it does no heavy computations, the latter should have bigger resources?

Posted one year ago

We’re using karpenter (more magic keywords for me), so my understanding is that it will manage the scaling part.

Posted one year ago

About nodeSelector you are right: one is for the agent pod, while the other is applied to the task pods it spawns.
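Side by side, the two settings from the question could look like this (the instance-type labels are hypothetical values, only there to show which pod each selector targets):

```yaml
agentk8sglue:
  # Selects the node for the lightweight "glue"/agent pod itself
  nodeSelector:
    node.kubernetes.io/instance-type: t3.medium      # hypothetical
  basePodTemplate:
    # Applied to every task pod the agent spawns -- the heavy lifting
    nodeSelector:
      node.kubernetes.io/instance-type: g4dn.xlarge  # hypothetical
```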

Posted one year ago

But... which queue does it listen to, which type of instances will it use, etc.?

Posted one year ago

@<1523701083040387072:profile|UnevenDolphin73> , basically, it scales to as many pods as you like. Very similar to the autoscaler but on top of K8s

Posted one year ago

Right, so where can one find documentation about it?
The repo just has the variables, without much explanation.

Posted one year ago

Much much appreciated 🙏

Posted one year ago

Just a couple of notes:

Posted one year ago

I think that's what's there. In the Scale & Enterprise version ClearML usually works together with customers to provide a glue layer for K8s or even SLURM

Posted one year ago

Yes exactly 👍 Good news.

Posted one year ago

About autoscaling: it’s a complex topic of platform management in this case. The ClearML glue simply spawns pods with the resources defined in the template.

Posted one year ago

It's all configured by the helm chart; it is the glue layer between K8s & ClearML.

Posted one year ago

On OSS it’s usually the only way: you run as many agent deployments as the queues you define.
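In practice that OSS pattern means one Helm release of the agent chart per queue, each with its own basePodTemplate; a sketch (release names, values files, and the chart reference are assumptions, not from this thread):

```
# Hypothetical: one clearml-agent release per queue, each with its own
# basePodTemplate in its values file
helm install agent-cpu clearml/clearml-agent -f values-cpu.yaml   # queue: cpu
helm install agent-gpu clearml/clearml-agent -f values-gpu.yaml   # queue: gpu
```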

Posted one year ago

This will make the autoscaler’s life easier, since it knows exactly how many resources you need.

Posted one year ago

You will probably need a metrics-server on your K8s cluster.
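If the cluster doesn’t have one yet, metrics-server can be installed from its upstream Helm chart — a sketch, with the repo URL and release name taken from the kubernetes-sigs metrics-server project:

```
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo update
helm upgrade --install metrics-server metrics-server/metrics-server \
  --namespace kube-system
```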

Posted one year ago

Perfect, thanks for the answers Valeriano. These small details are missing from the documentation, but I now feel much more confident setting this up.

Posted one year ago

(and each queue has its own basePodTemplate)

Posted one year ago

How your cluster reacts is a matter of scaling the infrastructure as much as needed (karpenter or any other cloud autoscaler should work).

Posted one year ago