In Order For A New Worker To Come Online In My K8s Cluster, Do I Need An EC2 Startup Script To Init The Agent/Config And Then Start The Daemon? Do I Have To Do This Manually, Or Is There A Better Way?

Answered

In order for a new worker to come online in my k8s cluster, do I need an EC2 startup script to init the agent/config and then start the daemon? Do I have to do this manually, or is there a better way?

  				
Posted 2 years ago by BoredHedgehog47


Answers 30

No, I'm not following. I'm pretty new to k8s, so this might be beyond my current knowledge. Maybe it will make more sense if I rephrase my goals. Essentially, I want to enqueue an experiment, pick a queue (gpu), and have a GPU EC2 node provisioned in response. The experiment is then initialized and executed on that new GPU EC2 node. When the work is completed, I want the GPU EC2 node to terminate after x amount of time.

  				
Posted 2 years ago by BoredHedgehog47

Yes, this is exactly how the ClearML k8s glue works. Note that the resource allocation (spinning nodes up/down) is done by k8s, which can sometimes take a while. If you only need "bare metal" nodes on the cloud, it might be more efficient to use the AWS autoscaler, which essentially does the same thing.

  				
Posted 2 years ago by AgitatedDove14

How do I setup the clearml k8s glue?

  				
Posted 2 years ago by BoredHedgehog47

So that it spins up nodes

  				
Posted 2 years ago by BoredHedgehog47

Would I copy and paste this block to produce another queue and k8 glue agent?

  				
Posted 2 years ago by BoredHedgehog47

https://github.com/allegroai/clearml-helm-charts/blob/main/charts/clearml-agent/values.yaml#L35-L101

  				
Posted 2 years ago by BoredHedgehog47

For instance, if I wanted the default queue and gpu queue that I create, how do I do that?

  				
Posted 2 years ago by BoredHedgehog47

https://github.com/allegroai/clearml-helm-charts/blob/main/charts/clearml-agent/values.yaml#L61

  				
Posted 2 years ago by BoredHedgehog47

Also how do I provide the k8 glue agent permissions to spin up/down ec2 nodes?

  				
Posted 2 years ago by BoredHedgehog47

Are you able to do a screen share to discuss this? I'm not sure I understand the purpose of the k8s glue agent.

  				
Posted 2 years ago by BoredHedgehog47

Made some progress getting the GPU nodes to provision, but got this error on my task:
K8S glue status: Unschedulable (0/4 nodes are available: 1 node(s) had taint {nvidia.com/gpu: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.)

  				
Posted 2 years ago by BoredHedgehog47

Okay fixed that taint restriction

  				
Posted 2 years ago by BoredHedgehog47
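The thread doesn't say how the taint restriction was fixed. One common approach for an "Unschedulable" error like the one above is to give the task pods a toleration for the GPU taint, plus a nodeSelector that matches the GPU nodes. A minimal sketch of such a pod-template fragment, built in Python and printed as JSON. The function name is hypothetical, the label is the one quoted later in this thread, and where the fragment belongs in the helm values depends on the chart version.

```python
import json


def gpu_pod_template(node_selector, taint_key="nvidia.com/gpu"):
    """Build a pod-template fragment that selects GPU nodes and tolerates their taint."""
    return {
        "nodeSelector": dict(node_selector),
        "tolerations": [
            # operator "Exists" with no "effect" tolerates every taint using this key,
            # whatever its value or effect (the error message does not show the effect).
            {"key": taint_key, "operator": "Exists"},
        ],
    }


template = gpu_pod_template({"purpose": "gpu-nvidia-t4-c8-m32-g1-od"})
print(json.dumps(template, indent=2))
```

Using `Exists` is the broad option. If the taint effect is known, a narrower toleration that also names the value and effect may be preferable.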

I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good

  				
Posted 2 years ago by BoredHedgehog47

My next question, how do I add more queues?

  				
Posted 2 years ago by BoredHedgehog47

> I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good

Nice!

> My next question, how do I add more queues?

You can create new queues in the UI and spin up a new glue agent for each queue (basically, think of a queue as an abstraction for a specific type of resource).
Does that make sense?

  				
Posted 2 years ago by AgitatedDove14

yes, I see in the UI how to create a new queue. How do I associate that queue with a nodeSelector though?

  				
Posted 2 years ago by BoredHedgehog47

Also, how do I associate that new queue with a worker?

  				
Posted 2 years ago by BoredHedgehog47

For example, in my agent helm yaml, I have
```yaml
queue: default

podTemplate:
  nodeSelector:
    purpose: gpu-nvidia-t4-c8-m32-g1-od
```

  				
Posted 2 years ago by BoredHedgehog47

How would I do something similar with a new queue?

  				
Posted 2 years ago by BoredHedgehog47

Basically, just change the queue name in the helm yaml:
`queue: my_second_queue_name_here`

  				
Posted 2 years ago by AgitatedDove14
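Following the "one glue agent per queue" advice, one way to avoid hand-editing the yaml for every queue is to generate one values fragment per queue and install the agent chart once per fragment. A minimal sketch, assuming the `queue` / `podTemplate` / `nodeSelector` keys from the yaml quoted earlier in the thread. Only the `default` entry comes from the thread; the other queue names and labels are made up, and where these keys sit inside the chart's values.yaml depends on the chart version.

```python
import json

# Queue name -> nodeSelector labels.
# Only "default" is taken from the thread; the other two entries are hypothetical.
QUEUES = {
    "default": {"purpose": "gpu-nvidia-t4-c8-m32-g1-od"},
    "gpu_large": {"purpose": "hypothetical-large-gpu-pool"},
    "cpu": {"purpose": "hypothetical-cpu-pool"},
}


def values_for_queue(queue, node_selector):
    """Return the per-queue values fragment as a plain dict."""
    return {"queue": queue, "podTemplate": {"nodeSelector": dict(node_selector)}}


def render_all(queues):
    """Render one JSON document per queue (JSON is also valid YAML)."""
    return {
        name: json.dumps(values_for_queue(name, selector), indent=2)
        for name, selector in queues.items()
    }


rendered = render_all(QUEUES)
for name, text in rendered.items():
    print(f"# values for queue: {name}")
    print(text)
```

Each rendered fragment would then back its own agent install, so every queue gets its own glue agent and its own pod template.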

So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?

  				
Posted 2 years ago by BoredHedgehog47

Also, I'd like to create the queues programmatically. Is that possible?

  				
Posted 2 years ago by BoredHedgehog47

In other words, I'd like to create 3 queues via helm install, each with its own podTemplate. Is this possible?

  				
Posted 2 years ago by BoredHedgehog47

> So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?

Same process.

> Also, I'd like to create the queues programmatically. Is that possible?

Yes, you can. You can also pass an argument that makes the agent create the queue if it does not already exist: just add `--create-queue` to the agent execution command line.

  				
Posted 2 years ago by AgitatedDove14
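To make the `--create-queue` suggestion concrete, here is a small sketch that assembles a `clearml-agent daemon` command line with the flag. The helper name is hypothetical and the `gpu` queue name is just the one mentioned earlier in the thread. Whether the glue agent container started by the helm chart accepts or forwards this flag is not confirmed here.

```python
import shlex


def agent_command(queue, create_queue=True):
    """Build the argument list for a clearml-agent daemon serving one queue."""
    cmd = ["clearml-agent", "daemon", "--queue", queue]
    if create_queue:
        # Ask the agent to create the queue if it does not exist yet.
        cmd.append("--create-queue")
    return cmd


cmd = agent_command("gpu")
print(shlex.join(cmd))  # clearml-agent daemon --queue gpu --create-queue
```

The list form can be passed to `subprocess.run` directly, and the printed string is what would go into a startup script.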

The agents are docker containers, how do I modify the startup script so it creates a queue? It seems like having additional queues beyond default is not handled by helm installs?

  				
Posted 2 years ago by BoredHedgehog47

> The agents are docker containers, how do I modify the startup script so it creates a queue?

Hmm, actually not sure about that, it might not be part of the helm chart.
So maybe the easiest is:

    from clearml.backend_api.session.client import APIClient
    c = APIClient()
    c.queues.create(name="new_queue")

  				
Posted 2 years ago by AgitatedDove14
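Building on the `APIClient` call above, a script that runs on every deploy would probably want to skip queues that already exist. A minimal sketch, assuming `queues.get_all(name=...)` returns objects with a `.name` attribute and may match names by pattern rather than exactly. `ensure_queues` and the in-memory stand-in are hypothetical helpers, included so the example runs without a ClearML server.

```python
from types import SimpleNamespace


def ensure_queues(client, names):
    """Create every queue in `names` that is missing; return the names created."""
    created = []
    for name in names:
        # Assumption: get_all(name=...) may match by pattern, so compare names exactly.
        matches = [q for q in client.queues.get_all(name=name) if q.name == name]
        if not matches:
            client.queues.create(name=name)
            created.append(name)
    return created


class InMemoryQueues:
    """Stand-in for `client.queues`; NOT part of ClearML, for illustration only."""

    def __init__(self, names):
        self._names = list(names)

    def get_all(self, name=None):
        # Substring match, to mimic a pattern-style lookup.
        return [SimpleNamespace(name=n) for n in self._names if name is None or name in n]

    def create(self, name):
        self._names.append(name)


fake_client = SimpleNamespace(queues=InMemoryQueues(["default", "gpu_large"]))
created = ensure_queues(fake_client, ["default", "gpu", "gpu_large"])
print(created)
```

Against a real server, the `client` argument would be an `APIClient()` instance in place of `fake_client`.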

Okay, seems like there are ways to do it, just need to be a bit clever

  				
Posted 2 years ago by BoredHedgehog47

Yep 🙂
Also maybe worth changing the entry point of the agent docker to always create a queue if it is missing?

  				
Posted 2 years ago by AgitatedDove14

yea, does the enterprise version have more functionality like this?

  				
Posted 2 years ago by BoredHedgehog47

> yea, does the enterprise version have more functionality like this?

Yes, all sorts of bits and pieces for easier DevOps / K8s etc.

  				
Posted 2 years ago by AgitatedDove14


2K Views
