For example, in my agent helm yaml, I have
```yaml
queue: default
podTemplate:
  nodeSelector:
    purpose: gpu-nvidia-t4-c8-m32-g1-od
```
For instance, if I wanted both the default queue and a gpu queue that I create, how would I do that?
Would I copy and paste this block to produce another queue and k8s glue agent?
Also, how do I associate that new queue with a worker?
No, I'm not tracking. I'm pretty new to k8s, so this might be beyond my current knowledge. Maybe if I rephrase my goals it will make more sense. Essentially I want to enqueue an experiment, pick a queue (gpu), and have a GPU EC2 node provisioned as a result; the experiment is then initialized on that new GPU EC2 node and executed. When the work is completed, I want the GPU EC2 node to terminate after x amount of time.
yes, I see in the UI how to create a new queue. How do I associate that queue with a nodeSelector though?
yea, does the enterprise version have more functionality like this?
yes, all sorts of bits and pieces for easier DevOps / K8s etc.
Also, how do I give the k8s glue agent permissions to spin EC2 nodes up and down?
The agents are docker containers, how do I modify the startup script so it creates a queue? It seems like having additional queues beyond default is not handled by helm installs?
Are you able to do a screenshare to discuss this? I'm not sure I understand the k8s glue agent's purpose.
yea, does the enterprise version have more functionality like this?
Yep 🙂
Also maybe worth changing the entry point of the agent docker to always create a queue if it is missing?
My next question, how do I add more queues?
I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good
Nice!
My next question, how do I add more queues?
You can create new queues in the UI and spin up a new glue agent for the queue (basically think of a queue as an abstraction for a specific type of resource)
Make sense?
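Something like this, roughly (a sketch that follows the same values layout as the snippet above; the `gpu_queue` name is just an example, and the node label value is copied from your earlier snippet):
```yaml
# values for a second k8s glue agent that serves its own GPU queue
queue: gpu_queue            # example name; create the queue in the UI first
podTemplate:
  nodeSelector:
    purpose: gpu-nvidia-t4-c8-m32-g1-od   # example label of the GPU node group
```
So each queue ends up with its own glue agent and pod template, i.e. one queue per type of resource.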
The agents are docker containers, how do I modify the startup script so it creates a queue?
Hmm actually not sure about that, might not be part of the helm chart.
So maybe the easiest is:
```python
from clearml.backend_api.session.client import APIClient

c = APIClient()
c.queues.create(name="new_queue")
```
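And if you want it to be idempotent, roughly like this (a sketch, assuming `queues.get_all` accepts a name pattern; the queue names are just examples):
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()

# example queue names; replace with whatever queues your glue agents serve
for queue_name in ["default", "gpu_queue"]:
    # only create the queue if one with that exact name doesn't exist yet
    if not client.queues.get_all(name=f"^{queue_name}$"):
        client.queues.create(name=queue_name)
```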
Okay, seems like there are ways to do it, just need to be a bit clever
So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?
Same process?!
Also I'd like to create the queues programmatically, is that possible?
Yes, you can. You can also pass an argument for the agent to create the queue if it does not already exist: just add `--create-queue` to the agent execution command line.
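Outside of the helm chart, the plain agent invocation would look roughly like this (a sketch; `gpu_queue` is just an example name):
```bash
clearml-agent daemon --queue gpu_queue --create-queue
```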
Made some progress getting the GPU nodes to provision, but got this error on my task:
K8S glue status: Unschedulable (0/4 nodes are available: 1 node(s) had taint {nvidia.com/gpu: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.)
Yes, this is exactly how the ClearML k8s glue works (notice that the resource allocation, i.e. spinning nodes up/down, is done by k8s, which can sometimes take a while). If you only need "bare metal nodes" on the cloud, it might be more efficient to use the AWS autoscaler, which essentially does the same thing.
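Regarding the Unschedulable error above: assuming the chart's podTemplate passes standard pod-spec fields through, adding a toleration for that taint next to the nodeSelector should let the pod land on the GPU nodes (a sketch only; the NoSchedule effect is a guess, since the error doesn't show it):
```yaml
podTemplate:
  nodeSelector:
    purpose: gpu-nvidia-t4-c8-m32-g1-od
  tolerations:
    - key: nvidia.com/gpu     # matches the taint reported in the error
      operator: Equal
      value: "true"
      effect: NoSchedule      # assumed effect; not shown in the error message
```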
So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?
Also I'd like to create the queues programmatically, is that possible?
Basically just change the helm yaml: `queue: my_second_queue_name_here`
I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good
In other words, I'd like to create 3 queues via helm install. Each queue has its own podTemplate
Is this possible?