Okay, seems like there are ways to do it, just need to be a bit clever
yea, does the enterprise version have more functionality like this?
yes, all sorts of bits and pieces for easier DevOps / K8s etc.
Would I copy and paste this block to produce another queue and k8 glue agent?
I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good
Nice!
My next question, how do I add more queues?
You can create new queues in the UI and spin up a new glue agent for each queue (basically, think of a queue as an abstraction for a specific type of resource)
Make sense?
Basically just change the queue name in the helm yaml: `queue: my_second_queue_name_here`
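For instance, a values override for a second agent install could look roughly like this (a sketch only: the queue name and the nodeSelector label below are placeholders, and the layout mirrors the queue / podTemplate block used for the default agent):
```
# sketch: values for a second clearml-agent install, listening on its own
# queue and targeting a different node group (all names are placeholders)
queue: my_second_queue_name_here
podTemplate:
  nodeSelector:
    purpose: my-second-gpu-node-group
```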
Are you able to do a screenshare to discuss this? I'm not sure I understand the purpose of the k8s glue agent.
Also how do I provide the k8 glue agent permissions to spin up/down ec2 nodes?
For example, in my agent helm yaml I have:
```
queue: default
podTemplate:
  nodeSelector:
    purpose: gpu-nvidia-t4-c8-m32-g1-od
```
No, I'm not tracking. I'm pretty new to k8s, so this might be beyond my current knowledge. Maybe if I rephrase my goals it will make more sense. Essentially, I want to enqueue an experiment, pick a queue (gpu), and have a gpu ec2 node provisioned for it; the experiment is then initialized on that new gpu ec2 node and executed. When the work is completed, I want the gpu ec2 node to terminate after x amount of time.
In other words, I'd like to create 3 queues via helm install. Each queue has its own podTemplate
Is this possible?
Made some progress getting the gpu nodes to provision, but got this error on my task's K8S glue status: Unschedulable (0/4 nodes are available: 1 node(s) had taint {nvidia.com/gpu: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.)
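Side note: the `had taint {nvidia.com/gpu: true}, that the pod didn't tolerate` part usually means the pod spec needs a matching toleration. A minimal sketch of what that could look like under the agent's podTemplate, assuming the chart passes tolerations through to the pod spec (the taint key/value come from the error above; the NoSchedule effect is an assumption):
```
podTemplate:
  tolerations:
    # key/value taken from the scheduler error; the effect is assumed to be
    # NoSchedule, check the actual taint on the GPU node
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
```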
Yep 🙂
Also maybe worth changing the entry point of the agent docker to always create a queue if it is missing?
For instance, if I wanted the default queue and a gpu queue that I create, how do I do that?
Also, how do I associate that new queue with a worker?
The agents are docker containers, how do I modify the startup script so it creates a queue? It seems like having additional queues beyond default is not handled by helm installs?
yes, I see in the UI how to create a new queue. How do I associate that queue with a nodeSelector though?
So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?
Same process?!
Also, I'd like to create the queues programmatically, is that possible?
Yes, you can. You can also pass an argument for the agent to create the queue if it does not already exist, just add --create-queue to the agent execution command line
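For example, something along the lines of `clearml-agent daemon --queue my_gpu_queue --create-queue` (the queue name here is just a placeholder).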
Yes, this is exactly how the clearml k8s glue works (note that the resource allocation, i.e. spinning nodes up/down, is done by k8s, which can sometimes take a while). If you only need "bare metal nodes" on the cloud, it might be more efficient to use the aws autoscaler, which essentially does the same thing.
The agents are docker containers, how do I modify the startup script so it creates a queue?
Hmm actually not sure about that, might not be part of the helm chart.
So maybe the easiest is:
```
from clearml.backend_api.session.client import APIClient

c = APIClient()
c.queues.create(name="new_queue")
```