Answered
In Order For A New Worker To Come Online In My K8 Cluster, Do I Need To Have An EC2 Startup Script Init The Agent/Config, And Then Start The Daemon? Do I Have To Do This Manually, Or Is There A Better Way?

In order for a new worker to come online in my k8s cluster, do I need an EC2 startup script that initializes the agent/config and then starts the daemon? Do I have to do this manually, or is there a better way?

  
  
Posted 2 years ago

Answers 30


Made some progress getting the GPU nodes to provision, but got this error on my task: K8S glue status: Unschedulable (0/4 nodes are available: 1 node(s) had taint {nvidia.com/gpu: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.)
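
For anyone hitting the same message: the first half says the GPU node carries a taint that the task pod has no toleration for. The sketch below is a simplified Python model of how Kubernetes matches tolerations against taints, not the scheduler's actual code. The `NoSchedule` effect is an assumption, since the message above does not show the taint's effect, and the helper names are made up for illustration.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint (simplified model)."""
    # An empty toleration effect matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False
    # An empty toleration key matches every taint key (only valid with Exists).
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    # "Exists" ignores the value; "Equal" (the default) compares values.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return bool(key) and toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(pod_tolerations, node_taints):
    """List the taints that would keep the pod off the node."""
    blocking = [t for t in node_taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return [t for t in blocking if not any(tolerates(tol, t) for tol in pod_tolerations)]


# Assumed effect: NoSchedule (not shown in the error message).
gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
gpu_toleration = {
    "key": "nvidia.com/gpu",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}

print(untolerated_taints([], [gpu_taint]))                # the GPU taint blocks the pod
print(untolerated_taints([gpu_toleration], [gpu_taint]))  # -> []
```

The second half of the message (node affinity/selector) is a separate check: the pod's nodeSelector also has to match a label on the GPU node.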

  
  
Posted 2 years ago

yes, I see in the UI how to create a new queue. How do I associate that queue with a nodeSelector though?

  
  
Posted 2 years ago

In other words, I'd like to create 3 queues via helm install, where each queue has its own podTemplate. Is this possible?

  
  
Posted 2 years ago

Yep 🙂
Also maybe worth changing the entry point of the agent docker to always create a queue if it is missing?

  
  
Posted 2 years ago

Would I copy and paste this block to produce another queue and k8 glue agent?

  
  
Posted 2 years ago

For example, in my agent helm yaml, I have
`queue: default

podTemplate:
  nodeSelector:
    purpose: gpu-nvidia-t4-c8-m32-g1-od`

  
  
Posted 2 years ago

How would I do similar with a new queue

  
  
Posted 2 years ago

Yes, this is exactly how the clearml k8s glue works. Notice that the resource allocation (spinning nodes up/down) is done by k8s, which can sometimes take a while. If you only need "bare metal nodes" on the cloud, it might be more efficient to use the AWS autoscaler, which essentially does the same thing.

  
  
Posted 2 years ago

So that it spins up nodes

  
  
Posted 2 years ago

Are you able to do screenshare to discuss this? I'm not sure I understand the k8 glue agent purpose.

  
  
Posted 2 years ago

Also, how do I associate that new queue with a worker?

  
  
Posted 2 years ago

My next question, how do I add more queues?

  
  
Posted 2 years ago

How do I set up the clearml k8s glue?

  
  
Posted 2 years ago

Okay fixed that taint restriction

  
  
Posted 2 years ago

yea, does the enterprise version have more functionality like this?

  
  
Posted 2 years ago

Basically just change the helm yaml
queue: my_second_queue_name_here
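
As a rough sketch of what "one glue agent per queue" could look like, the Python below writes one values file per queue, reusing only the keys that appear in the helm snippet earlier in this thread (`queue` and `podTemplate.nodeSelector.purpose`). The queue names, the `cpu-general` label, and the helper names are hypothetical, and the exact key layout depends on the chart version, so check it against the chart's own values file. The files are JSON, which Helm can read as values because JSON is valid YAML.

```python
import json
from pathlib import Path


def glue_values(queue, node_purpose):
    """Build per-queue values, mirroring the keys shown in the thread."""
    return {
        "queue": queue,
        "podTemplate": {"nodeSelector": {"purpose": node_purpose}},
    }


# Hypothetical mapping: queue name -> value of the node's "purpose" label.
QUEUES = {
    "default": "cpu-general",
    "gpu": "gpu-nvidia-t4-c8-m32-g1-od",
}


def write_values_files(queues, out_dir="."):
    """Write values-<queue>.json for every queue and return the paths."""
    paths = []
    for queue, purpose in queues.items():
        path = Path(out_dir) / f"values-{queue}.json"
        path.write_text(json.dumps(glue_values(queue, purpose), indent=2))
        paths.append(path)
    return paths
```

Calling `write_values_files(QUEUES)` produces one file per queue. Each file would then go to its own helm install of the agent chart under a distinct release name, so every queue gets a separate glue agent; a third queue is just one more entry in the mapping.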

  
  
Posted 2 years ago

The agents are docker containers, how do I modify the startup script so it creates a queue? It seems like having additional queues beyond default is not handled by helm installs?

  
  
Posted 2 years ago

I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good

  
  
Posted 2 years ago

So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?

  
  
Posted 2 years ago

No, I'm not tracking. I'm pretty new to k8s so this might be beyond my current knowledge. Maybe if I rephrase my goals it will make more sense. Essentially, I want to enqueue an experiment, pick a queue (gpu), and have a GPU EC2 node provisioned in response. The experiment is then initialized and executed on that new GPU EC2 node. When the work is completed, I want the GPU EC2 node to terminate after x amount of time.

  
  
Posted 2 years ago

For instance, if I wanted the default queue and gpu queue that I create, how do I do that?

  
  
Posted 2 years ago

Also how do I provide the k8 glue agent permissions to spin up/down ec2 nodes?

  
  
Posted 2 years ago

yea, does the enterprise version have more functionality like this?

yes, all sorts of bits and pieces for easier DevOps / K8s etc.

  
  
Posted 2 years ago

The agents are docker containers, how do I modify the startup script so it creates a queue?

Hmm actually not sure about that, might not be part of the helm chart.
So maybe the easiest is:
from clearml.backend_api.session.client import APIClient
c = APIClient()
c.queues.create(name="new_queue")
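
Building on that call, a small helper can create several queues and skip the ones that already exist. This is only a sketch: it assumes `client.queues.get_all()` returns objects with a `name` attribute, which is my understanding of the ClearML `APIClient`, and `ensure_queues` is a made-up name for illustration.

```python
def ensure_queues(client, names):
    """Create every queue in `names` that the server does not list yet.

    `client` is expected to behave like clearml's APIClient, i.e. expose
    `client.queues.get_all()` and `client.queues.create(name=...)`.
    Returns the names that were actually created.
    """
    # Assumption: each returned queue object has a `.name` attribute.
    existing = {queue.name for queue in client.queues.get_all()}
    created = []
    for name in names:
        if name not in existing:
            client.queues.create(name=name)
            existing.add(name)  # guards against duplicates within `names`
            created.append(name)
    return created
```

With a configured client this would be called as `ensure_queues(APIClient(), ["default", "gpu"])`, for example from a one-off setup script run before installing the agents.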

  
  
Posted 2 years ago

Also I'd like to create the queues programmatically, is that possible?

  
  
Posted 2 years ago

Okay, seems like there are ways to do it, just need to be a bit clever

  
  
Posted 2 years ago

So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?

Same process?!

Also I'd like to create the queues programmatically, is that possible?

Yes, you can. You can also pass an argument for the agent to create the queue if it does not already exist: just add --create-queue to the agent execution command line.

  
  
Posted 2 years ago

I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good

Nice!

My next question, how do I add more queues?

You can create new queues in the UI and spin a new glue for the queue (basically think of a queue as an abstraction for a specific type of resource)
Make sense?

  
  
Posted 2 years ago