This obviously doesn't stop bad actors from assigning (say) an alice job over to bobs-queue, but I am not sure of any way to solve that.
I would think that a combination of Kubernetes (I believe the preferred way to support multiple users at once, but open to being wrong) and individual queues is probably the solution here.
For example, in Kubernetes you could set up an agent to listen to bob-queue and another agent to listen to alice-queue. In the Kubernetes dashboard you could assign a certain amount of CPU/memory to each agent and, if using taints, a GPU or not.
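A rough sketch of the queue side, assuming ClearML's APIClient (queue names follow the example above). Each queue would then get its own agent Deployment in Kubernetes, running `clearml-agent daemon --queue alice-queue` (or bob-queue), with the Deployment's resource limits and tolerations handling the CPU/memory/GPU part:

```python
from clearml.backend_api.session.client import APIClient

# One queue per user; each Kubernetes agent deployment listens to exactly one.
client = APIClient()
for queue_name in ("alice-queue", "bob-queue"):
    client.queues.create(name=queue_name)
```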
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space) for the task when queuing the task, and have the agents pick up these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation?
Hey AgitatedDove14,
The way clearml is designed is to have queues abstract different types of resources.
Configuring multiple queues and multiple agents based on the resources can be a solution for many use-cases. But when the instances are non-homogeneous, there can be too many combinations of resources based on no. of GPUs, no. of cores, disk space, etc. that work for various workloads. I’m thinking that creating that many agents and queues can get messy to manage, as well as for the users who have to choose the right queue.
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space) for the task when queuing the task, and have the agents pick up these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Hi CharmingPuppy6
Basically yes, there is.
The way clearml is designed is to have queues abstract different types of resources. For example, a queue for single GPU jobs (let's name it "single_gpu") and a queue for dual GPU jobs (let's name it "dual_gpu").
Then you spin up agents on machines and have the agents pull jobs from specific queues based on the hardware they have. For example, we can have a 4-GPU machine with 3 agents: one agent connected to 2xGPUs and pulling Tasks from the "dual_gpu" queue, and two more agents, each connected to a single GPU, pulling Tasks from the "single_gpu" queue. The same idea also scales if you connect it to a Kubernetes cluster instead of bare-metal / cloud.
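To make that concrete, a minimal sketch (the task ID is a hypothetical placeholder): the three agents on the 4-GPU machine would be started with `clearml-agent daemon --queue dual_gpu --gpus 0,1`, `clearml-agent daemon --queue single_gpu --gpus 2` and `clearml-agent daemon --queue single_gpu --gpus 3`, and on the user side enqueuing is just:

```python
from clearml import Task

# The user only picks the resource-named queue; GPU pinning is the agents' job.
template = Task.get_task(task_id="<template-task-id>")  # hypothetical ID
job = Task.clone(source_task=template, name="two-gpu training")
Task.enqueue(job, queue_name="dual_gpu")
```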
Does that answer the question?
The idea of queues is, on the one hand, not to let the users have too much freedom, and on the other, to allow for maximum flexibility & control.
The granularity offered by K8s (and as you specified) is sometimes way too detailed for a user. For example, I know I want 4 GPUs, but 100GB disk-space? No idea, just give me 3 levels to choose from (if any; actually I would prefer a default that is large enough, since this is by definition for temp cache only). And the same argument goes for the number of CPUs.
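Something like this sketch of the "few levels" idea (the tier names, queue names and values are all made-up assumptions):

```python
# Each preset tier maps to a queue with fixed resources, so the user never
# sees raw K8s resource requests. All names and numbers here are illustrative.
RESOURCE_TIERS = {
    "small":  {"queue": "cpu_only",   "gpus": 0, "cpus": 8,  "disk_gb": 100},
    "medium": {"queue": "single_gpu", "gpus": 1, "cpus": 16, "disk_gb": 250},
    "large":  {"queue": "quad_gpu",   "gpus": 4, "cpus": 32, "disk_gb": 500},
}
```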
CharmingPuppy6 wdyt?
CharmingPuppy6 These threads may also be interesting for you: https://clearml.slack.com/archives/CTK20V944/p1614867532303700 https://clearml.slack.com/archives/CTK20V944/p1617963053397600
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space)
This will be quite easy to implement using the clearml k8s glue: just use user-properties and change the template based on them. I can point you to where you need to modify the code
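On the task side that could look something like this (a sketch only; the property names and queue name are arbitrary choices, not a ClearML convention, and the glue side would read them back with `task.get_user_properties()` to fill in the pod template):

```python
from clearml import Task

# Attach the resource request to the task as user properties; the k8s glue
# can read them back and render the pod template accordingly.
task = Task.get_task(task_id="<your-task-id>")  # hypothetical ID
task.set_user_properties(gpus="4", cpu_cores="12", disk_gb="100")
Task.enqueue(task, queue_name="k8s_glue_queue")
```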
Can I assume we are talking Kubernetes under the hood for the resource allocation?
yes
The granularity offered by K8s (and as you specified) is sometimes way too detailed for a user. For example, I know I want 4 GPUs, but 100GB disk-space? No idea, just give me 3 levels to choose from (if any; actually I would prefer a default that is large enough, since this is by definition for temp cache only). And the same argument goes for the number of CPUs.
While I agree that over-detailing makes the user experience more cumbersome, I’d want to leave it to the user to choose the level of granularity, and set the rest of the parameters to default values or leave them out of the scheduling considerations. For example, a user might have a task with some CPU-intensive parts and want to be able to choose the right set of resources.
I can definitely see your point from the "DevOps" perspective, but from the user perspective it puts the "liability" on me to "optimize" the resources, which to me sounds a bit much to put on my tiny shoulders; I just have a general idea of what I need. For example, lots of CPUs (because I know my process scales well with more CPUs), or large memory (because I have an entire dataset in memory). Personally (and really only my personal perspective), I'd rather have the option to select from a limited list of options (which can amount to a list of queues).
Now from the DevOps perspective, that means spinning up multiple k8s glues (each with a different template YAML) and connecting each one to a different queue, while naming the queues after the actual resources.
For example, the following queues are quite self-explanatory:
24cpu_128gb_hd1tb, 48cpu_128gb_hd1tb, 4v100_32cpu_256gb_hd2tb
This means that as DevOps I can put a max limit per queue/resource and always have the ability to add more (just my 2 cents).
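Roughly, the mapping a DevOps admin would maintain could look like this (a sketch; the template file paths are made up, and each entry would get its own k8s glue instance):

```python
# One k8s glue instance per resource-named queue, each rendering pods from its
# own template. Queue names follow the examples above; paths are illustrative.
QUEUE_TO_POD_TEMPLATE = {
    "24cpu_128gb_hd1tb":       "templates/cpu24_mem128_hd1tb.yaml",
    "48cpu_128gb_hd1tb":       "templates/cpu48_mem128_hd1tb.yaml",
    "4v100_32cpu_256gb_hd2tb": "templates/gpu4v100_cpu32_mem256_hd2tb.yaml",
}
```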
This will be quite easy to implement using the clearml k8s glue: just use user-properties and change the template based on them. I can point you to where you need to modify the code
I’m pretty new to this, so it’d be great if you could do that.
ReassuredTiger98 thanks for sharing those threads. I found them very insightful.