Yeah, I don’t necessarily want a traditional queuing system like in HPC clusters. I just want functional GPU management for users. As long as there is a job queue that works well, I can communicate the rest to the team for now.
I really want to avoid HPC craziness and keep things as simple as we can.
Ideally I would just have NVIDIA drivers and Docker on the on-prem nodes (along with the ClearML agents). Would that allow me to get by with basic job scheduling/queues through ClearML?
Totally within ClearML :the_horns: :the_horns:
Ideally I would just have NVIDIA drivers and Docker on the on-prem nodes (along with the ClearML agents). Would that allow me to get by with basic job scheduling/queues through ClearML?
Yes, this is fully supported and very easy to set up.
Regarding limiting users' usage: this is doable. I think the easiest solution, both for users and for management of the cluster, is introducing priority into the queues. Basically, any user can push jobs into a low-priority queue, and only some users can push into a higher-priority queue. This removes the need for a hardcoded limit and allows flexibility in terms of user resource usage. Obviously you can always rearrange jobs inside the queue.
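(Editor's sketch of the priority idea above. This is a toy model in plain Python, not ClearML code: the class `ToyScheduler`, the queue names `high` and `low`, and the user names are all made up for illustration. It assumes a worker that always drains the higher-priority queue first, and a simple allow-list deciding who may push to which queue.)

```python
from collections import deque


class ToyScheduler:
    """Toy model of two-level priority queues (hypothetical, not ClearML's API)."""

    def __init__(self, queue_order, allowed):
        # queue_order: queue names, highest priority first.
        # allowed: queue name -> set of users allowed to push;
        #          queues missing from the dict are open to everyone.
        self.order = list(queue_order)
        self.queues = {name: deque() for name in self.order}
        self.allowed = allowed

    def push(self, user, job, queue):
        permitted = self.allowed.get(queue)
        if permitted is not None and user not in permitted:
            raise PermissionError(f"{user} may not push to {queue}")
        self.queues[queue].append((user, job))

    def pop_next(self):
        # Scan queues in priority order; within a queue, jobs are FIFO.
        for name in self.order:
            if self.queues[name]:
                user, job = self.queues[name].popleft()
                return name, user, job
        return None  # nothing left to run


if __name__ == "__main__":
    sched = ToyScheduler(["high", "low"], allowed={"high": {"alice"}})
    sched.push("bob", "train_resnet", "low")
    sched.push("alice", "urgent_eval", "high")
    print(sched.pop_next())  # alice's job is served first despite arriving later
    print(sched.pop_next())
```

Nobody is hard-capped here: low-priority jobs still run whenever the high-priority queue is empty, which is the flexibility the message above describes.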
GrumpyPenguin23 and are priority queues something that exists in ClearML, or would that require some external queuing solution like SLURM?
AgitatedDove14 With the GPU job management, do you know if you can limit usage per user? That is, could I limit a certain user to using at most 2 GPUs at a time, or only particular GPUs, or something similar to that?
This is great AgitatedDove14 . Thanks for the info on job queuing. That is one of the main things we are trying to enable. Do you work at Allegro, or is there someone else here I could talk with about Enterprise? I’m interested in the user management and permissions side of things along with the Data layer. Depending of course on pricing (because we are a non-profit).
The short answer is "definitely yes", but to get maximum usage you will probably want to set up priority queues.
Hi there, Evangelist for ClearML here. What you are describing is a conventional provisioning solution such as SLURM. And that works with ClearML as well ☺ btw, as a survivor of such provisioning schemes, I don't think they are always worth it.
Hi GleamingGrasshopper63
How well can the MLOps component handle job queuing on a multi-GPU server?
This is fully supported 🙂
You can think of queues as a way to simplify resources for users (you can do more than that, but let's start simple).
Basically you can create a queue per type of GPU, for example a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then when you spin up the agents, per type of machine you attach the agent to the "correct" queue.
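(Editor's sketch of the queue-per-GPU-type setup above. It only builds and prints the agent launch commands; it does not start anything. The layout of a single 4-GPU server and the helper name `agent_commands` are assumptions for illustration. The `clearml-agent daemon` flags `--queue`, `--gpus` and `--docker` are used as I understand them from the ClearML agent docs, so check them against your installed version.)

```python
import shlex

# Hypothetical layout for one 4-GPU on-prem server:
# queue name -> list of GPU groups, one agent per group.
AGENT_LAYOUT = {
    "on_prem_1gpu": [[0], [1]],   # two agents, one GPU each
    "on_prem_2gpus": [[2, 3]],    # one agent owning two GPUs
}


def agent_commands(layout):
    """Return one clearml-agent launch command per GPU group."""
    commands = []
    for queue, gpu_groups in layout.items():
        for gpus in gpu_groups:
            gpu_arg = ",".join(str(g) for g in gpus)
            argv = ["clearml-agent", "daemon",
                    "--queue", queue,     # queue this agent pulls jobs from
                    "--gpus", gpu_arg,    # GPUs visible to this agent
                    "--docker"]           # run each job inside a container
            commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for cmd in agent_commands(AGENT_LAYOUT):
        print(cmd)
```

Users then only pick a queue name such as on_prem_2gpus when enqueuing a job, and whichever agent is attached to that queue runs it.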
Interested in the differences between Enterprise and community. Who would I talk with?
Well, I'm not sure I'm the guy for that, but I think the gist is: enterprise (or paid) adds security / permissions and expands the data management layer (basically adding a query layer to the datasets, just like a DB, only with versioning and links to files). Obviously hosting, support, etc., but I guess that is a given.