Hey AgitatedDove14 ,
The way clearml is designed, queues abstract different types of resources.
Configuring multiple queues and multiple agents based on the resources can be a solution for many use cases. But when the instances are non-homogeneous, there can be too many combinations of resources (number of GPUs, number of cores, disk space, etc.) that work for various workloads. I think creating that many agents and queues can get messy, both to manage and for the users who have to choose the right queue.
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB of disk space) with the task when enqueuing it, so that agents pick up these tasks only if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
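To make the idea concrete, here is a rough sketch of the matching logic I have in mind. This is not the clearml API: `Resources`, `Task`, and `pick_task` are hypothetical names I made up for illustration, and I'm assuming a single shared queue where each agent knows its own free resources.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource description (not a clearml type)."""
    gpus: int = 0
    cores: int = 0
    disk_gb: int = 0

    def satisfies(self, required: "Resources") -> bool:
        # An agent can run a task only if every requested amount is available.
        return (
            self.gpus >= required.gpus
            and self.cores >= required.cores
            and self.disk_gb >= required.disk_gb
        )


@dataclass
class Task:
    """Hypothetical task record carrying the user's stated requirements."""
    name: str
    required: Resources


def pick_task(queue: List[Task], available: Resources) -> Optional[Task]:
    """Remove and return the first queued task this agent can run, else None."""
    for index, task in enumerate(queue):
        if available.satisfies(task.required):
            return queue.pop(index)
    return None


# One shared queue; users only state what they need.
queue = [
    Task("train-large", Resources(gpus=4, cores=12, disk_gb=100)),
    Task("preprocess", Resources(gpus=0, cores=4, disk_gb=50)),
]

# A CPU-only agent skips the GPU task and takes the one it can serve.
small_agent = Resources(gpus=0, cores=8, disk_gb=200)
picked = pick_task(queue, small_agent)
print(picked.name if picked else "nothing runnable")
```

A first-fit scan like this is the simplest option. It lets small tasks jump ahead of large ones, so a real scheduler would probably also need some priority or anti-starvation rule.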