To summarize: The scheduler should first assign tasks to the agent that gives the queue the highest priority.
The issue here is that you assume both are idle and that you need global priority based on resource preference. I understand your scenario now, but it will only hold if the enqueuing order is "optimal". For example, if machine Y is running a Task B that is about to complete (e.g. in a minute), machine X will still pick up the new Task B, and again we end up in the scenario where Task A is waiting and machine Y is idle.
The solution you are looking for is global dynamic resource scheduling and moving jobs between resources. This is a very complicated task 🙂 and actually out of scope for ClearML. That said, you can check SLURM, which is the best HPC scheduling solution I'm aware of, and even there it will be hard to create a policy for such a scenario. The good news is that ClearML integrates with SLURM, so you could have SLURM run the scheduling and ClearML act as the "external interface". I have to warn you in advance, managing a SLURM cluster is challenging.
Sure thing 🙂
BTW: ReassuredTiger98 this is definitely an interesting use case, and I think you can actually write some code to solve it if you like.
Basically let's follow up on your setup:
- Machine X: agents listening to queues A and B_machine_a (*notice we have two agents here)
- Machine Y: agent listening to queue B_machine_b

Now we (the users) will push our jobs into queues A and B.
Now we have a service that does the following:
- See if we have a job in queue B.
- Check if machine Y is working; if not, pull the job from B and push it into B_machine_b.
- Else: check if machine X is working; if not, pull the job from B and push it into B_machine_a.

Now the easy solution is that you are that service, and you manually select the queue based on what you see in the "workers" page in the UI.
Notice that from the UI you can always move Tasks from one queue to another.
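The routing service described above could be sketched roughly like this. This is a plain-Python model, not ClearML code: the queues are in-memory deques, `route_once`, `ROUTES` and `is_working` are hypothetical names, and a real service would have to ask the ClearML server for the queue contents and the worker status instead. The extra check that a machine's own queue is empty is an assumption added here so that two tasks are not sent to the same idle machine before its agent picks up the first one.

```python
from collections import deque

# Hypothetical in-memory stand-ins; a real service would query the ClearML server.
queues = {"B": deque(), "B_machine_a": deque(), "B_machine_b": deque()}

# Preference order from the description: try machine Y first, then machine X.
ROUTES = [("Y", "B_machine_b"), ("X", "B_machine_a")]


def route_once(queues, is_working):
    """Move at most one task from queue B to the queue of an idle machine.

    is_working maps machine name -> True if it is currently running a task.
    Returns (task, target_queue), or None if nothing was moved.
    """
    if not queues["B"]:
        return None
    for machine, target in ROUTES:
        # Assumption: a machine with a task already waiting in its queue counts as taken.
        if not is_working[machine] and not queues[target]:
            task = queues["B"].popleft()
            queues[target].append(task)
            return task, target
    return None  # both machines are busy: leave the task in B


queues["B"].append("task_1")
# Machine Y is busy, so the task falls back to machine X's queue.
print(route_once(queues, {"X": False, "Y": True}))
```

Calling `route_once` in a loop with a short sleep would give the "service" behaviour; tasks stay in queue B until one of the machines frees up.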
No. Here is a better example. I have two types of workstations: Type X can execute tasks of type A and B. Type Y can execute tasks of type B. This could be the case if type X workstations have for example more VRAM, newer drivers, etc...
I have two queues. Queue A and Queue B. I submit tasks of type A to queue A and tasks of type B to queue B.
Here is what can happen:
1. Enqueue the first task of type B.
2. Workstation of type X will run this task.
3. Enqueue the second task of type A.
4. Workstation of type Y cannot execute it (and is not listening to queue A), so wait for the first task to finish.
5. Workstation of type X runs the second task.
Here is what should happen:
1. Enqueue the first task of type B.
2. Workstation of type Y will run this task.
3. Enqueue the second task of type A.
4. Workstation of type X will run the second task.
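The two outcomes above can be reproduced with a small simulation. This is only a sketch in plain Python with no ClearML calls; `LISTENS`, `simulate` and the task names are made up for illustration, and the only thing that differs between the two runs is which workstation happens to poll first.

```python
from collections import deque

# Hypothetical model: queues each workstation type listens to, in priority order.
LISTENS = {"X": ["A", "B"], "Y": ["B"]}


def simulate(poll_order):
    """Return {task: workstation} for every task that could start right away."""
    queues = {"A": deque(), "B": deque()}
    busy = {}      # workstation -> task it is running
    started = {}   # task -> workstation that picked it up

    def poll(ws):
        if ws in busy:
            return
        for q in LISTENS[ws]:
            if queues[q]:
                task = queues[q].popleft()
                busy[ws] = task
                started[task] = ws
                return

    # Enqueue the first task (type B), then let the workstations poll.
    queues["B"].append("task_B")
    for ws in poll_order:
        poll(ws)
    # Enqueue the second task (type A), then poll again.
    queues["A"].append("task_A")
    for ws in poll_order:
        poll(ws)
    return started


print(simulate(["X", "Y"]))  # X grabs task_B first, so task_A has to wait
print(simulate(["Y", "X"]))  # Y grabs task_B, X stays free for task_A
```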
An agent's queue priority translates to the order in which the agent will pull jobs from its queues.
Now let's assume we have two agents with priorities A,B for one and B,A for the other. If we only push a Task to queue A, and both agents are idle (implying queue B is empty), there is no guarantee which one will pull the job.
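As a rough illustration of that pull order (plain Python, not the actual agent implementation; `pull_next` and the task name are hypothetical), each agent simply takes a task from the first non-empty queue in its own list, so the preference only matters when more than one of its queues has work:

```python
from collections import deque


def pull_next(agent_queues, queues):
    """Pop a task from the first non-empty queue in the agent's priority order."""
    for name in agent_queues:
        if queues[name]:
            return name, queues[name].popleft()
    return None


queues = {"A": deque(["task_1"]), "B": deque()}
agent_1 = ["A", "B"]  # prefers queue A
agent_2 = ["B", "A"]  # prefers queue B

# Whichever idle agent polls first gets the task, regardless of its preference.
print(pull_next(agent_2, queues))  # ('A', 'task_1')
print(pull_next(agent_1, queues))  # None
```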
Does that make sense?
What is the use case you are trying to solve/optimize for?
I see. Thank you very much. For my current problem, assigning tasks according to queue priority would kinda solve it. For experimentation I will sometimes enqueue a task and then later enqueue another one of a different kind, but what happens is that even though this could be trivially solved, I will have to wait for the first one to finish. I guess this is only a problem for people with small "clusters" where SLURM does not make sense, but no scheduling at all is also suboptimal.
However, I see your point about it being out of scope! Thank you very much for explaining. 🙂
Yes, albeit not actually "intercept", as the user will still be able to put Tasks directly in queues B_machine_a/B_machine_b, but any time the user pushes Tasks into queue B, this service will pull them and push them to the individual machine's queue.
what do you think?