Is it possible in ClearML to somehow allocate resources so that, say, after running a number of Alice's tasks, Bob's tasks get processed (maybe in a round-robin fashion)?
Hi DeliciousBluewhale87
A few options here:
1. Set the agent with high/low priority queues. Make sure Alice pushes into the low-priority queue (e.g. for HPO jobs), then Bob can push into the high-priority queue when he needs to. This makes a lot of sense when you have automation processes spinning many experiments.
2. Expanding (1), you could set up different agents with different priorities (for example, low-compute machines pulling from the low-priority queue), etc. The idea is that per agent you can play around with priorities.
3. Per agent you can pass a flag --order-fairness, which will basically pull in a round-robin way from all the queues (instead of priority order). That means that if you have an Alice queue and a Bob queue, the agent will pull once from A, then once from B, then again from A, and so on. This will ensure that Bob gets a "fair" chance of executing a Task.
4. In the Enterprise edition there are actually quotas you can assign per user/group/queue, to limit Alice so she will not hog the queues 🙂
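For example (the queue names here are purely illustrative):
clearml-agent daemon --queue alice_q bob_q --order-fairness
With --order-fairness the agent alternates between alice_q and bob_q instead of draining them strictly in priority order.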
Hi AgitatedDove14, now we prefer to run dynamic agents instead, using python3 k8s_glue_example.py
In this case, is it still possible to pass --order-fairness at the queue level, or is this more of an Enterprise edition feature?
DeliciousBluewhale87 not in the open source version, for some reason it is not passed 😞
Could you explain the use case ?
The use case: let's say I run python k8s_glue_example.py --queue glue_q
And someone pushes a hyperparameter optimization job with over 100 experiments to the glue_q; one minute later, I push a simple training job to glue_q. But I will be forced to wait for the 100 experiments to finish.
DeliciousBluewhale87 You can have multiple queues for the k8s glue, in priority order:
python k8s_glue_example.py --queue glue_q_high glue_q_low
Then if someone is doing 100 experiments (say HPO), they push into the "glue_q_low" queue, which means the glue will first pop Tasks from the high-priority queue, and only if it is empty will it pop from the low-priority queue.
Does that make sense ?
python3 k8s_glue_example.py --queue glue_high_q glue_low_q
usage: k8s_glue_example.py [-h] [--queue QUEUE] [--ports-mode] [--num-of-services NUM_OF_SERVICES] [--base-port BASE_PORT] [--base-pod-num BASE_POD_NUM] [--gateway-address GATEWAY_ADDRESS] [--pod-clearml-conf POD_CLEARML_CONF] [--overrides-yaml OVERRIDES_YAML] [--template-yaml TEMPLATE_YAML] [--ssh-server-port SSH_SERVER_PORT] [--namespace NAMESPACE]
k8s_glue_example.py: error: unrecognized arguments: glue_low_q
Looking at the source code, it also seems it doesn't accept multiple queue arguments.
DeliciousBluewhale87 my apologies, you are correct 😞
We should probably add support for that, do you feel like adding a GitHub issue, so we do not forget?
GitHub issue: https://github.com/allegroai/clearml-agent/issues/50
AgitatedDove14, I have added the GitHub issue as requested. Thanks for the help. 👍
Thanks DeliciousBluewhale87 ! greatly appreciated 🙂
Hi AgitatedDove14, just about your reply on https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045 :
"Basically as jobs are pulled by order, they are pushed into the k8s, then if we hit the max k8s instance limit, we stop pulling jobs until a k8s job is completed, then we continue."
For this scenario:
k8s has an instance limit of 10 (let's say).
I run an Optimization (it has about 100 jobs), but only the first 10 will be pulled into k8s. After this, I run a single Deep Learning (DL) task.
After the 1st Optimization task finishes, what's the next job that will be pulled?
Question: the Optimization's 11th task, or the DL task?
After the 1st Optimization task finishes, what's the next job that will be pulled?
The one in the highest queue (if you have multiple queues)
If you use fairness, it will pull in round-robin from all queues (obviously, inside every queue it is based on the order of the jobs).
fyi, you can reorder the jobs inside the queue from the UI 🙂
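To make the two pull strategies concrete, here is a minimal Python sketch (illustrative only; pop_from stands in for the agent's internal queue API and is not actual ClearML code):

def pop_priority_order(queues, pop_from):
    # Default behavior: always try the highest-priority queue first;
    # lower queues are reached only when the higher ones are empty.
    for q in queues:
        task = pop_from(q)
        if task is not None:
            return task
    return None

def pop_round_robin(queues, pop_from):
    # --order-fairness behavior: rotate the queue list on every attempt,
    # so each queue (Alice's, Bob's, ...) gets a turn in round-robin order.
    for _ in range(len(queues)):
        q = queues.pop(0)
        queues.append(q)  # rotate so the next call starts at the next queue
        task = pop_from(q)
        if task is not None:
            return task
    return None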
DeliciousBluewhale87 wdyt?
AgitatedDove14 I am confused now. Isn't this feature unavailable in the k8s glue? Or is it going to be implemented?
It is currently only enabled when using ports mode; it should be enabled by default, i.e. it's a new feature :)
Is there any documentation on how we can use this ports mode? I didn't seem to find any. Tks
Hi DeliciousBluewhale87
Hmm, good question.
Basically, the idea is that if you have an ingress service on the pods (i.e. as part of the yaml template used by the k8s glue), you can specify to the glue what the exposed ports are, so it knows (1) the maximum number of instances it can spin, e.g. one per port, and (2) it will set the external port number on the Task, so that the running agent/code will be aware of the exposed port.
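For illustration, a sketch of what a ports-mode invocation might look like (the flags come from the usage output earlier in the thread; the queue name, port, and address values are made up):
python k8s_glue_example.py --queue sessions_q --ports-mode --num-of-services 10 --base-port 30000 --gateway-address 1.2.3.4
Presumably this caps the glue at 10 concurrent pods, one exposed port each starting from 30000, with 1.2.3.4 as the externally reachable address recorded on the Task.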
A use case for it would be combining clearml-session with the k8s glue.
clearml-session will spin a remote docker on the k8s cluster, but in order for it to work it needs a single TCP port forwarded from the ingress point to the agent; it also needs to be aware of the port/address so it can actually SSH into the pod once it is up.
make sense ?
Hi AgitatedDove14
I am still not very clear on using this, even after looking at k8s_glue_example.py's code.
Is it possible to give a sample usage of how this works?
python k8s_glue_example.py --ports-mode --num-of-services
Another question: I am still not sure how this resolves my original question.
https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045
How will imposing an instance limit prevent or allow the --order-fairness feature, for example, which exists when running the clearml-agent version compared to the k8s_glue_example version?
Tks.
Regarding the limit interface, let me check; I think this is being worked on (i.e. a nice interface that should be pushed in the next few days). Let me get back to you on this one.
How will imposing an instance limit prevent or allow the --order-fairness feature, for example, which exists when running the clearml-agent version compared to the k8s_glue_example version?
A bit of background on how the glue works:
It pulls jobs from the clearml queue, then it prepares a k8s job, and launches the k8s jobs on the cluster.
By default it will just pull any job from the queue and push it into k8s, which means k8s is responsible for the order (and it has no actual priority/order).
In the limit mode (i.e. ports mode or any other limitation), the glue will not pull jobs from the clearml queue if the current number of running jobs in k8s has reached the limit. This means that the priority of the jobs is guaranteed (as they are not pulled from the queue), assuming any k8s job will be launched "immediately" (as opposed to pending on the k8s pod limitation).
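A rough sketch of that loop (just to illustrate the flow; names like count_running_pods, pop_task, and launch_job are assumptions, not the glue's real internals):

import time

def glue_main_loop(queues, k8s, max_instances):
    # queues: ClearML queue names in priority order (from --queue)
    # max_instances: the limit, e.g. --num-of-services in ports mode
    while True:
        if k8s.count_running_pods() >= max_instances:
            # Limit reached: do NOT pull from the ClearML queue, so
            # the remaining tasks keep their priority/order in the queue.
            time.sleep(5)
            continue
        task = pop_task(queues)  # pops from the first non-empty queue
        if task is None:
            time.sleep(5)
            continue
        # Below the limit: launch immediately, so nothing sits
        # "pending" inside k8s and ordering is preserved.
        k8s.launch_job(task)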
Make sense ?
Hi AgitatedDove14, thanks for the explanation.
python k8s_glue_example.py --queue high_priority_q --ports-mode --num-of-services 10
python k8s_glue_example.py --queue low_priority_q --ports-mode --num-of-services 2
Would the above be a good way to simulate the below?
clearml-agent daemon --queue high_priority_q low_priority_q
DeliciousBluewhale87 Yes, I think so; do notice that you might end up with a maximum of 12 pods.
You can also do the following with a maximum of 10 pods (notice --queue can always take a list of queues; the glue will pull based on the order of the queues):
python k8s_glue_example.py --queue high_priority_q low_priority_q --ports-mode --num-of-services 10