not just fairness, but also that the scheduled workloads will be starved of resources if, say, someone runs a training job which by default takes all the available cpus
Hi PompousParrot44
Well this kind of control is tricky. If you don't mind processes "fighting over cpu" you can just spin up two trains-agents in cpu-mode. It will work as long as they have different TRAINS_WORKER_NAME values
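e.g. something along these lines (just a sketch, assuming trains-agent is installed and picks up TRAINS_WORKER_NAME from the environment; the queue and worker names are placeholders):
```python
# Just a sketch: launch two cpu-only trains-agents on the same box, each with its
# own TRAINS_WORKER_NAME so they register as separate workers.
# Assumes trains-agent is installed; "default" queue and worker names are placeholders.
import os
import subprocess

def launch_agent(worker_name, queue="default"):
    env = dict(os.environ, TRAINS_WORKER_NAME=worker_name)
    return subprocess.Popen(
        ["trains-agent", "daemon", "--queue", queue, "--cpu-only"],
        env=env,
    )

agents = [launch_agent("box1:worker-a"), launch_agent("box1:worker-b")]
for proc in agents:
    proc.wait()
```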
The other option (might be a bit of overkill) is to use K8s, which will set the CPU % for the entire agent.
What do you think?
thanks... i was just wondering if i overlooked any config option for that... something like cpu_set
might be a possibility for cpu
PompousParrot44 now that I think about it, you might be able to limit the cpu affinity, would that help?
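something like this would do it (just a sketch, Linux only since it relies on os.sched_setaffinity; the core split and PIDs are placeholders):
```python
# Just a sketch: pin a process to a subset of cores so two jobs on the same box
# don't fight over every CPU. Linux only (os.sched_setaffinity).
import os

def pin_to_cores(pid, cores):
    os.sched_setaffinity(pid, cores)

# pin the current process to the first two cores
pin_to_cores(os.getpid(), {0, 1})

# the same idea splits a box between two jobs, e.g. (PIDs are placeholders):
# pin_to_cores(<job_a_pid>, set(range(0, 4)))   # job A gets cores 0-3
# pin_to_cores(<job_b_pid>, set(range(4, 8)))   # job B gets cores 4-7
```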
i could use k8s but that's a bit overkill currently
PompousParrot44 with pleasure. If during your search for a solution you come across something that solves it, and might integrate with the agent, do not hesitate to suggest it :)
the use case i have is to allow people from my team to run their workloads on a set of servers without stepping on each other..
So does that mean CPU only workloads?
Also are we afraid of fairness? (i.e. someone "taking" all the CPU for themselves)
i know it's not magic... it's all linux subsystems underneath.. just a matter of configuring it the way we need 🙂 for now i think i will stick with the current setup of cpu-only mode and coordinate within the team. later on when the need comes .. will see if we go for k8s or not
PompousParrot44 I see what you mean, yes, excessive context switching might cause a bit of a decline in performance, not sure how much though ... The alternative of course is to set cpu affinity... Anyhow, if you do get there we can try to come up with something that makes sense, but at the end of the day there is no magic there 🙂
not really, the OS will almost never allow that; its scheduling is actually based on fairness and priority. we can give all the agents the same low priority, then the OS will always take CPU when it needs it (most of the time it won't) and the agents will split the CPUs among them, so no one gets starved 🙂 With GPUs it is a different story, there is no real context switching or fairness mechanism like there is for CPUs
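for example something along these lines (just a sketch of the low-priority idea, assuming Linux niceness; the nice value and the agent command line are placeholders):
```python
# Just a sketch: start the agent with a higher niceness (lower priority) so the OS
# gives CPU to anything else first; processes the agent spawns inherit it.
# The nice increment and the command line are placeholders.
import os
import subprocess

NICE_INCREMENT = 10  # higher niceness = lower priority

def launch_low_priority(cmd):
    # preexec_fn runs in the child just before exec, so the training processes
    # started by the agent inherit the lowered priority as well
    return subprocess.Popen(cmd, preexec_fn=lambda: os.nice(NICE_INCREMENT))

agent = launch_low_priority(["trains-agent", "daemon", "--queue", "default", "--cpu-only"])
agent.wait()
```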
maybe i was not so clear.. say e.g. you're running lightgbm model training; by default it will take all the cpus available on the box and run that many threads. now another task gets scheduled on the same box, and you have 2x the threads with the same amount of CPU to schedule them on. So yes, the jobs will progress, but not at the same rate, because context switches will happen way more often than if we had allowed each job only half the threads
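e.g. with lightgbm you could cap num_threads yourself, something like this (just a sketch with random data; halving the cores is an arbitrary choice):
```python
# Just a sketch: cap lightgbm's thread count so two trainings can share one box
# instead of each spawning a thread per core. Random data, only for illustration.
import os
import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

params = {
    "objective": "binary",
    "num_threads": max(1, (os.cpu_count() or 2) // 2),  # half the cores, leave the rest for the other job
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
```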
i don't need this right away.. i just wanted to know the possibility of dividing the current machine into multiple workers... i guess if it's not readily available then maybe you guys can discuss to see if it makes sense to have it on the roadmap..