DeliciousBluewhale87 You can have multiple queues for the k8s glue, in priority order:
python k8s_glue_example.py --queue glue_q_high glue_q_low
Then if someone is doing 100 experiments (say HPO), they push them into "glue_q_low", which means the glue will first pop Tasks from the high priority queue, and only if it is empty will it pop from the low priority queue.
Does that make sense ?
Thanks DeliciousBluewhale87 ! greatly appreciated 🙂
Is there any documentation on how we can use this ports mode? I didn't seem to find any.. Thanks
Hi DeliciousBluewhale87
Hmm, good question.
Basically the idea is that if you have an ingestion service on the pods (i.e. as part of the yaml template used by the k8s glue), you can specify to the glue what the exposed ports are, so it knows (1) the maximum number of instances it can spin up, e.g. one per port, and (2) it will set the external port number on the Task, so that the running agent/code is aware of the exposed port.
A use case for it would be combining clearml-session with the k8s glue.
clearml-session will spin up a remote docker on the k8s cluster, but in order for it to work it needs a single TCP port forwarded from the ingestion point to the agent, and it also needs to be aware of the port/address so it can actually SSH into the pod once it is up.
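For example (the queue name, gateway address, and template file below are just placeholders), launching the glue in ports mode for clearml-session could look something like:
python k8s_glue_example.py --queue sessions_q --ports-mode --num-of-services 10 --base-port 30000 --gateway-address 10.0.0.1 --template-yaml session_template.yaml
This would allow up to 10 concurrent sessions, one exposed port per pod.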
make sense ?
Regarding the limit interface, let me check; I think this is being worked on (i.e. a nice interface that should be pushed in the next few days). Let me get back to you on this one.
How will imposing an instance limit prevent or allow the --order-fairness feature, for example, which exists when running the clearml-agent version compared to the k8s_glue_example version?
A bit of background on how the glue works:
It pulls jobs from the clearml queue, then it prepares a k8s job, and launches it on the cluster.
By default it will just pull any job from the queue and push it into the k8s cluster, which means k8s is responsible for the order (and it has no actual priority/order).
In limit mode (i.e. ports mode or any other limitation), the glue will not pull jobs from the clearml queue if the current number of running jobs in the k8s cluster has reached the limit. This means the priority of the jobs is guaranteed (as they are not pulled from the queue), assuming every k8s job is launched "immediately" (as opposed to pending on k8s pod limitations).
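In (hypothetical) Python pseudocode, the limit-mode loop is roughly this (the helper names are made up, this is not the actual clearml-agent code):

import time

MAX_RUNNING_JOBS = 10  # the limit, e.g. one job per exposed port in ports mode

def glue_loop(queues):
    while True:
        # At the limit: leave Tasks in the ClearML queue, so their
        # priority/order is preserved until a running job completes.
        if count_running_k8s_jobs() >= MAX_RUNNING_JOBS:  # hypothetical helper
            time.sleep(5)
            continue
        # Below the limit: pop the next Task, polling the queues in priority order.
        task = pop_task(queues)  # hypothetical helper
        if task is None:
            time.sleep(5)
            continue
        launch_k8s_job(task)  # hypothetical helper: render the k8s job template and launch it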
Make sense ?
DeliciousBluewhale87 not in the open source version, for some reason it is not passed 😞
Could you explain the use case ?
AgitatedDove14 I am confused now.. So this feature isn't available in the k8s glue? Or is it going to be implemented?
Hi AgitatedDove14 , Now we prefer to run dynamic agents instead, using:
python3 k8s_glue_example.py
In this case, is it still possible to pass --order-fairness at the queue level, or is this more of an Enterprise edition feature?
Hi AgitatedDove14 , Thanks for the explanation.
python k8s_glue_example.py --queue high_priority_q --ports-mode --num-of-services 10
python k8s_glue_example.py --queue low_priority_q --ports-mode --num-of-services 2
Would the above be a good way to simulate the below?
clearml-agent daemon --queue high_priority_q low_priority_q
DeliciousBluewhale87 Yes I think so, do notice that you might end up with a maximum of 12 pods (10 from the first glue plus 2 from the second).
You can also do the following with a maximum of 10 pods (notice --queue can always get a list of queues; it will pull based on the order of the queues):
python k8s_glue_example.py --queue high_priority_q low_priority_q --ports-mode --num-of-services 10
Is it possible in ClearML to somehow allocate resources so that, say, after running a number of Alice's tasks, Bob's tasks get processed (e.g. in a round-robin fashion)?
Hi DeliciousBluewhale87
A few options here:
1. Set the agent with high / low priority queues. Make sure Alice pushes into the low priority queue (aka HPO), then Bob can push into the high priority queue when he needs to. This makes a lot of sense when you have automation processes spinning many experiments.
2. Expanding on (1), you could set different agents with different priorities (for example, low-compute machines pulling from the low priority queue), etc. The idea is that per agent you can play around with priorities.
3. Per agent you can pass the flag --order-fairness, which will basically pull in a round-robin way from all the queues (instead of in priority order). That means that if you have an Alice queue and a Bob queue, the agent will pull once from A, then once from B, then again from A, and so on (see the example below). This will ensure that Bob gets a "fair" chance of executing a Task.
4. In the Enterprise edition there are actually quotas you can assign per user/group/queue, to limit Alice so she will not hog the queues 🙂
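To make this concrete, here are the two setups side by side (queue names are just placeholders):
clearml-agent daemon --queue bob_q alice_q
clearml-agent daemon --queue alice_q bob_q --order-fairness
The first pulls in priority order (bob_q is always drained first); the second alternates between the two queues in round robin.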
Hi AgitatedDove14 , Just following up on your reply on https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045 :
"Basically as jobs are pulled by order, they are pushed into the k8s, then if we hit the max k8s instance limit, we stop pulling jobs until a k8s job is completed, then we continue."
For this scenario,
k8s has an instance limit of 10 (let's say)
I run an Optimization task (it has about 100 jobs), but only the first 10 will be pulled into k8s. After this, I run a single Deep Learning (DL) task.
After the 1st Optimization job finishes, what's the next job that will be pulled?
Question: the Optimization's 11th job, or the DL task?
DeliciousBluewhale87 my apologies you are correct 😞
We should probably add support for that, do you feel like adding a GitHub issue, so we do not forget?
Hi AgitatedDove14
I am still not very clear on using this, even after looking at k8s_glue_example.py's code.
Is it possible to give a sample usage of how this works?
python k8s_glue_example.py --ports-mode --num-of-services
Another question: I am still not sure how this resolves my original question.
https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045
How will imposing an instance limit prevent or allow the --order-fairness feature, for example, which exists when running the clearml-agent version compared to the k8s_glue_example version?
Thanks.
It is currently only enabled when using ports mode; it should be enabled by default, i.e. a new feature :)
After the 1st Optimization job finishes, what's the next job that will be pulled?
The one in the highest priority queue (if you have multiple queues).
If you use fairness it will pull in round robin from all queues (obviously, inside every queue it is based on the order of the jobs).
fyi, you can reorder the jobs inside the queue from the UI 🙂
DeliciousBluewhale87 wdyt?
The use case is, let's say I run:
python k8s_glue_example.py --queue glue_q
And some guy pushes a hyperparameter optimization job with over 100 experiments to glue_q; one minute later, I push a simple training job to glue_q.. But I will be forced to wait for the 100 experiments to finish.
python3 k8s_glue_example.py --queue glue_high_q glue_low_q
usage: k8s_glue_example.py [-h] [--queue QUEUE] [--ports-mode] [--num-of-services NUM_OF_SERVICES] [--base-port BASE_PORT] [--base-pod-num BASE_POD_NUM] [--gateway-address GATEWAY_ADDRESS] [--pod-clearml-conf POD_CLEARML_CONF] [--overrides-yaml OVERRIDES_YAML] [--template-yaml TEMPLATE_YAML] [--ssh-server-port SSH_SERVER_PORT] [--namespace NAMESPACE]
k8s_glue_example.py: error: unrecognized arguments: glue_low_q
Looking at the source code, it also seems it doesn't accept multiple queue arguments..
GitHub issue: https://github.com/allegroai/clearml-agent/issues/50
AgitatedDove14 , I have added the GitHub issue as requested. Thanks for the help. 👍