Answered

Hey hey! My team and I are currently testing ClearML agents for running experiments. So far it has been great and we really love the whole ClearML ecosystem!! However, there is something I don't quite understand. Basically, we have two clusters, A and B; in each of them we spin up agents using the Helm chart, and both serve a "gpu" queue. The issue is that when someone enqueues an experiment in this "gpu" queue, it often happens that agent A picks up the job even though its cluster has no resources available, so the experiment (pod) stays in a Pending state even if cluster B has free resources. Is there any way to check which agent has resources available and run the experiment there??

  
  
Posted 8 months ago

Answers 4


Hi @<1673863788857659392:profile|HomelyRabbit25>, in the k8s agent setup, an agent will pick up a task and create a pod for it as soon as it is able to. To limit this behavior, you can cap the number of pods the agent is allowed to spawn. For example, if each experiment uses a single GPU and cluster A has 8 GPUs, it would make sense to limit the number of pods (using the maxPods setting) for agent A to 8...
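A minimal sketch of what that could look like when installing the agent chart for cluster A, assuming your chart version exposes the pod limit as agentk8sglue.maxPods and the queue as agentk8sglue.queue (the exact value names may differ between chart versions, so check the chart's values.yaml):

    # hypothetical install for cluster A, capping pods to its 8 GPUs
    # (server URLs and credentials omitted; add your usual values)
    helm repo add clearml https://clearml.github.io/clearml-helm-charts
    helm upgrade --install clearml-agent clearml/clearml-agent \
      --namespace clearml --create-namespace \
      --set agentk8sglue.queue=gpu \
      --set agentk8sglue.maxPods=8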

  
  
Posted 8 months ago

Hey @<1523701087100473344:profile|SuccessfulKoala55>, that seems to work, thanks! One thing that is still not clear to me: what is the recommended way of running agents in Kubernetes? As I understand it, there are two options: the ClearML Agent Helm Chart, which uses the k8s glue code, or running a clearml-agent daemon inside a pod (one that already has the GPUs assigned to it). Which one is preferred? I see issues with both approaches, and personally I believe the Helm Chart is the correct way, but I could be wrong.
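To be clear, by the second option I mean a long-lived pod that already requests the GPUs through its resource limits, where the container just runs something roughly like this (flags from memory, so they may not be exact):

    # sketch of the daemon-in-a-pod approach
    clearml-agent daemon --queue gpu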

  
  
Posted 8 months ago

thanks!

  
  
Posted 7 months ago

The Helm chart is definitely the recommended way, and it also fits k8s better 🙂

  
  
Posted 7 months ago