Good Morning Folks, I Am Setting Up Clearml On A (Self-Hosted) K8S Cluster Using The

Answered

Good morning folks, I am setting up ClearML on a (self-hosted) K8s cluster using the Helm chart at https://github.com/allegroai/clearml-helm-charts/blob/main/charts/clearml as the basis.
I managed to get the basics working (authentication, replacing tokens, etc.) and I would now like to configure some agents to add to the cluster. In the values.yaml (https://github.com/allegroai/clearml-helm-charts/blob/4422cf433d3bf30699ae7094296a1eaa65fb3787/charts/clearml/values.yaml#L208) I see there are several parameters related to the agents, but I am not really sure whether they fit our needs or what the recommended way to proceed is.

Our cluster has nodes with different GPUs, so agents on different nodes will probably require images with different CUDA versions.

How would you recommend configuring the agents?

  				
Posted 3 years ago by SarcasticSquirrel56


Answers 11

What I still don't get is how you would create different queues that target different nodes with different GPUs, and have each of them use the appropriate CUDA image.
Looking at the template, I don't understand how that's possible.

  				
Posted 3 years ago by SarcasticSquirrel56

Hi Martin, thanks. My doubt is:
If I manually configure the pods for the different nodes, how do I make the ClearML server aware that those agents exist? This step is really not clear to me from the documentation (it talks about users, and it uses interactive commands, which would mean entering the agents manually). I will also try the K8s glue, but I would first like to understand how to configure a fixed number of agents manually.

  				
Posted 3 years ago by SarcasticSquirrel56

Hi AgitatedDove14, I have spent some time going through the Helm charts, but I admit it is still not clear to me how things should work.

I see that with the default values (mostly what I am using), the K8s Glue agent is deployed (which is what you suggested using).

  				
Posted 3 years ago by SarcasticSquirrel56

Correct (if this is running on K8s, these are most likely passed via environment variables, CLEARML_WEB_HOST etc.)
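
As a rough illustration of the environment-variable route, the sketch below builds the environment and command line for one agent process. The helper names, host URLs, and credentials are placeholders introduced here, not values from the thread or the Helm chart; the variable names are the ones the ClearML agent reads for its server endpoints and API credentials.

```python
import os
import subprocess
from typing import Dict, List


def agent_environment(api_host: str, web_host: str, files_host: str,
                      access_key: str, secret_key: str) -> Dict[str, str]:
    """Return a copy of the current environment with the ClearML settings added."""
    env = dict(os.environ)
    env.update({
        "CLEARML_API_HOST": api_host,
        "CLEARML_WEB_HOST": web_host,
        "CLEARML_FILES_HOST": files_host,
        "CLEARML_API_ACCESS_KEY": access_key,
        "CLEARML_API_SECRET_KEY": secret_key,
    })
    return env


def agent_command(queue: str) -> List[str]:
    """Command line for an agent that listens on a single queue."""
    return ["clearml-agent", "daemon", "--queue", queue]


def start_agent(queue: str, env: Dict[str, str]) -> subprocess.Popen:
    """Start the agent; it registers itself with the server once it is running."""
    return subprocess.Popen(agent_command(queue), env=env)
```

In a pod, the same variables would typically be set in the container spec rather than from Python; the point is only that no interactive `clearml-agent init` step is needed when they are present.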

  				
Posted 3 years ago by AgitatedDove14

SarcasticSquirrel56

if I configure manually the pods for the different nodes, how do I make clearml server aware that those agents exist?

Basically, the agents register themselves on your clearml-server, and they register which queue(s) they listen to. In other words, the way to choose between the different types of machines/GPUs is to enqueue the Task to different queues.
For example: Queue(1): "CUDA11_GPUx1", Queue(2): "CUDA10_GPUx1"
Does that make sense?
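
A minimal sketch of that idea, assuming the two queue names from the example above already exist and that agents with the matching CUDA images are listening on them. The mapping and the helper names are hypothetical; `Task.enqueue` is the ClearML SDK call for placing an existing task on a queue.

```python
from typing import Dict

# Hypothetical mapping from the CUDA major version a task needs to a queue name.
# The queue names follow the example in the answer above.
QUEUE_BY_CUDA: Dict[int, str] = {
    11: "CUDA11_GPUx1",
    10: "CUDA10_GPUx1",
}


def queue_for_cuda(cuda_major: int) -> str:
    """Return the queue served by agents running the given CUDA major version."""
    try:
        return QUEUE_BY_CUDA[cuda_major]
    except KeyError:
        raise ValueError(f"no queue configured for CUDA {cuda_major}") from None


def enqueue_on_matching_queue(task_id: str, cuda_major: int) -> None:
    """Enqueue an existing task on the queue that matches its CUDA requirement."""
    # Imported lazily so queue_for_cuda can be used without the clearml package.
    from clearml import Task

    Task.enqueue(task=task_id, queue_name=queue_for_cuda(cuda_major))
```

Whichever agent is listening on the chosen queue picks the task up, so the queue name is effectively the hardware selector.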

EDIT:

I guess to achieve what I want, I could disable the agent using the helm chart values.yaml
and then define pods for each of the agents on their respective nodes

That might be the case. I have to admit I can't remember how flexible the Helm chart is in this respect ...

  				
Posted 3 years ago by AgitatedDove14

So that agent on different nodes will probably require different cuda-version images.

That makes sense, SarcasticSquirrel56.
I would edit the Helm chart (or deploy manually) based on a selector that selects the different nodes/GPUs and assigns the correct containers (i.e. matching CUDA versions to the different GPUs/drivers).
BTW: you can also play around with the K8s glue, which dynamically spins up pods based on ClearML Tasks.
wdyt?
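
One way to picture the manual route is a small generator that emits one Deployment per GPU type, pinned to nodes by label. This is only a sketch under stated assumptions: the `gpu-type` node label, the image names, and the deployment names are placeholders, the image is assumed to have `clearml-agent` installed, and the server credentials (environment variables or a mounted config) are omitted for brevity.

```python
import json
from typing import Any, Dict


def agent_deployment(name: str, image: str, queue: str,
                     node_label_value: str,
                     node_label_key: str = "gpu-type") -> Dict[str, Any]:
    """Build a Deployment manifest for one agent pinned to one kind of GPU node."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Schedule only on nodes carrying the (hypothetical) GPU label.
                    "nodeSelector": {node_label_key: node_label_value},
                    "containers": [{
                        "name": "clearml-agent",
                        "image": image,  # placeholder image with the right CUDA version
                        "command": ["clearml-agent", "daemon", "--queue", queue],
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                },
            },
        },
    }


# Example: print a manifest for a hypothetical CUDA 11 node pool.
manifest = agent_deployment(
    name="agent-cuda11",
    image="registry.example/clearml-agent:cuda11",
    queue="CUDA11_GPUx1",
    node_label_value="cuda11",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.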

  				
Posted 3 years ago by AgitatedDove14

Right now I see the default agent that comes with the helm chart...

  				
Posted 3 years ago by SarcasticSquirrel56

I guess to achieve what I want, I could disable the agent using the helm chart values.yaml
and then define pods for each of the agents on their respective nodes

  				
Posted 3 years ago by SarcasticSquirrel56

Thanks Martin. So if I understand correctly, when I run the clearml-agent init command (I have to check the syntax) and provide the API server, web server and file server URLs, the agents will be registered with the ClearML cluster?

  				
Posted 3 years ago by SarcasticSquirrel56

Thanks, I'll try to understand how the default agent that comes with the Helm chart is configured, and then work out from there how to set up a different one.

  				
Posted 3 years ago by SarcasticSquirrel56

is how you would create different queues,

SarcasticSquirrel56 you can create them from the UI once the server is running
(if you are asking how to create them during the first installation, then yes, you are correct, this is possible in the helm chart, I think 😞 )
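
Besides the UI, queues can also be created programmatically once the server is up. The sketch below assumes the ClearML SDK's `APIClient` exposes the server's `queues.create` endpoint and that credentials are already configured; the helper names are hypothetical, and the queue names are the ones used as examples elsewhere in this thread.

```python
from typing import Any, Iterable, List


def create_queues(client: Any, names: Iterable[str]) -> List[str]:
    """Create one queue per name through the given API client."""
    created = []
    for name in names:
        # Assumed to fail if a queue with this name already exists.
        client.queues.create(name=name)
        created.append(name)
    return created


def create_example_queues() -> List[str]:
    """Create the two example queues against the configured ClearML server."""
    # Imported lazily so create_queues can be exercised without the clearml package.
    from clearml.backend_api.session.client import APIClient

    return create_queues(APIClient(), ["CUDA11_GPUx1", "CUDA10_GPUx1"])
```

Running `create_example_queues()` once after installation would leave both queues ready for agents to listen on.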

  				
Posted 3 years ago by AgitatedDove14
