Not yet, I tried making it work manually. Might give it a try, thanks!
@<1710827340621156352:profile|HungryFrog27> have you installed the Nvidia gpu-operator to advertise GPUs to Kubernetes?
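If not, a plain default install is usually enough to get nvidia.com/gpu advertised on the node. Roughly something like this (release name and namespace below are just examples, adjust to your cluster):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --wait
# once the operator pods are running, the node should report an allocatable GPU:
kubectl describe node <node-name> | grep -i "nvidia.com/gpu"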
Hi @<1710827340621156352:profile|HungryFrog27> , what seems to be the issue?
Hi @<1523701070390366208:profile|CostlyOstrich36> ,
I tried setting requests & limits under the k8sGlue configuration in the clearml-agent Helm chart values, to force the pods to pick up the GPU from the server, while also choosing a pod image for the k8s jobs that includes CUDA (we're using nvidia/cuda:12.4.1 for testing).
The job is created, but it simply can't detect a GPU. Attaching the value overrides I'm using for the chart:
agentk8sglue:
  apiServerUrlReference: "http://<server-ip>:30008"
  fileServerUrlReference: "http://<server-ip>:30081"
  webServerUrlReference: "http://<server-ip>:30080"
  queue: "qubo-emulator"
  replicaCount: 1
  basePodTemplate:
    resource:
      requests:
        cpu: "2"
        memory: "4Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "2"
        memory: "4Gi"
        nvidia.com/gpu: "1"

clearml:
  agentk8sglueKey: "<key>"
  agentk8sglueSecret: "<secret>"

sessions:
  svcType: "NodePort"
  startingPort: 30000
  maxServices: 20
  externalIP: "<node's IP>"
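A couple of checks that might help narrow this down, assuming the gpu-operator (or at least the NVIDIA device plugin) is in place; node and pod names below are just placeholders:
# does the scheduler see the GPU on the node?
kubectl describe node <node-name> | grep -i "nvidia.com/gpu"
# is the driver visible from inside the spawned task pod?
kubectl exec -it <task-pod-name> -- nvidia-smi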