Hey!
Has anyone had experience with setting up ClearML K8s-based agents to create K8s jobs connected to the node's GPU?
We're running K3s on a local server.
Thanks, as this is currently blocking us.
Hi @<1523701070390366208:profile|CostlyOstrich36> ,
I tried setting requests & limits under the k8sGlue configuration in the clearml-agent Helm chart values in order to force the pods to pick up the GPU from the server, while of course choosing a pod image for the K8s jobs that includes GPU support (we're using nvidia/cuda:12.4.1 for testing).
The job is created, but it simply can't detect a GPU. Attaching the value overrides I'm using for the chart:
agentk8sglue:
  apiServerUrlReference: "http://<server-ip>:30008"
  fileServerUrlReference: "http://<server-ip>:30081"
  webServerUrlReference: "http://<server-ip>:30080"
  queue: "qubo-emulator"
  replicaCount: 1
  basePodTemplate:
    resource:
      requests:
        cpu: "2"
        memory: "4Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "2"
        memory: "4Gi"
        nvidia.com/gpu: "1"

clearml:
  agentk8sglueKey: "<key>"
  agentk8sglueSecret: "<secret>"

sessions:
  svcType: "NodePort"
  startingPort: 30000
  maxServices: 20
  externalIP: "<node's IP>"
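
For reference, a sketch of an override worth comparing against the one above, with two assumptions called out: the standard Kubernetes key in a pod template is resources (plural), so if the chart follows that convention the resource: block may be silently ignored and the GPU request never reaches the pod spec; and on K3s with the NVIDIA container toolkit the job pod usually also needs the nvidia RuntimeClass, which only applies if the chart's basePodTemplate passes that field through to the generated pod:

agentk8sglue:
  basePodTemplate:
    # Assumption: "resources" (plural), matching a standard pod spec;
    # check the chart's values.yaml for the exact key it expects.
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
        # Extended resources such as GPUs only need to appear under limits.
        nvidia.com/gpu: "1"
    # Assumption: the K3s node has the NVIDIA container toolkit and device
    # plugin installed, exposing an "nvidia" RuntimeClass for pods to use.
    runtimeClassName: nvidia

Independently of the chart, the node has to advertise the GPU before Kubernetes can schedule it: kubectl describe node <node> should list nvidia.com/gpu under Capacity/Allocatable. If it doesn't, the NVIDIA device plugin isn't running on the K3s node and no chart-level override will make the job see a GPU.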