Answered
Hello everyone, I deployed ClearML and clearml-agent on k8s and I can't get the agent to use the GPUs. I have 3 machines with a GPU each correctly configured on the cluster (previously used by determined.ai)

Hello everyone,

I deployed ClearML and clearml-agent on k8s and I can't get the agent to use the GPUs. I have 3 machines with a GPU each, correctly configured on the cluster (previously used by determined.ai).

Whenever I launch a task, only the CPU is used. What could I be doing wrong?

Here's my values.yaml for the agent:

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: clearml-agent
  namespace: clearml
spec:
  chart:
    spec:
      sourceRef:
        kind: GitRepository
        name: flux-system
        namespace: flux-system
      chart: ../../../charts/clearml-agent-5.3.1.tgz
      reconcileStrategy: ChartVersion
  interval: 1m0s
  values:
    imageCredentials:
      enabled: true                   
      existingSecret: ""              
      registry: "***"
      username: "***"       
      password: "***"
      email: "***"        
    agentk8sglue:
      extraEnvs:
        - name: CLEARML_AGENT_UPDATE_VERSION
          value: "==1.9.2"
      apiServerUrlReference:
      fileServerUrlReference:
      webServerUrlReference:
      createQueueIfNotExists: false
      queue: default
      replicaCount: 1
      
      defaultContainerImage: "***" #my cuda based image
  
      image:
        registry: ""
        repository: "allegroai/clearml-agent-k8s-base"
        tag: "1.24-21"

      basePodTemplate:
        env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: all
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: compute,utility      
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1 # Do I even need this?
        nodeSelector:
          nvidia.com/gpu.present: "true"
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"

    clearml:
      agentk8sglueKey: "***"
      agentk8sglueSecret: "***"
      clearmlConfig: |-
        sdk {
            azure.storage {
                # max_connections: 2
                containers: [
                    {
                        account_name: "***"
                        account_key: "***"
                        container_name: "***"
                    }
                ]
            }
        }
    global:
      imageRegistry: "docker.io"
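
For reference, here's the minimal check I run inside the task to confirm whether the pod sees a GPU at all (just a sketch: the project/task names are placeholders, and it assumes PyTorch is installed in my CUDA image):

import subprocess

import torch
from clearml import Task

# Placeholder names; any throwaway project/task works for this check.
task = Task.init(project_name="gpu-debug", task_name="cuda-visibility-check")

# If the pod was scheduled with an nvidia.com/gpu resource and the device
# plugin is working, torch should report at least one device here.
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())

# nvidia-smi is injected by the NVIDIA container runtime; if this fails,
# the container never received a GPU in the first place.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)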
  
  
Posted 14 days ago

3 Answers


SuccessfulKoala55, any idea what might be wrong?

  
  
Posted 9 days ago

Hey SuccessfulKoala55,

I have queue: default under agentk8sglue:, and I execute the task with task.execute_remotely(queue_name="default", exit_process=True).
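
For completeness, here's the enqueue side in full (a sketch; the project and task names are placeholders):

from clearml import Task

task = Task.init(project_name="my-project", task_name="train")

# Hands the task off to the "default" queue; the agent chart must be
# watching the same queue name (agentk8sglue.queue) to pick it up.
task.execute_remotely(queue_name="default", exit_process=True)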

  
  
Posted 13 days ago

Hi DecayedRaccoon75, what ClearML queue are you using to enqueue the tasks? Did you specify that queue in the agent chart values?
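
If it helps to cross-check, you can also push an existing task onto an explicit queue from the SDK, so the queue name can be compared directly against the chart's agentk8sglue.queue value (a sketch; the task ID is a placeholder):

from clearml import Task

# Look up the task by ID (placeholder) and enqueue it explicitly; the
# queue_name here must match the queue the agent pod is polling.
task = Task.get_task(task_id="<task-id>")
Task.enqueue(task, queue_name="default")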

  
  
Posted 13 days ago