Answered
Hello everyone, I deployed ClearML and clearml-agent on k8s and I can't get the agent to use the GPUs. I have 3 machines with a GPU each correctly configured on the cluster (previously used by determined.ai)

Hello everyone,

I deployed ClearML and clearml-agent on k8s and I can't get the agent to use the GPUs. I have 3 machines with a GPU each, correctly configured on the cluster (previously used by determined.ai).

Whenever I launch a task, only the CPU is used. What could I be doing wrong?

Here's my values.yaml for the agent:

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: clearml-agent
  namespace: clearml
spec:
  chart:
    spec:
      sourceRef:
        kind: GitRepository
        name: flux-system
        namespace: flux-system
      chart: ../../../charts/clearml-agent-5.3.1.tgz
      reconcileStrategy: ChartVersion
  interval: 1m0s
  values:
    imageCredentials:
      enabled: true                   
      existingSecret: ""              
      registry: "***"
      username: "***"       
      password: "***"
      email: "***"        
    agentk8sglue:
      extraEnvs:
        - name: CLEARML_AGENT_UPDATE_VERSION
          value: "==1.9.2"
      apiServerUrlReference:
      fileServerUrlReference:
      webServerUrlReference:
      createQueueIfNotExists: false
      queue: default
      replicaCount: 1
      
      defaultContainerImage: "***" #my cuda based image
  
      image:
        registry: ""
        repository: "allegroai/clearml-agent-k8s-base"
        tag: "1.24-21"

      basePodTemplate:
        env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: all
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: compute,utility      
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1 # Do I even need this?
        nodeSelector:
          nvidia.com/gpu.present: "true"
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"

    clearml:
      agentk8sglueKey: "***"
      agentk8sglueSecret: "***"
      clearmlConfig: |-
        sdk {
            azure.storage {
                # max_connections: 2
                containers: [
                    {
                        account_name: "***"
                        account_key: "***"
                        container_name: "***"
                    }
                ]
            }
        }
    global:
      imageRegistry: "docker.io"
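
For reference, here's the minimal check I run inside the task to confirm whether the pod sees a GPU at all (just a sketch: the project/task names are placeholders, and it assumes PyTorch is installed in my CUDA image):

import subprocess

import torch
from clearml import Task

# Placeholder names; any throwaway project/task works for this check.
task = Task.init(project_name="gpu-debug", task_name="cuda-visibility-check")

# If the pod was scheduled with an nvidia.com/gpu resource and the device
# plugin is working, torch should report at least one device here.
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())

# nvidia-smi is injected by the NVIDIA container runtime; if this fails,
# the container never received a GPU in the first place.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)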
  
  
Posted 14 days ago

3 Answers


SuccessfulKoala55, any idea what might be wrong?

  
  
Posted 9 days ago

Hey SuccessfulKoala55,

I have queue: default under agentk8sglue:, and I execute the task with task.execute_remotely(queue_name="default", exit_process=True).
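
For completeness, here's the enqueue side in full (a sketch; the project and task names are placeholders):

from clearml import Task

task = Task.init(project_name="my-project", task_name="train")

# Hands the task off to the "default" queue; the agent chart must be
# watching the same queue name (agentk8sglue.queue) to pick it up.
task.execute_remotely(queue_name="default", exit_process=True)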

  
  
Posted 13 days ago

Hi DecayedRaccoon75, what ClearML queue are you using to enqueue the tasks? Did you specify that queue in the agent chart values?
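
If it helps to cross-check, you can also push an existing task onto an explicit queue from the SDK, so the queue name can be compared directly against the chart's agentk8sglue.queue value (a sketch; the task ID is a placeholder):

from clearml import Task

# Look up the task by ID (placeholder) and enqueue it explicitly; the
# queue_name here must match the queue the agent pod is polling.
task = Task.get_task(task_id="<task-id>")
Task.enqueue(task, queue_name="default")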

  
  
Posted 13 days ago