Answered
Hello everyone, I deployed ClearML and clearml-agent on k8s and I can't get the agent to use the GPUs. I have 3 machines with a GPU each correctly configured on the cluster (previously used by

Hello everyone,

I deployed ClearML and clearml-agent on k8s and I can't get the agent to use the GPUs. I have 3 machines with a GPU each, correctly configured on the cluster (previously used by determined.ai).

Whenever I launch a task, only the CPU is used. What could I be doing wrong?
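One quick way to narrow this down is to check, from inside the running task, whether the container can see a GPU at all. A minimal stdlib-only sketch (the function name `gpu_visible` is illustrative, not a ClearML API):

```python
# Hedged sketch (stdlib only): check whether the container the task runs in
# can see a GPU. If this prints False, the pod itself has no GPU access and
# the problem is at the k8s/device-plugin level rather than in the training code.
import shutil
import subprocess

def gpu_visible():
    """True if nvidia-smi is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-L"],  # prints one line per GPU, e.g. "GPU 0: ..."
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0 and "GPU" in proc.stdout

print("GPU visible in this container:", gpu_visible())
```

If this returns False inside the task pod, the issue is with the pod spec or node setup, not with the training framework.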

Here's my values.yaml for the agent:

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: clearml-agent
  namespace: clearml
spec:
  chart:
    spec:
      sourceRef:
        kind: GitRepository
        name: flux-system
        namespace: flux-system
      chart: ../../../charts/clearml-agent-5.3.1.tgz
      reconcileStrategy: ChartVersion
  interval: 1m0s
  values:
    imageCredentials:
      enabled: true                   
      existingSecret: ""              
      registry: "***"
      username: "***"       
      password: "***"
      email: "***"        
    agentk8sglue:
      extraEnvs:
        - name: CLEARML_AGENT_UPDATE_VERSION
          value: "==1.9.2"
      apiServerUrlReference:
      fileServerUrlReference:
      webServerUrlReference:
      createQueueIfNotExists: false
      queue: default
      replicaCount: 1
      
      defaultContainerImage: "***" #my cuda based image
  
      image:
        registry: ""
        repository: "allegroai/clearml-agent-k8s-base"
        tag: "1.24-21"

      basePodTemplate:
        env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: all
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: compute,utility      
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1 #Do i even need this?
        nodeSelector:
          nvidia.com/gpu.present: "true"
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"

    clearml:
      agentk8sglueKey: "***"
      agentk8sglueSecret: "***"
      clearmlConfig: |-
        sdk {
            azure.storage {
                # max_connections: 2
                containers: [
                    {
                        account_name: "***"
                        account_key: "***"
                        container_name: "***"
                    }
                ]
            }
        }
    global:
      imageRegistry: "docker.io"
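Before digging further into the chart values, it may be worth confirming that the nodes actually advertise `nvidia.com/gpu` as an allocatable resource (i.e. that the NVIDIA device plugin is running). A minimal sketch, assuming `kubectl` is installed and pointed at the cluster; the helper name `gpu_nodes` is mine:

```python
# Hedged sketch: list nodes and the GPU count they advertise, via kubectl's
# JSON output. Returns None (instead of raising) when kubectl is unavailable,
# so the check degrades gracefully outside the cluster.
import json
import shutil
import subprocess

def gpu_nodes():
    """Return [(node_name, allocatable_gpus)] or None if kubectl is missing/fails."""
    if shutil.which("kubectl") is None:
        return None
    proc = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, timeout=30,
    )
    if proc.returncode != 0:
        return None
    nodes = json.loads(proc.stdout)["items"]
    return [
        (n["metadata"]["name"],
         n["status"].get("allocatable", {}).get("nvidia.com/gpu", "0"))
        for n in nodes
    ]

print(gpu_nodes())
```

If a node shows `"0"` (or the key is absent) even though it has a GPU, the device plugin is not advertising it, and no pod resource request will help.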
  
  
Posted 15 days ago

Answers 3


Hey SuccessfulKoala55 ,

I have queue: default under agentk8sglue:
And I execute the task with task.execute_remotely(queue_name="default", exit_process=True)

  
  
Posted 14 days ago

Hi DecayedRaccoon75 , what ClearML queue are you using to enqueue the tasks? Did you specify that queue in the agent chart values?

  
  
Posted 14 days ago

SuccessfulKoala55 any idea what might be wrong?

  
  
Posted 10 days ago
93 Views