Hello everyone,
I deployed ClearML and clearml-agent on k8s and I can't get the agent to use the GPUs. I have three machines, each with one GPU, correctly configured on the cluster (previously used by determined.ai).
Whenever I launch a task, only the CPU is used. What could I be doing wrong?
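For context, this is roughly the check I run inside the task to confirm it only sees the CPU (a minimal sketch; it assumes PyTorch is installed in my CUDA image, and the project/task names are just placeholders):

from clearml import Task
import torch

# Register the check as a ClearML task so it runs through the agent/queue.
task = Task.init(project_name="debug", task_name="gpu-check")

# On the GPU nodes this should report at least one CUDA device;
# in my case it currently reports CPU only.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))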
Here's my Flux HelmRelease for the agent (chart values inline):
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: clearml-agent
  namespace: clearml
spec:
  chart:
    spec:
      sourceRef:
        kind: GitRepository
        name: flux-system
        namespace: flux-system
      chart: ../../../charts/clearml-agent-5.3.1.tgz
      reconcileStrategy: ChartVersion
  interval: 1m0s
  values:
    imageCredentials:
      enabled: true
      existingSecret: ""
      registry: "***"
      username: "***"
      password: "***"
      email: "***"
    agentk8sglue:
      extraEnvs:
        - name: CLEARML_AGENT_UPDATE_VERSION
          value: "==1.9.2"
      apiServerUrlReference:
      fileServerUrlReference:
      webServerUrlReference:
      createQueueIfNotExists: false
      queue: default
      replicaCount: 1
      defaultContainerImage: "***" # my CUDA-based image
      image:
        registry: ""
        repository: "allegroai/clearml-agent-k8s-base"
        tag: "1.24-21"
      basePodTemplate:
        env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: all
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: compute,utility
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1 # Do I even need this?
        nodeSelector:
          nvidia.com/gpu.present: "true"
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
    clearml:
      agentk8sglueKey: "***"
      agentk8sglueSecret: "***"
      clearmlConfig: |-
        sdk {
          azure.storage {
            # max_connections: 2
            containers: [
              {
                account_name: "***"
                account_key: "***"
                container_name: "***"
              }
            ]
          }
        }
    global:
      imageRegistry: "docker.io"