Hi @<1523701070390366208:profile|CostlyOstrich36> ,
I tried setting requests & limits under the k8sGlue configuration in the clearml-agent Helm chart values in order to force the pods to pick up the GPU from the server, while of course choosing a pod image for the k8s jobs that includes GPU support (we're using nvidia/cuda:12.4.1 for testing).
The job is created, but it simply can't detect a GPU. Attaching the value overrides I'm using for the chart:
agentk8sglue:
  apiServerUrlReference: "http://<server-ip>:30008"
  fileServerUrlReference: "http://<server-ip>:30081"
  webServerUrlReference: "http://<server-ip>:30080"
  queue: "qubo-emulator"
  replicaCount: 1
  basePodTemplate:
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "2"
        memory: "4Gi"
        nvidia.com/gpu: "1"
clearml:
  agentk8sglueKey: "<key>"
  agentk8sglueSecret: "<secret>"
sessions:
  svcType: "NodePort"
  startingPort: 30000
  maxServices: 20
  externalIP: "<node's IP>"
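As a quick sanity check (a sketch, assuming kubectl access to the cluster; the task pod name below is a placeholder), you can confirm whether any node actually advertises nvidia.com/gpu and whether the spawned task pod requests it:

# Nodes should list nvidia.com/gpu under Capacity/Allocatable; no output means Kubernetes doesn't see any GPUs
kubectl describe nodes | grep nvidia.com/gpu
# Check the resource requests/limits of the pod spawned for the ClearML task
kubectl get pod <task-pod-name> -o jsonpath='{.spec.containers[0].resources}'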
@<1710827340621156352:profile|HungryFrog27> have you installed the Nvidia gpu-operator to advertise GPUs to Kubernetes?
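In case it helps, a minimal install sketch for the GPU Operator (assuming Helm 3 and cluster-admin access; the release name and namespace are just examples):

# Add the NVIDIA Helm repo and install the GPU Operator; it deploys the device plugin
# that advertises nvidia.com/gpu on GPU nodes
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace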
Hi @<1710827340621156352:profile|HungryFrog27> , what seems to be the issue?
Not yet, I tried making it work manually. Might give it a try, thanks!