and the k8s agent is configured to listen on that queue (see above)
and the spec.tolerations field there is defined as a map, where it should be a slice
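For comparison, a minimal sketch of the two shapes (the toleration key/operator/effect values here are just placeholders, not what's in my template):
```yaml
# how the field ends up when rendered as a map (not valid for a pod spec)
spec:
  tolerations:
    key: "example-key"
    operator: "Exists"
    effect: "NoSchedule"
---
# what the pod spec expects: a slice (sequence) of toleration objects
spec:
  tolerations:
  - key: "example-key"
    operator: "Exists"
    effect: "NoSchedule"
```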
I'll give it another try next week and keep you posted
I can verify that the pod is not removed, and a new one is not created when an experiment is reset and enqueued
```
❯ k get pod -w
NAME                                             READY   STATUS    RESTARTS   AGE
clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl   1/1     Running   2          2d19h
clearml-apiserver-7bbcb75496-64lm7               1/1     Running   6          2d19h
clearml-elastic-master-0                         1/1     Running   2          2d6h
clearml-fileserver-68db5b6dd6-fkv4q              1/1     Running   2          2d19h
clearml-id-f7...
```
now waiting for the new pod to start
Thank you very much! CostlyFox64 SuccessfulKoala55
it has a partial pod template mapped to templates/template.yaml
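Roughly like this, as a sketch of the chart side (the .Values.podTemplate key is an assumed name, not necessarily the actual values entry):
```yaml
# templates/template.yaml (sketch; .Values.podTemplate is an assumed key name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: k8sagent-pod-template
data:
  template.yaml: |
    {{- toYaml .Values.podTemplate | nindent 4 }}
```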
Another possible issue I encountered: when an experiment completes, its pod is kept in the Completed phase, and when I Reset and Enqueue the experiment again, no new pod is created; the existing one is being updated but will never actually run the experiment again
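One manual workaround that might help until this is sorted (the pod name below is a placeholder for the clearml-id-... pod of the finished experiment): check the finished pod's phase and delete it before re-enqueueing, so a fresh pod can be created:
```
❯ k get pod <experiment-pod> -o jsonpath='{.status.phase}'
❯ k delete pod <experiment-pod>
```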
The configmap shows this:
```
❯ k get configmaps k8sagent-pod-template -oyaml
apiVersion: v1
data:
  template.yaml: "apiVersion: v1\nmetadata:\n namespace: \nspec:\n containers:\n
    \ - resources:\n {}\n env: \n - name: CLEARML_API_HOST\n value: \"
    \"\n - name: CLEARML_WEB_HOST\n value: \"
    \"\n - name: CLEARML_FILES_HOST\n value: \"
    \"\n
    \ - name: CLEARML_API_ACCESS_KEY\n valueFrom:\n secretKeyRef:\n
    \ name: cl...
```
I’ll try and remove the newline for tolerations and nodeSelector
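If the stray newline is what turns those fields into strings, the rendered template should end up looking something like this (values are placeholders; empty collections could also just be omitted):
```yaml
# sketch of the relevant fields in the partial pod template;
# tolerations must be a sequence and nodeSelector a mapping,
# not strings that only contain a newline
spec:
  nodeSelector: {}     # or e.g. kubernetes.io/os: linux
  tolerations: []      # or a list of toleration objects, as in the earlier example
```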
Was a mistake on my end, added an extra ] by accident
No, I see that within the k8s-agent pod when it tries to execute the experiment
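For reference, that would be the log stream of the agent pod from the listing above, e.g.:
```
❯ k logs -f clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl
```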