No, I see that within the k8s-agent pod when it tries to execute the experiment
I can verify that the pod is not removed, and a new one is not created when an experiment is reset and enqueued
I’ll try to remove the newline for tolerations and nodeSelector
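For context, a bare `nodeSelector:` or `tolerations:` followed only by a newline parses as null rather than the map/list those fields normally hold. A minimal sketch of how the two fields could look once populated; the label and toleration values are placeholders, not taken from this setup:
```yaml
spec:
  # nodeSelector is a map of node labels; leaving nothing after the key yields null
  nodeSelector:
    kubernetes.io/os: linux          # placeholder label
  # tolerations is a list (slice) of toleration objects, not a map
  tolerations:
    - key: "example-taint"           # placeholder toleration
      operator: "Exists"
      effect: "NoSchedule"
```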
The configmap shows this:
```
❯ k get configmaps k8sagent-pod-template -oyaml
apiVersion: v1
data:
  template.yaml: "apiVersion: v1\nmetadata:\n namespace: \nspec:\n containers:\n
    \ - resources:\n {}\n env: \n - name: CLEARML_API_HOST\n value: \"
    \"\n - name: CLEARML_WEB_HOST\n value: \"
    \"\n - name: CLEARML_FILES_HOST\n value: \"
    \"\n
    \ - name: CLEARML_API_ACCESS_KEY\n valueFrom:\n secretKeyRef:\n
    \ name: cl...
```
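Decoded, that template string corresponds roughly to the sketch below; the host values are shown as placeholders because they are stripped in the output above, and the secret name is cut off:
```yaml
apiVersion: v1
metadata:
  namespace: ""                      # left empty in the configmap
spec:
  containers:
    - resources: {}
      env:
        - name: CLEARML_API_HOST
          value: "<api host>"        # placeholder, stripped above
        - name: CLEARML_WEB_HOST
          value: "<web host>"        # placeholder, stripped above
        - name: CLEARML_FILES_HOST
          value: "<files host>"      # placeholder, stripped above
        - name: CLEARML_API_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: cl...            # truncated in the output above
```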
now waiting for the newer pod to start
Another possible issue I encountered: when an experiment completes, its pod is kept in the Completed state, and when I Reset and Enqueue the experiment again, no new pod is created; the existing one is being updated but will never actually run the experiment again
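That lines up with how Kubernetes handles finished pods: once a pod's status looks roughly like the sketch below, it will not run its command again, so only deleting it and creating a fresh pod would re-run the experiment. The container name here is a placeholder:
```yaml
# Illustrative status of a pod that finished an experiment run
status:
  phase: Succeeded
  containerStatuses:
    - name: clearml-id-<task-id>     # placeholder name
      state:
        terminated:
          reason: Completed
          exitCode: 0
```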
it has a partial pod template mapped to templates/template.yaml
Thank you very much! CostlyFox64 SuccessfulKoala55
I'll give it another try next week and keep you posted
and the k8s agent is configured to listen on that queue (see above)
Was a mistake on my end, I added an extra ] by accident
```
❯ k get pod -w
NAME                                             READY   STATUS    RESTARTS   AGE
clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl   1/1     Running   2          2d19h
clearml-apiserver-7bbcb75496-64lm7               1/1     Running   6          2d19h
clearml-elastic-master-0                         1/1     Running   2          2d6h
clearml-fileserver-68db5b6dd6-fkv4q              1/1     Running   2          2d19h
clearml-id-f7...
```
and the spec.tolerations field there is defined as a map, where it should be a slice
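In other words, the rendered template apparently ends up with something like the first form below, while the pod spec expects the second (a list of toleration objects); the toleration values are illustrative only:
```yaml
# Map form (what the template seems to render); this would be rejected,
# since the API expects a slice of tolerations here
tolerations: {}
---
# Slice form (what the pod spec expects); values are placeholders
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "clearml-agent"
    effect: "NoSchedule"
```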