SmugHippopotamus96 how did this setup work for you? are you using an autoscaling node group for the jobs?
with or without GPU?
Any additional tips on usage?
SmugHippopotamus96 the new version of the helm chart should fix all the issues you mentioned!
I can verify that the pod is not removed, and a new one is not created when an experiment is reset and enqueued
I'll give it another try next week and keep you posted
SmugHippopotamus96 that's strange - the pod should be removed
Another possible issue I encountered: when an experiment completes, its pod is kept in the Completed state, and when I Reset and Enqueue the experiment again, no new pod is created; the existing one is being updated but will never actually run the experiment again
As you'll probably run into issues as soon as you want to start running experiments from private repos
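(For when you get there - a rough sketch of what could go into the pod template's container env for private repos, using the standard clearml-agent git credential variables; the secret name is made up for illustration:)
# illustrative only: pass git credentials to the agent inside the task pod
- name: CLEARML_AGENT_GIT_USER
  valueFrom:
    secretKeyRef:
      name: git-credentials   # hypothetical secret holding your git username/token
      key: username
- name: CLEARML_AGENT_GIT_PASS
  valueFrom:
    secretKeyRef:
      name: git-credentials
      key: token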
No problem! Thank you for finding a bug in the chart 🤓
I have some other improvements to the k8sagent I want to submit a PR for soon, so be sure to monitor the chart repo for updates!
Thank you very much! CostlyFox64 SuccessfulKoala55
❯ k get pod -w
NAME                                             READY   STATUS              RESTARTS   AGE
clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl   1/1     Running             2          2d19h
clearml-apiserver-7bbcb75496-64lm7               1/1     Running             6          2d19h
clearml-elastic-master-0                         1/1     Running             2          2d6h
clearml-fileserver-68db5b6dd6-fkv4q              1/1     Running             2          2d19h
clearml-id-f7cd2dc3584f443c9b7ae895b03e900c      0/1     ContainerCreating   0          9s
clearml-k8sagent-7f584456c5-wf6wd                1/1     Running             0          3m26s
clearml-mongodb-7698fc9f84-xbfhf                 1/1     Running             2          2d19h
clearml-redis-master-0                           1/1     Running             2          2d6h
clearml-webserver-55bdc98c74-ghpv4               1/1     Running             3          2d19h
now waiting for the newer pod to start
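(A quick way to keep an eye on that new task pod while it starts - pod name and namespace taken from the listing above, adjust to your setup:)
❯ kubectl -n clearml get pod -w clearml-id-f7cd2dc3584f443c9b7ae895b03e900c
❯ kubectl -n clearml logs -f clearml-id-f7cd2dc3584f443c9b7ae895b03e900c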
Was a mistake on my end, added an extra ] by accident
Because if not, the k8sagent pod is still using the old version
Oh btw, did you restart the k8sagent pod after applying the new template?
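(One way to do that, assuming the deployment is named clearml-k8sagent as the pod name suggests:)
❯ kubectl -n clearml rollout restart deployment clearml-k8sagent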
I’ll try and remove the newline for tolerations and nodeSelector
The configmap shows this:

❯ k get configmaps k8sagent-pod-template -oyaml
apiVersion: v1
data:
  template.yaml: "apiVersion: v1\nmetadata:\n namespace: \nspec:\n containers:\n \ - resources:\n {}\n env: \n - name: CLEARML_API_HOST\n value: \"
    \"\n - name: CLEARML_WEB_HOST\n value: \"
    \"\n - name: CLEARML_FILES_HOST\n value: \"
    \"\n \ - name: CLEARML_API_ACCESS_KEY\n valueFrom:\n secretKeyRef:\n \ name: clearml-conf\n key: apiserver_key\n - name: CLEARML_API_SECRET_KEY\n \ valueFrom:\n secretKeyRef:\n name: clearml-conf\n key: apiserver_secret\n tolerations:\n []\n nodeSelector:\n {}\n"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: clearml
    meta.helm.sh/release-namespace: clearml
  creationTimestamp: "2022-02-02T10:25:25Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: k8sagent-pod-template
  namespace: clearml
  resourceVersion: "4241060"
  uid: aec0e958-8ce9-4dfc-bd88-11a8b78bfdc1
Should have been tolerations: []; I'll send a PR soon to fix it.
In the meantime you can solve it by setting the value to k8sagent.podTemplate.tolerations: []
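(Roughly, as a values override - the file name and chart reference here are just placeholders, use whatever matches your install:)
# values-override.yaml - workaround until the chart fix lands
k8sagent:
  podTemplate:
    tolerations: []

❯ helm upgrade clearml allegroai/clearml -n clearml -f values-override.yaml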
Ah I see it! I made a mistake in the helm chart 🙈
and the spec.tolerations field there is defined as a map where it should be a slice
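(In plain YAML terms, the difference is a mapping vs. a sequence for that field - roughly:)
# what the map-typed default renders for the field
tolerations: {}

# what it should render: an empty slice
tolerations: []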
it has a partial pod template mapped to templates/template.yaml
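(If you want to double-check what that template rendered to in the cluster, one way is pulling it straight out of the configmap, assuming the default name k8sagent-pod-template:)
❯ kubectl -n clearml get configmap k8sagent-pod-template -o jsonpath='{.data.template\.yaml}'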
No, I see that within the k8s-agent pod when it tries to execute the experiment
So you see the issue - it's with the k8s glue pod spec?
and the k8s agent is configured to listen on that queue (see above)