Answered

Hello everyone!
I’ve installed ClearML on my Kubernetes cluster using the Helm chart.
I then cloned an example experiment (3D plot reporting) and executed it, expecting a k8s Job to be run, but instead I noticed that the clearml-agent container executed the experiment within its own pod.
I read in the documentation that there’s a component called k8s-glue that instructs ClearML to execute experiments as k8s Jobs, but I can’t find documentation on how to enable/install it. Any advice?
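(A minimal values.yaml sketch of enabling the k8s glue via the chart's k8sagent section is below; the enabled and queue key names here are assumptions and should be verified against the chart's own values.yaml.)

k8sagent:
  # assumed flag: deploys the k8s glue agent, which turns enqueued
  # experiments into Kubernetes pods instead of running them in-place
  enabled: true
  # assumed key: the ClearML queue the glue agent polls; enqueue cloned
  # experiments into this queue
  queue: default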

  
  
Posted one year ago

Answers 30


trying to make sense of it

  
  
Posted one year ago

Now I get a different one

  
  
Posted one year ago

Yes, I have

  
  
Posted one year ago

Thank you very much! CostlyFox64 SuccessfulKoala55

  
  
Posted one year ago

Oh btw, did you restart the k8sagent pod after applying the new template?

  
  
Posted one year ago

❯ k get pod -w
NAME                                              READY   STATUS              RESTARTS   AGE
clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl    1/1     Running             2          2d19h
clearml-apiserver-7bbcb75496-64lm7                1/1     Running             6          2d19h
clearml-elastic-master-0                          1/1     Running             2          2d6h
clearml-fileserver-68db5b6dd6-fkv4q               1/1     Running             2          2d19h
clearml-id-f7cd2dc3584f443c9b7ae895b03e900c       0/1     ContainerCreating   0          9s
clearml-k8sagent-7f584456c5-wf6wd                 1/1     Running             0          3m26s
clearml-mongodb-7698fc9f84-xbfhf                  1/1     Running             2          2d19h
clearml-redis-master-0                            1/1     Running             2          2d6h
clearml-webserver-55bdc98c74-ghpv4                1/1     Running             3          2d19h

  
  
Posted one year ago

Was a mistake on my end, added an extra ] by accident

  
  
Posted one year ago

Looks like it's working now!

  
  
Posted one year ago

What is the error?

  
  
Posted one year ago

I’ll try and remove the newline for tolerations and nodeSelector

  
  
Posted one year ago

now waiting for the newer pod to start

  
  
Posted one year ago

No, I see that within the k8s-agent pod when it tries to execute the experiment

  
  
Posted one year ago

Ah I see it! I made a mistake in the helm chart 🙈

  
  
Posted one year ago

it has a partial pod template mapped to templates/template.yaml

  
  
Posted one year ago

So you see the issue - it's with the k8s glue pod spec?

  
  
Posted one year ago

The configmap shows this
❯ k get configmaps k8sagent-pod-template -oyaml
apiVersion: v1
data:
  template.yaml: "apiVersion: v1\nmetadata:\n namespace: \nspec:\n containers:\n - resources:\n {}\n env: \n - name: CLEARML_API_HOST\n value: \" \"\n - name: CLEARML_WEB_HOST\n value: \" \"\n - name: CLEARML_FILES_HOST\n value: \" \"\n - name: CLEARML_API_ACCESS_KEY\n valueFrom:\n secretKeyRef:\n name: clearml-conf\n key: apiserver_key\n - name: CLEARML_API_SECRET_KEY\n valueFrom:\n secretKeyRef:\n name: clearml-conf\n key: apiserver_secret\n tolerations:\n []\n nodeSelector:\n {}\n"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: clearml
    meta.helm.sh/release-namespace: clearml
  creationTimestamp: "2022-02-02T10:25:25Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: k8sagent-pod-template
  namespace: clearml
  resourceVersion: "4241060"
  uid: aec0e958-8ce9-4dfc-bd88-11a8b78bfdc1
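Decoded, the escaped template.yaml string above renders roughly as the following pod template (re-indented for readability, since the flattened output does not preserve exact indentation):

apiVersion: v1
metadata:
  namespace:
spec:
  containers:
  - resources:
      {}
    env:
    - name: CLEARML_API_HOST
      value: " "
    - name: CLEARML_WEB_HOST
      value: " "
    - name: CLEARML_FILES_HOST
      value: " "
    - name: CLEARML_API_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: clearml-conf
          key: apiserver_key
    - name: CLEARML_API_SECRET_KEY
      valueFrom:
        secretKeyRef:
          name: clearml-conf
          key: apiserver_secret
  tolerations:
    []
  nodeSelector:
    {}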

  
  
Posted one year ago

Another possible issue I encountered: when an experiment completes, its pod is kept in the Completed phase, and when I Reset and Enqueue the experiment again, no new pod is created; the existing one is being updated but never actually runs the experiment again

  
  
Posted one year ago

I can verify that the pod is not removed, and a new one is not created when an experiment is reset and enqueued

  
  
Posted one year ago

As you'll probably run into issues as soon as you want to start running experiments from private repos

  
  
Posted one year ago

SmugHippopotamus96 the new version of the helm chart should fix all the issues you mentioned!

  
  
Posted one year ago

SmugHippopotamus96 how did this setup work for you? Are you using an autoscaling node group for the jobs?
with or without GPU?
Any additional tips on usage?

  
  
Posted one year ago

and the spec.tolerations field there is defined as a map, whereas it should be a slice (a list)
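In other words, roughly (illustrative snippet, not the exact line from the chart):

# what the current default renders: an empty mapping, which a pod spec rejects for tolerations
tolerations: {}
# what a pod spec expects: a slice/list (an empty one is fine)
tolerations: []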

  
  
Posted one year ago

SmugHippopotamus96 that's strange - the pod should be removed

  
  
Posted one year ago

Could be the cause of your error

  
  
Posted one year ago

and the k8s agent is configured to listen on that queue (see above)

  
  
Posted one year ago

I'll give it another try next week and keep you posted

  
  
Posted one year ago

https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L313

It should have been tolerations: []; I'll send a PR soon to fix it.

In the meantime you can work around it by explicitly setting k8sagent.podTemplate.tolerations: [] in your values.
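For example, a minimal values override (using the k8sagent.podTemplate path mentioned above):

k8sagent:
  podTemplate:
    # force an empty list so the rendered pod template gets a valid,
    # slice-typed tolerations field instead of the map default
    tolerations: []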

  
  
Posted one year ago

Because if not, the k8sagent pod is still using the old version

  
  
Posted one year ago

No problem! Thank you for finding a bug in the chart 🤓

I have some other improvements to the k8sagent I want to submit a PR for soon, so be sure to monitor the chart repo for updates!

  
  
Posted one year ago