Answered
Hello everyone! I've installed ClearML on my Kubernetes cluster using the Helm chart. I then proceeded to clone an example experiment (3D plot reporting) and executed it, expecting a k8s job to be run, but instead I noticed that the clearml-agent container...

Hello Everyone!
I’ve installed ClearML on my Kubernetes cluster using the helm chart.
I then proceeded to clone an example experiment (3d plot reporting) and executed it, expecting a k8s job to be run, but instead I noticed that the clearml-agent container executed the experiment within the pod.
I read in the documentation that there’s a component called k8s-glue that instructs ClearML to execute experiments as k8s jobs, but can’t find the documentation on how to enable/install it, any advice?
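(For context, the answers below point at a k8sagent section in the chart's values, which is where the k8s glue is configured. A rough, hypothetical sketch of such an override follows; every key except k8sagent.podTemplate.tolerations is an assumption and should be checked against the chart's values.yaml.)

k8sagent:
  enabled: true        # assumption: deploys the k8s-glue agent alongside the server
  queue: default       # assumption: the ClearML queue the glue agent polls for enqueued tasks
  podTemplate:
    tolerations: []    # k8sagent.podTemplate.*, discussed further down in this thread
    nodeSelector: {}

With something like that in place, an enqueued experiment should come up as its own clearml-id-<task-id> pod instead of running inside the agent pod (see the k get pod output later in the thread).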

  
  
Posted 2 years ago

Answers 31


SmugHippopotamus96 how did this setup work for you? Are you using an autoscaling node group for the jobs?
With or without GPU?
Any additional tips on usage?

  
  
Posted 2 years ago

SmugHippopotamus96 the new version of the helm chart should fix all the issues you mentioned!

  
  
Posted 2 years ago

I can verify that the pod is not removed, and a new one is not created when an experiment is reset and enqueued

  
  
Posted 2 years ago

I'll give it another try next week and keep you posted

  
  
Posted 2 years ago

SmugHippopotamus96 that's strange - the pod should be removed

  
  
Posted 2 years ago

Another possible issue I encountered: when an experiment completes, its pod is kept in the Complete phase, and when I Reset and Enqueue the experiment again, no new pod is created; the existing one is being updated but never actually runs the experiment again.

  
  
Posted 2 years ago

You'll probably run into issues as soon as you want to start running experiments from private repos

  
  
Posted 2 years ago

No problem! Thank you for finding a bug in the chart 🤓

I have some other improvements to the k8sagent I want to submit a PR for soon, so be sure to monitor the chart repo for updates!

  
  
Posted 2 years ago

Thank you very much! CostlyFox64 SuccessfulKoala55

  
  
Posted 2 years ago

❯ k get pod -w
NAME                                             READY   STATUS              RESTARTS   AGE
clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl   1/1     Running             2          2d19h
clearml-apiserver-7bbcb75496-64lm7               1/1     Running             6          2d19h
clearml-elastic-master-0                         1/1     Running             2          2d6h
clearml-fileserver-68db5b6dd6-fkv4q              1/1     Running             2          2d19h
clearml-id-f7cd2dc3584f443c9b7ae895b03e900c      0/1     ContainerCreating   0          9s
clearml-k8sagent-7f584456c5-wf6wd                1/1     Running             0          3m26s
clearml-mongodb-7698fc9f84-xbfhf                 1/1     Running             2          2d19h
clearml-redis-master-0                           1/1     Running             2          2d6h
clearml-webserver-55bdc98c74-ghpv4               1/1     Running             3          2d19h

  
  
Posted 2 years ago

Looks like it's working now!

  
  
Posted 2 years ago

now waiting for the newer pod to start

  
  
Posted 2 years ago

Was a mistake on my end, added an extra ] by accident

  
  
Posted 2 years ago

What is the error?

  
  
Posted 2 years ago

trying to make sense of it

  
  
Posted 2 years ago

Now I get a different one

  
  
Posted 2 years ago

Yes, I have

  
  
Posted 2 years ago

Could be the cause of your error

  
  
Posted 2 years ago

Because if not, the k8sagent pod is still using the old version

  
  
Posted 2 years ago

Oh btw, did you restart the k8sagent pod after applying the new template?

  
  
Posted 2 years ago

I’ll try and remove the newline for tolerations and nodeSelector

  
  
Posted 2 years ago

The configmap shows this:
❯ k get configmaps k8sagent-pod-template -oyaml
apiVersion: v1
data:
  template.yaml: "apiVersion: v1\nmetadata:\n namespace: \nspec:\n containers:\n \ - resources:\n {}\n env: \n - name: CLEARML_API_HOST\n value: \" \"\n - name: CLEARML_WEB_HOST\n value: \" \"\n - name: CLEARML_FILES_HOST\n value: \" \"\n \ - name: CLEARML_API_ACCESS_KEY\n valueFrom:\n secretKeyRef:\n \ name: clearml-conf\n key: apiserver_key\n - name: CLEARML_API_SECRET_KEY\n \ valueFrom:\n secretKeyRef:\n name: clearml-conf\n key: apiserver_secret\n tolerations:\n []\n nodeSelector:\n {}\n"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: clearml
    meta.helm.sh/release-namespace: clearml
  creationTimestamp: "2022-02-02T10:25:25Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: k8sagent-pod-template
  namespace: clearml
  resourceVersion: "4241060"
  uid: aec0e958-8ce9-4dfc-bd88-11a8b78bfdc1
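(Unescaping that template.yaml string for readability, the rendered partial pod template is roughly the following; indentation is approximate, since the scraped output collapsed it.)

apiVersion: v1
metadata:
  namespace:
spec:
  containers:
    - resources: {}
      env:
        - name: CLEARML_API_HOST
          value: " "
        - name: CLEARML_WEB_HOST
          value: " "
        - name: CLEARML_FILES_HOST
          value: " "
        - name: CLEARML_API_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: clearml-conf
              key: apiserver_key
        - name: CLEARML_API_SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: clearml-conf
              key: apiserver_secret
  tolerations: []
  nodeSelector: {}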

  
  
Posted 2 years ago

https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L313

It should have been tolerations: []; I'll send a PR soon to fix it.

In the meantime you can work around it by setting k8sagent.podTemplate.tolerations to []
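(For example, as a Helm values override; the k8sagent.podTemplate path is the one linked above, but double-check it against the chart version you are running. After changing it, upgrade the release and restart the k8sagent pod so it picks up the new template, as noted further down in the thread.)

k8sagent:
  podTemplate:
    tolerations: []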

  
  
Posted 2 years ago

Ah I see it! I made a mistake in the helm chart 🙈

  
  
Posted 2 years ago

and the spec.tolerations field there is defined as a map where it should be a slice
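(To illustrate the distinction: the Kubernetes API expects spec.tolerations to be a list (slice), as in the first two forms below, and rejects a mapping like the last one. The concrete toleration values are placeholders, not taken from this thread.)

# accepted: an empty slice
tolerations: []

# accepted: a slice with entries (placeholder values)
tolerations:
  - key: some-taint
    operator: Exists
    effect: NoSchedule

# rejected: a map is the wrong type for this field
tolerations: {}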

  
  
Posted 2 years ago

it has a partial pod template mapped to templates/template.yaml

  
  
Posted 2 years ago

No, I see that within the k8s-agent pod when it tries to execute the experiment

  
  
Posted 2 years ago

So you see the issue - it's with the k8s glue pod spec?

  
  
Posted 2 years ago

and the k8s agent is configured to listen on that queue (see above)

  
  
Posted 2 years ago