Answered

Hello everyone!
I've installed ClearML on my Kubernetes cluster using the Helm chart.
I then cloned an example experiment (3D plot reporting) and executed it, expecting a k8s job to be run, but instead I noticed that the clearml-agent container executed the experiment within its own pod.
I read in the documentation that there's a component called k8s-glue that instructs ClearML to execute experiments as k8s jobs, but I can't find documentation on how to enable/install it. Any advice?
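(For anyone finding this later: in this chart the Kubernetes agent is configured under the k8sagent values key, as the rest of the thread shows. A minimal sketch of enabling it could look like the snippet below; the enabled flag and the file name are assumptions, so verify the key names against the values.yaml of your chart version.)

# my-values.yaml -- sketch only; key names assumed, check your chart version
k8sagent:
  enabled: true   # assumed flag that deploys the k8s glue agent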

  
  
Posted 2 years ago

Answers 31


(image attachment)

  
  
Posted 2 years ago

It was a mistake on my end; I added an extra ] by accident.

  
  
Posted 2 years ago

No problem! Thank you for finding a bug in the chart 🤓

I have some other improvements to the k8sagent that I want to submit a PR for soon, so be sure to monitor the chart repo for updates!

  
  
Posted 2 years ago

it has a partial pod template mapped to templates/template.yaml

  
  
Posted 2 years ago

and the spec.tolerations field there is defined as a map where it should be a slice
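(To illustrate the difference, a small sketch, not copied from the chart: spec.tolerations has to be a YAML sequence, i.e. a Go slice, not a mapping.)

# wrong -- tolerations rendered as a map
spec:
  tolerations:
    {}

# right -- tolerations is a list; an empty list is valid
spec:
  tolerations: []
  # with entries it would look like:
  # tolerations:
  # - key: nvidia.com/gpu
  #   operator: Exists
  #   effect: NoSchedule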

  
  
Posted 2 years ago

❯ k get pod -w
NAME                                             READY   STATUS              RESTARTS   AGE
clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl   1/1     Running             2          2d19h
clearml-apiserver-7bbcb75496-64lm7               1/1     Running             6          2d19h
clearml-elastic-master-0                         1/1     Running             2          2d6h
clearml-fileserver-68db5b6dd6-fkv4q              1/1     Running             2          2d19h
clearml-id-f7cd2dc3584f443c9b7ae895b03e900c      0/1     ContainerCreating   0          9s
clearml-k8sagent-7f584456c5-wf6wd                1/1     Running             0          3m26s
clearml-mongodb-7698fc9f84-xbfhf                 1/1     Running             2          2d19h
clearml-redis-master-0                           1/1     Running             2          2d6h
clearml-webserver-55bdc98c74-ghpv4               1/1     Running             3          2d19h

  
  
Posted 2 years ago

Ah I see it! I made a mistake in the helm chart 🙈

  
  
Posted 2 years ago

You'll probably run into issues as soon as you want to start running experiments from private repos

  
  
Posted 2 years ago

and the k8s agent is configured to listen on that queue (see above)

  
  
Posted 2 years ago

Yes, I have

  
  
Posted 2 years ago

Thank you very much! CostlyFox64 SuccessfulKoala55

  
  
Posted 2 years ago

No, I see that within the k8s-agent pod when it tries to execute the experiment

  
  
Posted 2 years ago

Another possible issue I encountered: when an experiment completes, its pod is kept in the Completed state, and when I Reset and Enqueue the experiment again, no new pod is created; the existing one is updated but never actually runs the experiment again
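(A way to confirm this, plus a possible manual workaround that is an assumption on my part rather than something confirmed in this thread: task pods follow the clearml-id-<task id> naming seen in the listing above.)

# list leftover task pods
kubectl get pods -n clearml | grep clearml-id
# possible workaround (assumption): delete the completed pod before
# re-enqueuing, so the agent has to create a fresh one
kubectl delete pod clearml-id-<task-id> -n clearml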

  
  
Posted 2 years ago

Oh btw, did you restart the k8sagent pod after applying the new template?
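(If not, something like this should do it, assuming the Deployment is named clearml-k8sagent, inferred from the pod name in the listing above.)

# restart the glue agent so it picks up the updated pod template configmap
kubectl rollout restart deployment/clearml-k8sagent -n clearml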

  
  
Posted 2 years ago

now waiting for the newer pod to start

  
  
Posted 2 years ago

What is the error?

  
  
Posted 2 years ago

Looks like it's working now!

  
  
Posted 2 years ago

SmugHippopotamus96 the new version of the helm chart should fix all the issues you mentioned!

  
  
Posted 2 years ago

SmugHippopotamus96 that's strange - the pod should be removed

  
  
Posted 2 years ago

So you see the issue - it's with the k8s glue pod spec?

  
  
Posted 2 years ago

I'll give it another try next week and keep you posted

  
  
Posted 2 years ago

I can verify that the pod is not removed, and a new one is not created when an experiment is reset and enqueued

  
  
Posted 2 years ago

Because if not, the k8sagent pod is still using the old version

  
  
Posted 2 years ago

Could be the cause of your error

  
  
Posted 2 years ago

trying to make sense of it

  
  
Posted 2 years ago

Now I get a different one

  
  
Posted 2 years ago

The configmap shows this:
❯ k get configmaps k8sagent-pod-template -oyaml
apiVersion: v1
data:
  template.yaml: |
    apiVersion: v1
    metadata:
      namespace:
    spec:
      containers:
      - resources:
          {}
        env:
        - name: CLEARML_API_HOST
          value: " "
        - name: CLEARML_WEB_HOST
          value: " "
        - name: CLEARML_FILES_HOST
          value: " "
        - name: CLEARML_API_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: clearml-conf
              key: apiserver_key
        - name: CLEARML_API_SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: clearml-conf
              key: apiserver_secret
      tolerations:
        []
      nodeSelector:
        {}
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: clearml
    meta.helm.sh/release-namespace: clearml
  creationTimestamp: "2022-02-02T10:25:25Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: k8sagent-pod-template
  namespace: clearml
  resourceVersion: "4241060"
  uid: aec0e958-8ce9-4dfc-bd88-11a8b78bfdc1

  
  
Posted 2 years ago

https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L313

It should have been tolerations: []; I'll send a PR soon to fix it.

In the meantime you can work around it by explicitly setting k8sagent.podTemplate.tolerations: [] in your own values, as sketched below.
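A minimal sketch of that override (the file names are just examples, and the release/repo names in the upgrade command are assumptions; adjust to your install):

# values-override.yaml
k8sagent:
  podTemplate:
    tolerations: []

# apply it on top of your existing values ("my-values.yaml" stands in for your own overrides)
helm upgrade clearml allegroai/clearml -n clearml -f my-values.yaml -f values-override.yaml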

  
  
Posted 2 years ago

SmugHippopotamus96 how did this setup work for you? Are you using an autoscaling node group for the jobs?
With or without GPU?
Any additional tips on usage?

  
  
Posted 2 years ago