Hello everyone! I’ve installed ClearML on my Kubernetes cluster using the Helm chart. I then proceeded to clone an example experiment (3D plot reporting) and executed it, expecting a k8s job to be run, but instead I noticed that the clearml-agent containe

Answered

Hello Everyone!
I’ve installed ClearML on my Kubernetes cluster using the helm chart.
I then proceeded to clone an example experiment (3d plot reporting) and executed it, expecting a k8s job to be run, but instead I noticed that the clearml-agent container executed the experiment within the pod.
I read in the documentation that there’s a component called k8s-glue that instructs ClearML to execute experiments as k8s jobs, but can’t find the documentation on how to enable/install it, any advice?

  				
Posted 3 years ago by SmugHippopotamus96


Answers 31

trying to make sense of it

  				
Posted 3 years ago by SmugHippopotamus96

https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L313

It should have been tolerations: []. I'll send a PR soon to fix it.

In the meantime you can solve it by setting the value to k8sagent.podTemplate.tolerations: []
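
As a minimal sketch of that workaround, the override could live in a custom values file. The key path k8sagent.podTemplate.tolerations is the one named above; the file name is only an example, and nothing else about the chart's values layout is assumed here.

```yaml
# values-override.yaml (hypothetical file name)
k8sagent:
  podTemplate:
    # Empty list, so the rendered pod template carries a list-typed field
    tolerations: []
```

A file like this is typically passed to helm upgrade with -f, after which the chart re-renders the pod template ConfigMap.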

  				
Posted 3 years ago by CostlyFox64

now waiting for the newer pod to start

  				
Posted 3 years ago by SmugHippopotamus96

  				

and the k8s agent is configured to listen on that queue (see above)

  				
Posted 3 years ago by SmugHippopotamus96

Another possible issue I encountered: when an experiment completes, its pod is kept in the Complete phase. When I Reset and Enqueue the experiment again, no new pod is created; the existing one is updated but never actually runs the experiment again.

  				
Posted 3 years ago by SmugHippopotamus96

Yes, I have

  				
Posted 3 years ago by SmugHippopotamus96

As you'll probably run into issues as soon as you want to start running experiments from private repos

  				
Posted 3 years ago by CostlyFox64

No, I see that within the k8s-agent pod when it tries to execute the experiment

  				
Posted 3 years ago by SmugHippopotamus96

Now I get a different one

  				
Posted 3 years ago by SmugHippopotamus96

Ah I see it! I made a mistake in the helm chart 🙈

  				
Posted 3 years ago by CostlyFox64

I can verify that the pod is not removed, and a new one is not created when an experiment is reset and enqueued

  				
Posted 3 years ago by SmugHippopotamus96

I’ll try and remove the newline for tolerations and nodeSelector

  				
Posted 3 years ago by SmugHippopotamus96

Because if not, the k8sagent pod is still using the old version

  				
Posted 3 years ago by CostlyFox64

SmugHippopotamus96 how did this setup work for you? Are you using an autoscaling node group for the jobs?
With or without GPU?
Any additional tips on usage?

  				
Posted 3 years ago by RoughTiger69

Was a mistake on my end, added an extra ] by accident

  				
Posted 3 years ago by SmugHippopotamus96

The configmap shows this
❯ k get configmaps k8sagent-pod-template -oyaml
apiVersion: v1
data:
  template.yaml: "apiVersion: v1\nmetadata:\n namespace: \nspec:\n containers:\n \ - resources:\n {}\n env: \n - name: CLEARML_API_HOST\n value: \" \"\n - name: CLEARML_WEB_HOST\n value: \" \"\n - name: CLEARML_FILES_HOST\n value: \" \"\n \ - name: CLEARML_API_ACCESS_KEY\n valueFrom:\n secretKeyRef:\n \ name: clearml-conf\n key: apiserver_key\n - name: CLEARML_API_SECRET_KEY\n \ valueFrom:\n secretKeyRef:\n name: clearml-conf\n key: apiserver_secret\n tolerations:\n []\n nodeSelector:\n {}\n"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: clearml
    meta.helm.sh/release-namespace: clearml
  creationTimestamp: "2022-02-02T10:25:25Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: k8sagent-pod-template
  namespace: clearml
  resourceVersion: "4241060"
  uid: aec0e958-8ce9-4dfc-bd88-11a8b78bfdc1

  				
Posted 3 years ago by SmugHippopotamus96

So you see the issue - it's with the k8s glue pod spec?

  				
Posted 3 years ago by SuccessfulKoala55

and the spec.tolerations field there is defined as a map where it should be a slice
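
To illustrate the type mismatch described here: in the Kubernetes pod spec, tolerations is a list of toleration objects, so a map in its place fails to decode. A rough sketch follows. The original chart default is not quoted in this thread, so the "wrong" line is an assumption, and the populated entry uses made-up key and value names.

```yaml
spec:
  # Wrong: a map where the API expects a list (assumed shape of the chart bug)
  # tolerations: {}

  # Right: an empty list
  tolerations: []

  # Also right: a populated list; the key, value, and effect are hypothetical
  # tolerations:
  #   - key: "dedicated"
  #     operator: "Equal"
  #     value: "clearml-jobs"
  #     effect: "NoSchedule"
```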

  				
Posted 3 years ago by SmugHippopotamus96

No problem! Thank you for finding a bug in the chart 🤓

I have some other improvements to the k8sagent that I want to submit a PR for soon, so be sure to monitor the chart repo for updates!

  				
Posted 3 years ago by CostlyFox64

Thank you very much! CostlyFox64 SuccessfulKoala55

  				
Posted 3 years ago by SmugHippopotamus96

it has a partial pod template mapped to templates/template.yaml

  				
Posted 3 years ago by SmugHippopotamus96

❯ k get pod -w
NAME                                             READY   STATUS              RESTARTS   AGE
clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl   1/1     Running             2          2d19h
clearml-apiserver-7bbcb75496-64lm7               1/1     Running             6          2d19h
clearml-elastic-master-0                         1/1     Running             2          2d6h
clearml-fileserver-68db5b6dd6-fkv4q              1/1     Running             2          2d19h
clearml-id-f7cd2dc3584f443c9b7ae895b03e900c      0/1     ContainerCreating   0          9s
clearml-k8sagent-7f584456c5-wf6wd                1/1     Running             0          3m26s
clearml-mongodb-7698fc9f84-xbfhf                 1/1     Running             2          2d19h
clearml-redis-master-0                           1/1     Running             2          2d6h
clearml-webserver-55bdc98c74-ghpv4               1/1     Running             3          2d19h

  				
Posted 3 years ago by SmugHippopotamus96

https://github.com/allegroai/clearml-helm-charts/pull/54

  				
Posted 3 years ago by CostlyFox64

SmugHippopotamus96 the new version of the helm chart should fix all the issues you mentioned!

  				
Posted 3 years ago by CostlyFox64

SmugHippopotamus96 that's strange - the pod should be removed

  				
Posted 3 years ago by SuccessfulKoala55

Could be the cause of your error

  				
Posted 3 years ago by CostlyFox64

What is the error?

  				
Posted 3 years ago by CostlyFox64

Oh btw, did you restart the k8sagent pod after applying the new template?

  				
Posted 3 years ago by CostlyFox64

I'll give it another try next week and keep you posted

  				
Posted 3 years ago by SmugHippopotamus96
