SmugHippopotamus96 how did this setup work for you? are you using an autoscaling node group for the jobs?
with or without GPU?
Any additional tips on usage?
SmugHippopotamus96 the new version of the helm chart should fix all the issues you mentioned!
I can verify that the pod is not removed, and a new one is not created when an experiment is reset and enqueued
I'll give it another try next week and keep you posted
SmugHippopotamus96 that's strange - the pod should be removed
Another possible issue I encountered: when an experiment completes, its pod is kept in the Completed state, and when I Reset and Enqueue the experiment again, no new pod is created; the existing one is being updated but will never actually run the experiment again
As you'll probably run into issues as soon as you want to start running experiments from private repos
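(For when you get there - a rough sketch of what could go into the pod template's container env for private repos, using the standard clearml-agent git credential variables; the secret name is made up for illustration:)
# illustrative only: pass git credentials to the agent inside the task pod
- name: CLEARML_AGENT_GIT_USER
  valueFrom:
    secretKeyRef:
      name: git-credentials   # hypothetical secret holding your git username/token
      key: username
- name: CLEARML_AGENT_GIT_PASS
  valueFrom:
    secretKeyRef:
      name: git-credentials
      key: token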
No problem! Thank you for finding a bug in the chart 🤓
I have some other improvements to the k8sagent I want to submit a PR for soon, so be sure to monitor the chart repo for updates!
Thank you very much! CostlyFox64 SuccessfulKoala55
❯ k get pod -w
NAME                                             READY   STATUS              RESTARTS   AGE
clearml-agent-group-cpu-agent-6d9cd7cf9f-hq2dl   1/1     Running             2          2d19h
clearml-apiserver-7bbcb75496-64lm7               1/1     Running             6          2d19h
clearml-elastic-master-0                         1/1     Running             2          2d6h
clearml-fileserver-68db5b6dd6-fkv4q              1/1     Running             2          2d19h
clearml-id-f7cd2dc3584f443c9b7ae895b03e900c      0/1     ContainerCreating   0          9s
clearml-k8sagent-7f584456c5-wf6wd                1/1     Running             0          3m26s
clearml-mongodb-7698fc9f84-xbfhf                 1/1     Running             2          2d19h
clearml-redis-master-0                           1/1     Running             2          2d6h
clearml-webserver-55bdc98c74-ghpv4               1/1     Running             3          2d19h
now waiting for the newer pod to start
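(A quick way to keep an eye on that new task pod while it starts - pod name and namespace taken from the listing above, adjust to your setup:)
❯ kubectl -n clearml get pod -w clearml-id-f7cd2dc3584f443c9b7ae895b03e900c
❯ kubectl -n clearml logs -f clearml-id-f7cd2dc3584f443c9b7ae895b03e900c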
Was a mistake on my end, added an extra ] by accident
Because if not, the k8sagent pod is still using the old version
Oh btw, did you restart the k8sagent pod after applying the new template?
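(One way to do that, assuming the deployment is named clearml-k8sagent as the pod name suggests:)
❯ kubectl -n clearml rollout restart deployment clearml-k8sagent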
I’ll try and remove the newline for tolerations and nodeSelector
The configmap shows this:

❯ k get configmaps k8sagent-pod-template -oyaml
apiVersion: v1
data:
  template.yaml: "apiVersion: v1\nmetadata:\n namespace: \nspec:\n containers:\n \ - resources:\n {}\n env: \n - name: CLEARML_API_HOST\n value: \"
    \"\n - name: CLEARML_WEB_HOST\n value: \"
    \"\n - name: CLEARML_FILES_HOST\n value: \"
    \"\n \ - name: CLEARML_API_ACCESS_KEY\n valueFrom:\n secretKeyRef:\n \ name: clearml-conf\n key: apiserver_key\n - name: CLEARML_API_SECRET_KEY\n \ valueFrom:\n secretKeyRef:\n name: clearml-conf\n key: apiserver_secret\n tolerations:\n []\n nodeSelector:\n {}\n"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: clearml
    meta.helm.sh/release-namespace: clearml
  creationTimestamp: "2022-02-02T10:25:25Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: k8sagent-pod-template
  namespace: clearml
  resourceVersion: "4241060"
  uid: aec0e958-8ce9-4dfc-bd88-11a8b78bfdc1
Should have been tolerations: []; I'll send a PR soon to fix it.
In the meantime you can solve it by setting the value to k8sagent.podTemplate.tolerations: []
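(Roughly, as a values override - the file name and chart reference here are just placeholders, use whatever matches your install:)
# values-override.yaml - workaround until the chart fix lands
k8sagent:
  podTemplate:
    tolerations: []

❯ helm upgrade clearml allegroai/clearml -n clearml -f values-override.yaml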
Ah I see it! I made a mistake in the helm chart 🙈
and the spec.tolerations field there is defined as a map where it should be a slice
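(In plain YAML terms, the difference is a mapping vs. a sequence for that field - roughly:)
# what the map-typed default renders for the field
tolerations: {}

# what it should render: an empty slice
tolerations: []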
it has a partial pod template mapped to templates/template.yaml
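(If you want to double-check what that template rendered to in the cluster, one way is pulling it straight out of the configmap, assuming the default name k8sagent-pod-template:)
❯ kubectl -n clearml get configmap k8sagent-pod-template -o jsonpath='{.data.template\.yaml}'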
No, I see that within the k8s-agent pod when it tries to execute the experiment
So you see the issue - it's with the k8s glue pod spec?
and the k8s agent is configured to listen on that queue (see above)