I Just Deployed Clearml Into K8 Cluster Using Clearml Helm Package. When I Ran A Job, It Gave This Error In The Clearml Web Server (Attached Below). I Sshed Into The Pod Running The Clearml-Agent. Upon Typing Clearml-Agent Init, I Realised The Clearml.Con

Answered

I just deployed clearml into k8 cluster using clearml helm package. When i ran a job, it gave this error in the clearml web server (Attached below). I sshed into the pod running the clearml-agent. Upon typing clearml-agent init, i realised the clearml.conf is empty and also I am not able to configure it (it is write protected). I am only able to configure the clearml.conf in the agentservices node.
So i am unable to run a clearml-agent within the k8 system, however, things are working if the clearml-agent is outisde the k8 system (my laptop). Do let me know if there is some way to debug this ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Votes Newest

Answers 34

I understand, but for some reason you are getting an error about the clearml webserver. try changing the value in the values.yaml file for the agent.clearmlWebHost to the same value you filled manually for the agent-services Web host

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					FriendlySquid61
				
					0

hi FriendlySquid61 , The clearml-agent got filled up due to values.yaml file. However, agentservices was empty so I filled it up manually..

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

I just checked the /root/clearml.conf file and it just contains
sdk{ }

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Yup, tried that.. Same error also

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Yup, i used the value file for the agent. However, i manually edited for the agentservices (as there was no example for it in the github).. Also I am not sure what is the CLEARML_HOST_IP (left it empty)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Nothing changed.. the clearml.conf is still as is (empty)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Hi AgitatedDove14 , I also fiddled around by changing this line and restarted the deployment. But this just causes it revert back 0.17.2rc4 again.
python3 -m pip install clearml-agent==0.17.2rc3

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Did you change anything under the agent's value?
In case you didn't - please try editing the agent.clearmlWebHost and set it to the value of your webserver (use the same one you used for the agent services).
This might solve your issue.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					FriendlySquid61
				
					0

Ohh okay something seems to half work in terms of configuration, the agent has enough configuration to register itself, but fails to pass it to the task.
Can you test with the latest agent RC:
0.17.2rc4

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

DeliciousBluewhale87 could you restart the pod and ssh to the Host and make sure the folder /opt/clearml/agent exists and there is not *.conf file in it ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I just changed the yaml file of clearml-agent to get it to start with the above line.
python3 -m pip install clearml-agent==0.17.2rc4

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

By the way, are you editing the values directly? Why not use the values file?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					FriendlySquid61
				
					0

DeliciousBluewhale87 and is it working?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hi martin, i just untemplate-ed the
helm template clearml-server-chart-0.17.0+1.tgzI found this lines inside.
- name: CLEARML_AGENT_DOCKER_HOST_MOUNT value: /opt/clearml/agent:/root/.clearmlUpon ssh-ing into the folders in the both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there.. So the mounting worked, it seems.
I am not sure, I get your answer. Should i change the values to something else ?
Thanks

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Ah kk, it is ---laptop:0 worker is no more now.. But wrt to our original qn, I can see the agent(worker) in the clearml-server UI ..

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Hi DeliciousBluewhale87
clearml-agent 0.17.2 was just release with the fix, let me know if it works

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

This is from my k8 cluster. Using the clearml helm package, I was able to set this up.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

DeliciousBluewhale87

Upon ssh-ing into the folders in the both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there..

Hmm that means it is working...
Do you see there a *.conf files? What do they contain? (it point to the correct clearml-server config)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

That's the agent-services one, can you check the agent's one?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					FriendlySquid61
				
					0

Is the agent itself registered on the clearml-server (a.k.a can you see it in the UI?)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Something is wierd.. It is showing workers which are not running now...

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

clearml-agent deployment file

What do you mean by that? is that the helm of the agent ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I did update it to clearml-agent 0.17.2 , however the issue still persists for this long-lasting service pod.
However, this issue is no more when trying to dynamically allocate pods using the Kubernetes Glue.
k8s_glue_example.py

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Can you try removing the port from the webhost?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					FriendlySquid61
				
					0

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

For the clearml-agent deployment file, I updated this line
python3 -m pip install clearml-agent==0.17.2rc4and restarted the deployment. However the conf file is still empty.

Should I also update the clearml-agent-services as well in the clearml-agent-services deployment file ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Yeah, I restarted the deployment and sshed into the host machine also.. (Img below)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					DeliciousBluewhale87
				
					0
					 × 1

Show more results

Write your answer

125K Views

34 Answers

4 years ago

one year ago