How can I capture verbose output from a pod spawned by the K8S glue? All new remote_execution tasks are stuck in the 'pending' stage

Answered

Hi, several changes occurred recently, and I would like to know if there is a way to capture verbose output of everything printed inside a pod spawned by the k8s glue. We have an issue where all of our new remote_execution tasks are stuck in the 'pending' stage. The glue seems to spawn the pod just fine, but kubectl logs -f clearml-id-xxxxxxxxxx shows that the pod only performed an apt install and a pip install, then hung after the pip install succeeded. We need more verbose logging to see what's going on.

Based on a comparison with logs from previously successful tasks, the following should be printed next, but it is not:
Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.xxxxxx.cfg)

Changes in the past 2 days:
1. We are using ClearML with the K8S glue, and I noticed new configuration in clearml-agent=1.1.0 as follows: docker_internal_mounts { sdk_cache:"/clearml_agent cache".... ... }
2. We changed api.file_server in both the client and agent clearml.conf to s3://ecs.ai/bucket/folder

We had a low disk space issue, with Elasticsearch reporting an 'insufficient disk space' error, but we have since doubled the disk space and the errors stopped.
Our apt repo has issues, and apt update gives problems.

Any ideas?

  				
Posted 3 years ago by SubstantialElk6


Answers 30

Yeah 🙂

  				
Posted 3 years ago by SuccessfulKoala55

This means the ClearML Agent had not started, hence the limited visibility (there is no agent in the pod to report any logging to the UI).

Posted 3 years ago by SuccessfulKoala55

I want to rule out the glue being the problem. Is the Glue significant in initialising clearml-agent after the pod is spawned?

  				
Posted 3 years ago by SubstantialElk6

I did notice that in the tmp folder, .clearml_agent.xxxxx.cfg does not exist.

Posted 3 years ago by SubstantialElk6

It's running as a long-running pod on K8S. I'm using log -f to track its stdout.

Posted 3 years ago by SubstantialElk6

No no, I just meant that any file or information you need to inject into the pod can be injected by the bash script, before the agent tries to communicate with the server.

Posted 3 years ago by SuccessfulKoala55

Hi, I don't think clearml-agent actually ran at that point in time. All I can see in the pod is:
1. apt install of the libpthread-stubs, libx11, libxau and libxcb1 packages.
2. pip install of clearml-agent.
After the above succeed, the pod just hangs there.

Posted 3 years ago by SubstantialElk6

Sorry, in case I misunderstood you: are you referring to extra_docker_shell_script?

Posted 3 years ago by SubstantialElk6

Ok. That brings me back to the spawned pod. At this point, clearml-agent and its config would be a contributing factor. Is the absence of /tmp/.clearml_agent.xxxxxx.cfg an issue?

Posted 3 years ago by SubstantialElk6

Does the glue write any error logs anywhere? I only see CLEARML_AGENT_UPDATE_VERSION = and nothing else.

Posted 3 years ago by SubstantialElk6

How do you need to pass it?

  				
Posted 3 years ago by SuccessfulKoala55

OK, I get the logic now. extra_docker_shell_script executes before clearml-agent talks to the ClearML server.

Posted 3 years ago by SubstantialElk6

It's running as a long-running pod on K8S. I'm using log -f to track its stdout.

Yeah, so that's the way to get its log and output 🙂

Posted 3 years ago by SuccessfulKoala55

The k8s glue should be monitoring the pod and updating the status_message of the task accordingly

  				
Posted 3 years ago by SuccessfulKoala55

Something there gets stuck, obviously, but it's out of the glue's hands, so to speak

  				
Posted 3 years ago by SuccessfulKoala55

You can actually inspect the pod and see its spec, so you know when the pod is trying to run

  				
Posted 3 years ago by SuccessfulKoala55

Nope, in the k8s glue, the config file is passed to the agent in the pod using a base64-encoded string - you can see it in the pod's command spec as one of the lines that looks something like echo '...' | base64 --decode >> ~/clearml.conf - it's injected on startup to the ~/clearml.conf file (you can actually copy the base64-encoded string from the spec and decode it yourself if you want to see what's in there)
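
To make the decoding step concrete, here is a minimal Python sketch; it is not the glue's own code. It assumes the pod's command text contains a line shaped like the echo '...' | base64 --decode >> ~/clearml.conf example above. The function name decode_injected_files and the regular expression are hypothetical and may need adjusting to what the actual spec looks like.

```python
import base64
import re
from typing import Dict

# Matches shell lines of the form:
#   echo '<base64 payload>' | base64 --decode >> <target file>
# The exact layout of the glue's command is an assumption taken from the
# description in this thread, not from the glue source code.
_INJECT_RE = re.compile(
    r"echo\s+'([A-Za-z0-9+/=\s]+)'\s*\|\s*base64\s+--decode\s*>>?\s*(\S+)"
)


def decode_injected_files(command_text: str) -> Dict[str, str]:
    """Return {target_path: decoded_text} for every base64 injection found."""
    decoded: Dict[str, str] = {}
    for payload, target in _INJECT_RE.findall(command_text):
        # Drop any whitespace inside the payload before decoding.
        raw = base64.b64decode("".join(payload.split()))
        text = raw.decode("utf-8", errors="replace")
        # Several injections may append to the same file, so concatenate.
        decoded[target] = decoded.get(target, "") + text
    return decoded
```

One way to use it: save the pod's command text (for example from kubectl get pod clearml-id-xxxxxxxxxx -o yaml) to a file, pass its contents to decode_injected_files, and check that the api section of the decoded clearml.conf points at the endpoints you expect.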

  				
Posted 3 years ago by SuccessfulKoala55

Is the Glue significant in initialising clearml-agent after the pod is spawned?

Nope - once the pod is spawned the glue only monitors it externally using kubectl - the same way you would, and will only clean it up if the task was explicitly aborted by the user.
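
For anyone who wants to do the same external monitoring by hand, here is a rough Python sketch that reads the pod object through kubectl get pod ... -o json and summarizes the phase and container states. The helper names summarize_pod and fetch_pod and the "default" namespace are assumptions for illustration; this is not how the glue itself is implemented.

```python
import json
import subprocess
from typing import List


def summarize_pod(pod: dict) -> List[str]:
    """Summarize phase and per-container state from a parsed Pod object."""
    status = pod.get("status", {})
    lines = [f"phase={status.get('phase', 'Unknown')}"]
    for cs in status.get("containerStatuses", []):
        # A container state holds one key: waiting, running or terminated.
        for state, detail in cs.get("state", {}).items():
            reason = detail.get("reason") or detail.get("startedAt") or ""
            lines.append(f"{cs.get('name', '?')}: {state} {reason}".rstrip())
    return lines


def fetch_pod(name: str, namespace: str = "default") -> dict:
    """Fetch the pod object by shelling out to kubectl (must be on PATH)."""
    out = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Calling summarize_pod(fetch_pod("clearml-id-xxxxxxxxxx")) prints nothing by itself; it returns a list of lines you can log. Note that a process hanging inside the container would most likely still show up as running here, which is why the pod's own stdout remains the main source of information.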

  				
Posted 3 years ago by SuccessfulKoala55

However, it's possible the k8s glue code doesn't show enough information - it uses the kubectl inspect call, I think.

Posted 3 years ago by SuccessfulKoala55

Do you run the glue as a script somewhere?

  				
Posted 3 years ago by SuccessfulKoala55

I have since ruled out the apt and pypi repos. Both of them are installing properly on the pods.

  				
Posted 3 years ago by SubstantialElk6

You can fully control the bash script executed when the pod starts

  				
Posted 3 years ago by SuccessfulKoala55

Well, the agent does try to communicate with the ClearML Server...

  				
Posted 3 years ago by SuccessfulKoala55

What exactly do you need to pass to the pod for the self-signed cert?

  				
Posted 3 years ago by SuccessfulKoala55

Does the bash script need clearml-agent to be able to communicate with the https clearml-server first? If so, there's a chicken-and-egg problem here.

Posted 3 years ago by SubstantialElk6

Some breakthrough. The problem arose because we switched the web, api and files servers to https (ssl) endpoints. I switched back to the http endpoints to test this theory.

Although it's not printing an error, I suspect it's unable to connect because the self-signed cert is missing. Previously this wasn't an issue; I'm not sure what changed in clearml_agent=1.1.0.

There's a secondary issue resulting from this; I will put it in a new thread.
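
Since the agent printed no error here, one way to test the self-signed cert theory from inside a pod is a small standalone TLS probe. The sketch below uses only the Python standard library and does not involve clearml-agent at all; the function name probe_tls and its return labels are hypothetical, and the host, port and CA file you pass in are whatever your own deployment uses.

```python
import socket
import ssl
from typing import Optional


def probe_tls(host: str, port: int = 443, cafile: Optional[str] = None,
              timeout: float = 5.0) -> str:
    """Classify a TLS connection attempt.

    Returns "ok", "cert_verify_failed", "tls_error", "timeout" or "unreachable".
    """
    # With cafile=None the system trust store is used. Pass the path of the
    # self-signed cert (or its CA) to check whether trusting it is enough.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError:
        # The server answered, but its certificate is not trusted.
        return "cert_verify_failed"
    except ssl.SSLError:
        return "tls_error"
    except socket.timeout:
        return "timeout"
    except OSError:
        # DNS failure, connection refused, no route, and similar.
        return "unreachable"
```

If the probe returns "cert_verify_failed" without a cafile and "ok" once the self-signed cert is passed as cafile, the cert theory holds. A "timeout" result would point instead at the long-timeout behaviour you might expect on a disconnected on-premise network.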

  				
Posted 3 years ago by SubstantialElk6

OK. Any idea what can go on between setting up clearml-agent and initialising clearml-agent itself? Does clearml-agent try to communicate with any internet address? From another perspective, it looks like a long timeout issue. I happen to be deploying on a disconnected on-premise setup.

Posted 3 years ago by SubstantialElk6

Is there a way for k8s glue to pass on self signed cert information to the agent pods?

  				
Posted 3 years ago by SubstantialElk6

It writes its output to stdout (wherever you ran it from), and tries to detect the pod status and update it in the task's status_message field (in the task's general panel).

Posted 3 years ago by SuccessfulKoala55

The best thing to do is understand why the pod is hanging (can it be related to your apt repo? do you maybe have your own pypi repo?), and enhance the k8s glue so it can detect it and report it correctly.

Posted 3 years ago by SuccessfulKoala55
