Answered
Hi, Several Changes Occurred Recently And I Would Like To Know If There's A Way To Verbosely Capture All The Printout Happening Within A K8S Glue Spawned Pod. We Have An Issue Where All Of Our New remote_execution Tasks Are Stuck In The 'Pending' Stage.

Hi, several changes occurred recently and I would like to know if there's a way to verbosely capture all the printout happening within a k8s glue spawned pod. We have an issue where all of our new remote_execution tasks are stuck in the 'pending' stage. The glue seems to spawn the pod just fine, but kubectl logs -f clearml-id-xxxxxxxxxx shows that the pod only performed an apt install and a pip install, then hung after the pip install succeeded. We need more verbose logging to see what's going on.

Based on a comparison with logs from previously successful tasks, it would appear that the following should be printed next, but it's not.
Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.xxxxxx.cfg)
Changes in the past 2 days:

  1. We are using ClearML with the K8S glue, and I noted new configuration in clearml-agent=1.1.0 as follows: docker_internal_mounts { sdk_cache:"/clearml_agent cache".... ... }

  2. We changed api.file_server in both the client and agent clearml.conf to s3://ecs.ai/bucket/folder

  1. We had a low disk space issue, with Elasticsearch spewing an 'insufficient disk space' error, but we have since doubled the disk space and see no more errors.

  2. Our apt repo has issues, and apt update gives problems.

Any ideas?

  
  
Posted 3 years ago

Answers 30


Does the glue write any error logs anywhere? I only see CLEARML_AGENT_UPDATE_VERSION = and nothing else.

  
  
Posted 3 years ago

How do you need to pass it?

  
  
Posted 3 years ago

You can actually inspect the pod and see its spec, so you know when the pod is trying to run

  
  
Posted 3 years ago

Hi, I don't think clearml-agent actually ran at that point in time. All I can see in the pod is an apt install of the libpthread-stubs, libx11, libxau and libxcb1 packages, and a pip install of clearml-agent. After these succeed, the pod just hangs there.

  
  
Posted 3 years ago

OK, I get the logic now. extra_docker_shell_script executes before clearml-agent talks to the ClearML server.

  
  
Posted 3 years ago

Is the Glue significant in initialising clearml-agent after the pod is spawned?

Nope - once the pod is spawned the glue only monitors it externally using kubectl - the same way you would, and will only clean it up if the task was explicitly aborted by the user.
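
Since the glue only watches the pod from the outside, the same check can be done by hand. Below is a rough sketch (not the glue's actual code) that reduces the output of `kubectl get pod -o json` to the pod phase and each container's state. The function names and the `clearml` namespace default are assumptions made for illustration.

```python
import json
import subprocess


def summarize_pod(pod: dict) -> dict:
    """Reduce a `kubectl get pod -o json` document to phase and container states."""
    status = pod.get("status", {})
    containers = {}
    for cs in status.get("containerStatuses", []):
        # Each state dict holds one key: waiting, running, or terminated.
        for kind, detail in cs.get("state", {}).items():
            containers[cs["name"]] = (kind, detail.get("reason"))
    return {"phase": status.get("phase"), "containers": containers}


def fetch_pod(name: str, namespace: str = "clearml") -> dict:
    """Fetch the pod document with kubectl (needs kubectl and cluster access).

    The namespace default is a guess; use whichever namespace your glue runs in.
    """
    out = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling `summarize_pod(fetch_pod("clearml-id-xxxxxxxxxx"))` on a hung pod would likely show phase `Running` with the container in the `running` state, which fits the symptom here: Kubernetes considers the pod healthy, so whatever is stuck is inside the container.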

  
  
Posted 3 years ago

Do you run the glue as a script somewhere?

  
  
Posted 3 years ago

Does the bash script need clearml-agent to be able to communicate with the https clearml-server first? If yes, there's a chicken-and-egg problem here.

  
  
Posted 3 years ago

I have since ruled out the apt and pypi repos. Both of them are installing properly on the pods.

  
  
Posted 3 years ago

What exactly do you need to pass to the pod for the self-signed cert?

  
  
Posted 3 years ago

however, it's possible the k8s glue code doesn't show enough information - it uses the kubectl inspect call, I think

  
  
Posted 3 years ago

The k8s glue should be monitoring the pod and updating the status_message of the task accordingly

  
  
Posted 3 years ago

This means the ClearML Agent had not started, hence you have limited visibility (as there's no agent there to report any logging to the UI)

  
  
Posted 3 years ago

No no, I just meant that any sort of file or information you need to inject into the pod can be done in the bash script, before the agent tries to communicate with the server

  
  
Posted 3 years ago

The best thing to do is understand why the pod is hanging (can it be related to your apt repo? do you maybe have your own pypi repo?), and enhance the k8s glue so it can detect it and report it correctly

  
  
Posted 3 years ago

I want to rule out the glue being the problem. Is the Glue significant in initialising clearml-agent after the pod is spawned?

  
  
Posted 3 years ago

It writes its output to stdout (wherever you've run it from), and tries to detect the pod status and update it in the task's status_message field (in the task's general panel)

  
  
Posted 3 years ago

Nope, in the k8s glue, the config file is passed to the agent in the pod using a base64-encoded string - you can see it in the pod's command spec as one of the lines that looks something like echo '...' | base64 --decode >> ~/clearml.conf - it's injected on startup to the ~/clearml.conf file (you can actually copy the base64-encoded string from the spec and decode it yourself if you want to see what's in there)
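
To make the "decode it yourself" step concrete, here is a small sketch that pulls the base64 payload out of the pod's command text and decodes it. It assumes the line looks like the `echo '...' | base64 --decode` form described above; the function name and regex are illustrative, not part of clearml-agent.

```python
import base64
import re

# Matches: echo '<base64 payload>' | base64 --decode ...
_PATTERN = re.compile(r"echo\s+'([A-Za-z0-9+/=\s]+)'\s*\|\s*base64\s+--decode")


def decode_injected_config(command_text: str) -> list:
    """Return every base64 payload found in the command text, decoded to str."""
    return [
        base64.b64decode(payload).decode("utf-8")
        for payload in _PATTERN.findall(command_text)
    ]
```

Paste the command section of the pod spec (for example from `kubectl get pod <name> -o yaml`) into `decode_injected_config` to check which server URLs and credentials the agent in the pod will actually use.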

  
  
Posted 3 years ago

Yeah 🙂

  
  
Posted 3 years ago

I did notice that in the tmp folder, .clearml_agent.xxxxx.cfg does not exist.

  
  
Posted 3 years ago

Sorry, in case I misunderstood you: are you referring to the extra_docker_shell_script?

  
  
Posted 3 years ago

You can fully control the bash script executed when the pod starts

  
  
Posted 3 years ago

It's running as a long-running pod on K8S. I'm using log -f to track its stdout.

Yeah, so that's the way to get its log and output 🙂

  
  
Posted 3 years ago

OK. Any idea what can go on between setting up clearml-agent and initialising clearml-agent itself? Does clearml-agent try to communicate with any internet address? From another perspective, it looks like a long timeout issue. I happen to be deploying on a disconnected on-premise setup.

  
  
Posted 3 years ago

OK. That brings me back to the spawned pod. At this point, clearml-agent and its config would be a contributing factor. Is the absence of /tmp/.clearml_agent.xxxxxx.cfg an issue?

  
  
Posted 3 years ago

It's running as a long-running pod on K8S. I'm using log -f to track its stdout.

  
  
Posted 3 years ago

Well, the agent does try to communicate with the ClearML Server...

  
  
Posted 3 years ago

Is there a way for the k8s glue to pass self-signed cert information on to the agent pods?

  
  
Posted 3 years ago

Some breakthrough. The problem arose because we switched the web, api and files servers to https (SSL) endpoints. I switched back to http endpoints to test this theory.

Although it's not printing the error, I suspect it's not able to connect due to the lack of the self-signed cert. Previously this wasn't an issue; not sure what changed in clearml_agent=1.1.0.

There's a secondary issue resulting from this; I will put it in a new thread.
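
For anyone hitting the same silent hang, a quick way to test the certificate theory from inside the pod is a standalone probe that uses only the Python standard library. This is a sketch, not something clearml-agent provides: the `probe` function is made up for illustration, and it only tells a rejected certificate apart from a timeout or an unreachable host.

```python
import socket
import ssl
import urllib.error
import urllib.request


def probe(url: str, cafile: str = None, timeout: float = 5.0) -> str:
    """Try one request and classify the outcome as a short string."""
    # With cafile=None the system trust store is used; pass the path of the
    # self-signed CA certificate to check whether trusting it fixes things.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return f"ok: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, so TLS and connectivity are fine.
        return f"ok: HTTP {exc.code}"
    except urllib.error.URLError as exc:
        reason = exc.reason
        if isinstance(reason, ssl.SSLCertVerificationError):
            return f"certificate rejected: {reason.verify_message}"
        if isinstance(reason, (socket.timeout, TimeoutError)):
            return "timed out"
        return f"unreachable: {reason}"
    except (socket.timeout, TimeoutError):
        return "timed out"
```

Run it once against your https API server URL without `cafile`, and once with `cafile` pointing at the self-signed CA file. If the first call reports `certificate rejected` and the second reports `ok`, the pod is missing the certificate rather than network access.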

  
  
Posted 3 years ago

Something there gets stuck, obviously, but it's out of the glue's hands, so to speak

  
  
Posted 3 years ago