however, it's possible the k8s glue code doesn't show enough information - it uses the  kubectl inspect  call, I think
I want to rule out the glue being the problem. Is the Glue significant in initialising clearml-agent after the pod is spawned?
The k8s glue should be monitoring the pod and updating the status_message of the task accordingly
Something there gets stuck, obviously, but it's out of the glue's hands, so to speak
Does the glue write any error logs anywhere? I only see CLEARML_AGENT_UPDATE_VERSION =
and nothing else.
You can actually inspect the pod and see its spec, so you know when the pod is trying to run
Well, the agent does try to communicate with the ClearML Server...
It's running as a long-running pod on K8s. I'm using  log -f  to track its stdout.
Yeah, so that's the way to get its log and output 🙂
Do you run the glue as a script somewhere?
I did notice that in the tmp folder, .clearml_agent.xxxxx.cfg does not exist.
No no, I just meant that any sort of file or information you need to inject into the pod can be added in the bash script, before the agent tries to communicate with the server
OK. Any idea what can happen between the setup of clearml-agent and the initialisation of clearml-agent itself? Does clearml-agent try to communicate with any internet address? From another perspective, it looks like a long timeout issue. I happen to be deploying on a disconnected on-premise setup.
Is there a way for k8s glue to pass on self signed cert information to the agent pods?
Does the bash script need clearml-agent to be able to communicate with the https clearml-server first? If yes, there's a chicken-and-egg problem here.
Nope, in the k8s glue, the config file is passed to the agent in the pod using a base64-encoded string - you can see it in the pod's command spec as one of the lines that looks something like  echo '...' | base64 --decode >> ~/clearml.conf  - it's injected on startup to the  ~/clearml.conf  file (you can actually copy the base64-encoded string from the spec and decode it yourself if you want to see what's in there)
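For anyone who wants to check what was injected, here is a minimal Python sketch of the "decode it yourself" step. It assumes the command text copied from the pod spec contains the  echo '...' | base64 --decode  pattern quoted above; the exact quoting in a real spec may differ, and the function name and sample server URL are hypothetical, made up for illustration.

```python
import base64
import re
from typing import List

# Matches the pattern described above: echo '<payload>' | base64 --decode
# The exact quoting used in a real pod spec is an assumption.
_B64_ECHO = re.compile(r"echo\s+'([A-Za-z0-9+/=\s]+)'\s*\|\s*base64\s+--decode")


def decode_injected_config(command_text: str) -> List[str]:
    """Return every base64 payload found in the command text, decoded as UTF-8."""
    decoded = []
    for match in _B64_ECHO.finditer(command_text):
        payload = "".join(match.group(1).split())  # drop any wrapped whitespace
        decoded.append(base64.b64decode(payload).decode("utf-8"))
    return decoded


if __name__ == "__main__":
    # Hypothetical config content and URL, encoded locally for illustration only.
    sample_conf = 'api { api_server: "https://clearml.example.internal:8008" }\n'
    encoded = base64.b64encode(sample_conf.encode("utf-8")).decode("ascii")
    command = f"echo '{encoded}' | base64 --decode >> ~/clearml.conf"
    for text in decode_injected_config(command):
        print(text)
```

Reading the decoded text is a quick way to confirm which server URLs the agent in the pod will try to reach.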
What exactly do you need to pass to the pod for the self-signed cert?
Sorry, in case I misunderstood you: are you referring to the  extra_docker_shell_script ?
OK. That brings me back to the spawned pod. At this point, clearml-agent and its config would be a contributing factor. Is the absence of  /tmp/.clearml_agent.xxxxxx.cfg  an issue?
Some breakthrough. The problem appeared because we had switched the web, API and files servers to HTTPS (SSL) endpoints. I switched back to HTTP endpoints to test this theory.
Although it's not printing the error, I suspect it's not able to connect because the self-signed cert is missing. Previously this wasn't an issue; I'm not sure what changed in clearml_agent=1.1.0.
There's a resulting secondary issue; I will put it in a new thread.
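One way to test the HTTPS theory from inside a pod, independently of clearml-agent, is a small probe using only the Python standard library. This is a rough sketch, not something the agent or the glue does: the function names are made up, and the optional  ca_file  argument assumes the self-signed CA is available as a PEM file.

```python
import ssl
import urllib.error
import urllib.request
from typing import Optional


def build_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Default verifying context; optionally trust an extra CA bundle (PEM)."""
    context = ssl.create_default_context()
    if ca_file:
        context.load_verify_locations(cafile=ca_file)
    return context


def probe(url: str, ca_file: Optional[str] = None, timeout: float = 5.0) -> str:
    """Try one HTTPS request and classify the outcome as a short string."""
    try:
        with urllib.request.urlopen(
            url, timeout=timeout, context=build_context(ca_file)
        ) as resp:
            return f"ok (HTTP {resp.status})"
    except urllib.error.HTTPError as exc:
        # The TLS handshake succeeded; the server just returned an error status.
        return f"reachable (HTTP {exc.code})"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ssl.SSLCertVerificationError):
            return f"certificate rejected: {exc.reason.verify_message}"
        return f"connection failed: {exc.reason}"
    except OSError as exc:
        # Covers timeouts raised while reading the response.
        return f"connection failed: {exc}"
```

Calling  probe("https://<your-api-server>")  without a CA file and then again with  ca_file  pointing at the self-signed CA should show whether certificate verification is what blocks the connection. The short timeout also keeps the check from hanging on a disconnected network.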
It writes its output to stdout (wherever you've run it from), and tries to detect the pod status and update it in the task's  status_message  field (in the task's general panel)
You can fully control the bash script executed when the pod starts
I have since ruled out the apt and pypi repos. Both of them are installing properly on the pods.
This means the ClearML Agent has not started, hence you have limited visibility (as there's no agent there to report any logging to the UI)
OK, I get the logic now.  extra_docker_shell_script  executes before clearml-agent talks to the clearml server.
Hi, I don't think clearml-agent actually ran at that point in time. All I can see in the pod is
an apt install of the libpthread-stubs, libx11, libxau and libxcb1 packages and a pip install of clearml-agent. After those succeed, the pod just hangs there.
The best thing to do is to understand why the pod is hanging (can it be related to your apt repo? do you maybe have your own pypi repo?), and enhance the k8s glue so it can detect it and report it correctly
Is the Glue significant in initialising clearml-agent after the pod is spawned?
Nope - once the pod is spawned the glue only monitors it externally using  kubectl  - the same way you would, and will only clean it up if the task was explicitly aborted by the user.
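To make "monitors it externally using  kubectl " concrete, here is a minimal Python sketch of that kind of check. It is not the glue's actual code: the function names are hypothetical, and it only relies on  kubectl get pod ... -o json  and the standard Pod status fields ( phase ,  containerStatuses ).

```python
import json
import subprocess
from typing import Any, Dict


def fetch_pod(name: str, namespace: str = "default") -> Dict[str, Any]:
    """Run `kubectl get pod <name> -n <namespace> -o json` and parse the output."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def summarize_pod(pod: Dict[str, Any]) -> str:
    """Build a one-line status summary from a Pod object given as a dict."""
    status = pod.get("status", {})
    parts = [f"phase={status.get('phase', 'Unknown')}"]
    for cs in status.get("containerStatuses", []):
        state = cs.get("state", {})
        name = cs.get("name", "?")
        for kind in ("waiting", "terminated"):
            if kind in state:
                reason = state[kind].get("reason", "unknown")
                parts.append(f"{name}:{kind}({reason})")
        if "running" in state:
            parts.append(f"{name}:running")
    return " ".join(parts)
```

A pod whose container is up but whose startup script is stuck would still be summarized as running, which fits the point above: an external check like this cannot see a hang inside the container.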