Some breakthrough. The problem appeared because we switched the web, api and files servers to https (SSL) endpoints. I switched back to http endpoints to test this theory.
Although it's not printing the error, I suspect it's not able to connect because the self-signed cert is missing. Previously this wasn't an issue, not sure what changed in clearml_agent=1.1.0.
There's a secondary issue resulting from this, I will put it in a new thread.
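A quick way to confirm the self-signed cert theory from inside a pod is to attempt a verified TLS handshake against the server and see what fails. This is only a minimal sketch using the Python standard library: the function name `check_tls` and its parameters are hypothetical, and the host and port must be replaced with your own server's values.

```python
import socket
import ssl


def check_tls(host, port=443, cafile=None, timeout=5.0):
    """Try a verified TLS handshake and return (ok, message).

    cafile is an optional path to a PEM file holding the self-signed
    cert (or its CA). When it is None, the default trust store is used.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, f"handshake ok ({tls.version()})"
    except ssl.SSLCertVerificationError as exc:
        # Typical result for a self-signed cert that is not trusted.
        return False, f"certificate verification failed: {exc.verify_message}"
    except ssl.SSLError as exc:
        return False, f"TLS error: {exc}"
    except OSError as exc:
        # Covers refused connections, DNS failures and timeouts.
        return False, f"connection failed: {exc}"
```

If the call fails with a verification error without `cafile` and succeeds once `cafile` points at the cert, the missing cert is the likely cause. The short `timeout` also helps tell a certificate problem apart from a long network timeout on a disconnected setup.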
OK. Any idea what can go on between setting up clearml-agent and initialising the clearml-agent itself? Does the clearml-agent try to communicate with any internet address? From another perspective, it looks like a long timeout issue. I happen to be deploying on a disconnected on-premise setup.
Well, the agent does try to communicate with the ClearML Server...
Is the Glue significant in initialising clearml-agent after the pod is spawned?
Nope - once the pod is spawned the glue only monitors it externally using kubectl
- the same way you would, and will only clean it up if the task was explicitly aborted by the user.
OK. That brings me back to the spawned pod. At this point, clearml-agent and its config would be a contributing factor. Is the absence of /tmp/.clearml_agent.xxxxxx.cfg
an issue?
It's running as a long-running pod on K8s. I'm using log -f
to track its stdout.
Yeah, so that's the way to get its log and output 🙂
I want to rule out the glue being the problem. Is the Glue significant in initialising clearml-agent after the pod is spawned?
You can actually inspect the pod and see its spec, so you know when the pod is trying to run
Something there gets stuck, obviously, but it's out of the glue's hands, so to speak
You can fully control the bash script executed when the pod starts
however, it's possible the k8s glue code doesn't show enough information - it uses the kubectl inspect
call, I think
The k8s glue should be monitoring the pod and updating the status_message of the task accordingly
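To illustrate what this kind of external monitoring can and cannot see, here is a rough sketch that condenses a pod description into a one-line status. It is not the glue's actual code: `summarize_pod` is a hypothetical helper, and it assumes the pod has been fetched as JSON (for example with `kubectl get pod <name> -o json`) and parsed into a dict.

```python
def summarize_pod(pod):
    """Build a short status line from a pod dict (parsed pod JSON)."""
    status = pod.get("status", {})
    parts = [f"phase={status.get('phase', 'Unknown')}"]
    for container in status.get("containerStatuses", []):
        state = container.get("state", {})
        # A container state holds exactly one of these three keys.
        for kind in ("waiting", "running", "terminated"):
            if kind in state:
                reason = state[kind].get("reason")
                detail = f"{kind} ({reason})" if reason else kind
                parts.append(f"{container.get('name', '?')}: {detail}")
    return "; ".join(parts)
```

A container whose startup script is stuck still reports as running, so a summary like this shows nothing wrong for that case. That matches the limited visibility described above.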
Do you run the glue as a script somewhere?
Does the glue write any error logs anywhere? I only see CLEARML_AGENT_UPDATE_VERSION =
and nothing else.
I have since ruled out the apt and pypi repos. Both of them are installing properly on the pods.
Hi, I don't think clearml agent actually ran at that point in time. All I can see in the pod is
an apt install of the libpthread-stubs, libx11, libxau and libxcb1 packages, followed by a pip install of clearml-agent. After the above are successful, the pod just hangs there.
This means the ClearML Agent has not started, hence you have limited visibility (as there's no agent there to report any logging to the UI)
I did notice that in the tmp folder, .clearml_agent.xxxxx.cfg does not exist.
No no, I just meant that any sort of file or information you need to inject to the pod can be done in the bash script, before the agent will try to communicate with the server
Does the bash script need clearml-agent to be able to communicate with the https clearml-server first? If yes, there's a chicken/egg problem here.
Sorry, in case I misunderstood you: are you referring to the extra_docker_shell_script
?
What exactly do you need to pass to the pod for the self-signed cert?
OK, I get the logic now. extra_docker_shell_script
executes before clearml-agent talks to clearml server.
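Since that script runs before the agent contacts the server, one possible approach is to embed the cert in the script itself, the same way the glue embeds the config. The sketch below only generates the shell lines. `cert_install_lines` and the destination path are hypothetical, and it assumes a Debian-based image with the `ca-certificates` package installed. Whether the exported `REQUESTS_CA_BUNDLE` variable reaches the agent process depends on how the script is run, so treat that line as an assumption to verify.

```python
import base64


def cert_install_lines(
    pem_text,
    dest="/usr/local/share/ca-certificates/clearml-server.crt",  # hypothetical path
):
    """Return shell lines that write a PEM cert into the pod and trust it."""
    encoded = base64.b64encode(pem_text.encode("utf-8")).decode("ascii")
    return [
        # Recreate the cert file inside the pod from the embedded string.
        f"echo '{encoded}' | base64 --decode > {dest}",
        # Debian/Ubuntu command that rebuilds the system CA bundle.
        "update-ca-certificates",
        # Python's requests library does not read the system store by default.
        "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt",
    ]
```

The returned lines could then be placed in the extra_docker_shell_script list of the agent configuration used by the glue.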
Is there a way for k8s glue to pass on self signed cert information to the agent pods?
It writes its output to stdout (wherever you ran it from), and tries to detect the pod status and update it in the task's status_message
field (in the task's general panel)
The best thing to do is to understand why the pod is hanging (can it be related to your apt repo? do you maybe have your own pypi repo?), and enhance the k8s glue so it can detect it and report it correctly
Nope, in the k8s glue, the config file is passed to the agent in the pod using a base64-encoded string - you can see it in the pod's command spec as one of the lines that looks something like echo '...' | base64 --decode >> ~/clearml.conf
- it's injected on startup to the ~/clearml.conf
file (you can actually copy the base64-encoded string from the spec and decode it yourself if you want to see what's in there)
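Decoding the string by hand can be sketched as follows. This is an illustrative helper, not part of ClearML: `decode_injected_payloads` is a hypothetical name, and the regular expression assumes the command lines follow the `echo '...' | base64 --decode` shape quoted above.

```python
import base64
import re

# Matches: echo '<base64 text>' | base64 --decode
_PAYLOAD = re.compile(r"echo\s+'([A-Za-z0-9+/=\s]+)'\s*\|\s*base64\s+--decode")


def decode_injected_payloads(command_text):
    """Return the decoded text of every base64 payload found in a pod command."""
    decoded = []
    for match in _PAYLOAD.finditer(command_text):
        # Drop any whitespace or line wraps inside the quoted string.
        payload = "".join(match.group(1).split())
        decoded.append(base64.b64decode(payload).decode("utf-8", errors="replace"))
    return decoded
```

Pasting the command text from the pod spec into this function returns the config as the agent will see it, which makes it easy to check whether the https endpoints are the ones being injected.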