I did update it to clearml-agent 0.17.2, however the issue still persists for this long-lasting service pod.
However, the issue no longer occurs when dynamically allocating pods using the Kubernetes Glue (k8s_glue_example.py).
Hi DeliciousBluewhale87
clearml-agent 0.17.2 was just released with the fix, let me know if it works
Hi AgitatedDove14, I also fiddled around by changing this line and restarting the deployment. But this just causes it to revert back to 0.17.2rc4 again: python3 -m pip install clearml-agent==0.17.2rc3
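To confirm which clearml-agent version actually ends up installed inside the pod after a restart, something like the following could be run there. This is only a sketch: `installed_version` is a hypothetical helper name, and it relies solely on the standard-library `importlib.metadata` (Python 3.8+).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version pip actually installed, or None if the package is missing
    print(installed_version("clearml-agent"))
```

Running it before and after editing the pip line in the deployment file shows whether the change took effect.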
Yeah, I restarted the deployment and sshed into the host machine also.. (Img below)
DeliciousBluewhale87 could you restart the pod, ssh to the Host, and make sure the folder /opt/clearml/agent exists and that there is no *.conf file in it?
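A minimal sketch of that check, assuming it is run on the host (or in the pod, pointed at /root/.clearml instead). `find_conf_files` is a hypothetical helper, not part of clearml-agent.

```python
from pathlib import Path


def find_conf_files(folder):
    """Return the sorted names of *.conf files directly inside `folder`.

    Returns an empty list if the folder does not exist.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.conf") if p.is_file())


if __name__ == "__main__":
    # /opt/clearml/agent is the host folder mentioned above
    print(find_conf_files("/opt/clearml/agent"))
```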
Nothing changed.. the clearml.conf is still as is (empty)
I just changed the yaml file of clearml-agent to get it to start with the above line: python3 -m pip install clearml-agent==0.17.2rc4
This is from my k8 cluster. Using the clearml helm package, I was able to set this up.
clearml-agent deployment file
What do you mean by that? Is that the helm chart of the agent?
For the clearml-agent deployment file, I updated this line: python3 -m pip install clearml-agent==0.17.2rc4
and restarted the deployment. However, the conf file is still empty.
Should I also update clearml-agent-services in the clearml-agent-services deployment file?
Ohh okay something seems to half work in terms of configuration, the agent has enough configuration to register itself, but fails to pass it to the task.
Can you test with the latest agent RC: 0.17.2rc4
Ah kk, the ---laptop:0 worker is no more now.. But wrt our original qn, I can see the agent (worker) in the clearml-server UI..
It might be that the worker was killed before it unregistered; you will still see it there, but the last update will be stuck (after 10min it will be automatically removed)
Something is weird.. It is showing workers which are not running now...
Is the agent itself registered on the clearml-server (a.k.a can you see it in the UI?)
I just checked the /root/clearml.conf file and it just contains: sdk{ }
DeliciousBluewhale87
Upon ssh-ing into the folders in both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there..
Hmm that means it is working...
Do you see any *.conf files there? What do they contain? (They should point to the correct clearml-server config.)
Hi martin, I just untemplated the chart with: helm template clearml-server-chart-0.17.0+1.tgz
I found these lines inside:
- name: CLEARML_AGENT_DOCKER_HOST_MOUNT
  value: /opt/clearml/agent:/root/.clearml
Upon ssh-ing into the folders in both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there.. So the mounting worked, it seems.
I am not sure I get your answer. Should I change the values to something else?
Thanks
Hi DeliciousBluewhale87
My theory is that the clearml-agent is configured correctly (which means you see it in the clearml-server). The issue (I think) is that the Task itself (running inside the docker) is missing the configuration. The way the agent passes the configuration into the docker is by mapping a temporary configuration file into the docker itself. If the agent is running bare-metal, this is quite straightforward. If the agent is running on k8s (or basically inside a docker) then the agent needs:
(1) Mapping of the docker socket
(2) Mapping of a Host folder into the agent's docker
(1) is used to actually execute docker run, while (2) is used to pass information (a.k.a configuration files) from the Agent's docker into the Task's docker.
The CLEARML_AGENT_DOCKER_HOST_MOUNT environment variable is the one that tells the Agent how it can pass these config files.
You can see in the example here:
https://github.com/allegroai/clearml-server/blob/6434f1028e6e7fd2479b22fe553f7bca3f8a716f/docker/docker-compose.yml#L144
We also have to mount a folder so that the docker will be able to mount the config files into the docker:
https://github.com/allegroai/clearml-server/blob/6434f1028e6e7fd2479b22fe553f7bca3f8a716f/docker/docker-compose.yml#L147
Notice that this is not actually a PVC as there is no need for persistence; this is just a way to run a sibling docker.
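To illustrate the idea only (this is not the agent's actual code): the value has the form host_folder:container_folder, so a file the agent writes under the container folder can be referred to by its host path when launching the sibling docker. The helper names and the file name tmp_config.conf below are hypothetical.

```python
import os
from pathlib import PurePosixPath


def parse_host_mount(value):
    """Split 'host_folder:container_folder' into two PurePosixPath objects."""
    host, sep, container = value.partition(":")
    if not sep or not host or not container:
        raise ValueError(f"expected 'host_folder:container_folder', got {value!r}")
    return PurePosixPath(host), PurePosixPath(container)


def to_host_path(path_in_agent, mount_value):
    """Map a path seen inside the agent's docker to the same file on the Host."""
    host, container = parse_host_mount(mount_value)
    # relative_to raises ValueError if the path is not under the mounted folder
    relative = PurePosixPath(path_in_agent).relative_to(container)
    return str(host / relative)


if __name__ == "__main__":
    # Falls back to the value found in the helm template above
    mount = os.environ.get(
        "CLEARML_AGENT_DOCKER_HOST_MOUNT", "/opt/clearml/agent:/root/.clearml"
    )
    # Hypothetical temporary config file name, for illustration only
    print(to_host_path("/root/.clearml/tmp_config.conf", mount))
```

The docker daemon runs on the Host, so any volume path handed to docker run has to be a Host path, which is why the translation is needed.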
Make sense?
Can you try removing the port from the webhost?
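For reference, a small sketch of what "removing the port" means for a URL value. `strip_port` is a hypothetical helper and clearml.example.com is a placeholder hostname, not a value from this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_port(url):
    """Return `url` without an explicit port; the URL must include a scheme."""
    parts = urlsplit(url)
    if parts.hostname is None:
        # No scheme/host could be parsed, so leave the value untouched
        return url
    host = parts.hostname
    if ":" in host:
        # IPv6 literals need their brackets restored
        host = f"[{host}]"
    userinfo = parts.netloc.rpartition("@")[0]
    netloc = f"{userinfo}@{host}" if userinfo else host
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    print(strip_port("http://clearml.example.com:8080"))
```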
I understand, but for some reason you are getting an error about the clearml webserver. Try changing the value of agent.clearmlWebHost in the values.yaml file to the same value you filled in manually for the agent-services web host.
Hi FriendlySquid61, the clearml-agent values got filled in from the values.yaml file. However, agentservices was empty, so I filled it in manually..
Did you change anything under the agent's value?
In case you didn't - please try editing the agent.clearmlWebHost and set it to the value of your webserver (use the same one you used for the agent services).
This might solve your issue.
Yup, I used the values file for the agent. However, I manually edited the agentservices values (as there was no example for it in the github).. Also, I am not sure what the CLEARML_HOST_IP is (left it empty)