I'd suggest that you try what AgitatedDove14 suggested https://clearml.slack.com/archives/CTK20V944/p1614540843119600?thread_ts=1613923591.002100&cid=CTK20V944 . It seems like you're using an older version of the agent somehow.
I think using the glue could be a good solution for you, so it seems like a good idea to try and get it to work.
SubstantialElk6 - As a side note, since Docker is about to be deprecated, we plan to switch to another runtime sometime in the near future. This actually means that the entire docker.sock issue will not be relevant very soon 🙂
If you want the agent to run in docker mode, the docker.sock should be exposed. But that's the only reason for this configuration.
What do you mean by 'not taking effect with the k8s glue'?
We are deploying ClearML Server via the docker-compose.
For the ClearML Agent, we have the choice of Docker or, preferably, K8s (using the glue).
For K8S, we can't get the glue to work ( https://clearml.slack.com/archives/CTK20V944/p1614525898114200?thread_ts=1613923591.002100&cid=CTK20V944 ) so we can't make an assessment of whether it actually works for us.
Thanks, it's attached.
I also noted that the status in ClearML is always 'Pending', unlike others which say 'Running'. Is this a side effect of using the k8s glue?
Can you try setting the base_docker_image of the specific task you are running to nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true ?
To do so, go to the task's Execution tab, scroll down, and set the Base Docker section to the above.
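If it's easier, the same thing should also be settable from code before the task is enqueued; here's a minimal sketch using the ClearML SDK (the task ID below is just a placeholder):
```python
from clearml import Task

# Placeholder ID - replace with the task you are about to enqueue
task = Task.get_task(task_id="<your-task-id>")

# Base docker image plus extra docker arguments for the container
# the agent (or the k8s glue) will spin up for this task
task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true")
```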
Hi, thanks. How about the Agent, does its docker mode or k8s mode require docker.sock to be exposed?
This would be solved if --env GIT_SSL_NO_VERIFY=true is passed to the k8s pod that's spawned to run the job. Currently it's not.
Seems like the env variable isn't passed for some reason, we'll push a fix for this issue soon, I'll keep you posted 🙂
FriendlySquid61 was this pushed to the clearml-agent?
Hi, clearml-agent==0.17.2rc3 did work. I'm on a 1.19 k8s cluster, and get this error when a task is pulled. Is the glue not compatible with 1.19?
` Pulling task 3a90802d1dfa4ec09fbccba0beffbaa8 launching on kubernetes cluster
Pushing task 3a90802d1dfa4ec09fbccba0beffbaa8 into temporary pending queue
Kubernetes scheduling task id=3a90802d1dfa4ec09fbccba0beffbaa8
kubectl output:
Flag --replicas has been deprecated, has no effect and will be removed in the future.
Flag --generator has been deprecated, has no effect and will be removed in the future.
pod/clearml-gpu-id-3a90802d1dfa4ec09fbccba0beffbaa8 created
Running kubectl encountered an error: Flag --replicas has been deprecated, has no effect and will be removed in the future.
Flag --generator has been deprecated, has no effect and will be removed in the future.
Running kubectl encountered an error: Flag --replicas has been deprecated, has no effect and will be removed in the future.
Flag --generator has been deprecated, has no effect and will be removed in the future.
No tasks in queue 943fce37803044ef89f6d9af0cd5279c `
Good, are we sure that the problem is that the variable isn't set?
Can you please run kubectl describe pod <task-pod-name>
and send me the output?
Hi, please correct me if I am wrong, to use the glue I need the following:
- A k8s cluster
- A kubectl that is connected to the k8s cluster
- A pip install of clearml-agent 0.17.1
I did all of the above; I'm not sure what is meant by running the entire thing on my own machine.
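(For reference, the glue daemon itself is usually started with something along these lines; this is just a rough sketch based on the k8s_glue_example.py script in the clearml-agent repo, and the class and argument names may differ between agent versions:)
```python
# Rough sketch of starting the k8s glue daemon, adapted from
# k8s_glue_example.py in the clearml-agent repo (names may vary by version).
from clearml_agent.glue.k8s import K8sIntegration

k8s = K8sIntegration()          # uses the kubectl context of this machine
k8s.k8s_daemon("<queue-name>")  # poll this ClearML queue and spawn pods for its tasks
```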
Hey SubstantialElk6 ,
I'm assuming you are referring to our helm charts?
If so, then you can set agent.dockerMode to false ( https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/values.yaml#L46 ), and the docker.sock configuration will be turned off. Note that this means your agents will not be running in docker mode 🙂
Unfortunately it's not. The problem previously encountered with the docker method surfaced again. In this case, the BASE DOCKER IMAGE
nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true
is not taking effect with the k8s glue.
We will fix it and remove the deprecated flags.
In any case, it shouldn't cause issues with your tasks. Is it running?
Again, assuming you are referring to the helm charts. How are you deploying ClearML?
Hey SubstantialElk6 ,
This issue was fixed in the latest clearml-agent version.
Please try using v0.17.2 🙂
So the issue was probably the clearml-agent version.
Please try using clearml-agent==0.17.2rc3 and let us know if this solved the issue.
Thanks 👍. Should I create an issue on GitHub?