You're right, I forgot: ClearML-Agent also tries to match a version to something that will work on the system it's running on.
I used the wrong docker container. The docker container I used had version 11.4. Interestingly, the override from clearml.conf and the CUDA_VERSION environment variable did not work there.
With the correct docker container everything works fine. Shame on me.
I have to correct myself: I do not even have CUDA installed, only the driver. Everything CUDA-related is provided by the docker container. This works with a container that has CUDA 11.4, but now I have one with 11.6 (latest nvidia pytorch docker).
However, even after changing the clearml.conf and overriding with CUDA_VERSION, the clearml-agent prints agent.cuda_version = 114 in the docker container! (Other changes to the clearml.conf on the agent are reflected in the docker, so only the CUDA version has an issue.)
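For reference, the agent log above shows the version in compact form (114 for CUDA 11.4). Below is a minimal sketch of that mapping, assuming the compact form is simply the major and minor numbers concatenated. `compact_cuda_version` is a hypothetical helper for illustration, not part of clearml-agent:

```python
def compact_cuda_version(version: str) -> str:
    """Turn a dotted CUDA version such as "11.6" into the compact "116" form.

    Assumption: only the major and minor components matter; any further
    components (patch, build) are ignored.
    """
    parts = version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"unexpected CUDA version string: {version!r}")
    major, minor = int(parts[0]), int(parts[1])
    return f"{major}{minor}"


if __name__ == "__main__":
    # The two container versions mentioned in this thread.
    for dotted in ("11.4", "11.6"):
        print(dotted, "->", compact_cuda_version(dotted))
```

With this, a container on CUDA 11.6 would be expected to show up as 116 rather than 114, which is a quick way to spot a stale value in the log.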
Hi ReassuredTiger98,
I think it is something that was logged during the initial run, then the clearml-agent simply recreates the environment 🙂
Nvm, I think it's my mistake. I will investigate.
I am wondering because, when used in docker mode, the docker container may have a CUDA version that is different from the host version. However, ClearML seems to use the host version instead of the docker container's version, which is sometimes a problem.
Tested with clearml-agent 1.0.1rc4/1.2.2 and clearml 1.3.2
Hi CostlyOstrich36, thank you for answering so quickly. I think that's not how it works, because if this were true, one would always have to match the local machine to the servers. Afaik clearml finds the correct PyTorch version, but I was not sure how (custom vs. how pip does it).