Oh, you are right. I did not think this through... Implementing this properly gets too enterprisey for me, so I'll just leave it for now :D
So I just updated the env that clearml-agent created (and where the CPU-only PyTorch is installed) with my local environment.yml, and now the correct version is installed. So most probably the `/tmp/conda_envaz1ne897.yml` is the problem here.
I see, so it is actually not related to clearml 🎉
In the first run the package only existed because it is preinstalled in the docker image. Afaik, in the second run it is also preinstalled, but pip will first try to resolve it and then see whether it already exists. But I am not too sure about this.
Would it help you diagnose this problem if I ran `conda env create --file=environment.yml`
and saw whether it works?
This is my environment installed from the env file. Training works just fine here:
First one is the original, second one the clone
The agent is run with pip. However, the docker image uses conda (most probably because NVIDIA uses conda to build PyTorch). My theory is that when the task is run for the first time on an agent, Task.init will update the requirements. Then, when run a second time, the task will contain the requirements of the (conda) environment from the first run.
clearml will register conda packages that cannot be installed if clearml-agent is configured to use pip. So although it is nice that a complete package list is tracked, it makes it cumbersome to rerun the experiment.
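If that is what is happening, a possible workaround might be to declare the pip requirement explicitly before Task.init, so the task does not rely on the frozen conda environment from the first run. Just a sketch; I think Task.add_requirements exists for this, but I have not double-checked the exact signature, and the project/task names below are only placeholders:

# Sketch of a possible workaround: register an explicit pip requirement instead of
# relying on the conda environment captured from the first run.
# Note: this has to be called before Task.init.
from clearml import Task

Task.add_requirements("torch", ">=1.7")

# Placeholder project and task names, just for illustration.
task = Task.init(project_name="my_project", task_name="pytorch_training")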
I am currently on the move, but it was something like `upstream server not found` in /etc/nginx/nginx.conf and, if I remember correctly, line 88.
Afaik, clearml-agent will use existing installed packages if they fit the requirements.txt. E.g. `pytorch >= 1.7`
will only install PyTorch if the environment does not already provide some version of PyTorch greater than or equal to 1.7.
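To make that concrete, here is a rough sketch (using the packaging library, not what the agent literally runs) of the check that decides whether the preinstalled PyTorch already satisfies the requirement and the install gets skipped:

# Rough illustration of the ">= 1.7" check: if the preinstalled torch already
# satisfies the specifier, there is nothing left to install.
from packaging.specifiers import SpecifierSet
from packaging.version import Version
import torch

spec = SpecifierSet(">=1.7")
installed = Version(torch.__version__.split("+")[0])  # strip local tags like "+cu111"
print(f"torch {installed} satisfies '>=1.7': {installed in spec}")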
Outside of the clearml.Task?
Ah, never mind. I was wrong here.
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit==11.1.1
- pytorch==1.8.0
Gives the CPU version.
I ran `docker run -it -v /home/hostuser/.ssh/:/root/.ssh ubuntu:18.04`
but cloning does not work, and this is what `ls -lah /root/.ssh`
gives inside the docker container:
-rw------- 1 1001 1001 1.5K Apr 8 12:28 authorized_keys
-rw-rw-r-- 1 1001 1001 208 Apr 29 09:15 config
-rw------- 1 1001 1001 432 Apr 8 12:53 id_ed25519
-rw-r--r-- 1 1001 1001 119 Apr 8 12:53 id_ed25519.pub
-rw------- 1 1001 1001 432 Apr 29 09:16 id_gitlab
-rw-r--r-- 1 1001 1001 119 Apr 29 09:25 id_gitlab.pub
-...
But here is the funny thing:
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
Installs the GPU version.
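For anyone who wants to verify which build actually ended up in the env after `conda env create`, a quick sanity check inside the environment could look like this (just a sketch, run inside the freshly created env):

# Quick sanity check: distinguish the CPU-only build from the CUDA build of PyTorch.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # False for the CPU-only build
print("built with CUDA:", torch.version.cuda)        # None for the CPU-only build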
Maybe the difference is that I am using pip now and I used to use conda! The NVIDIA PyTorch container uses conda. Could that be a reason?