I installed my local conda environment from an environment.yml
without issues, so maybe clearml makes some changes that leads to conflicts which finally leads to the cpu-version install.
For now I can tell you that with conda_freeze: true
it fails, but with conda_freeze: false
it works!
The problem is that clearml installs
cudatoolkit=11.0
but
cudatoolkit=11.1
is needed.
You suggested this fix earlier, but I am not sure why it didnt work then.
Hmm , could you test with the clearml-agent 0.17.2 ? making surethis actually solves the problem
You suggested this fix earlier, but I am not sure why it didnt work then.
The problem is that clearml installs cudatoolkit=11.0
but cudatoolkit=11.1
is needed. By setting agent.cuda_version=11.1
in clearml.conf
it uses the correct version and installs fine. With version 11.0
conda will resolve conflicts by installing pytorch cpu-version.
Is ther a way to see the contents of /tmp/conda_envaz1ne897.yml
? Seems to be deleted after the task is finihsed
==> 2021-03-11 12:50:38 <==
# cmd: /home/tim/miniconda3/condabin/conda create --yes --mkdir --prefix /home/tim/.clearml/venvs-builds/3.8 python=3.8
--
==> 2021-03-11 12:50:40 <==
# cmd: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch cudatoolkit=11.0 --quiet --json
--
==> 2021-03-11 12:50:43 <==
# cmd: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch pip<20.2 --quiet --json
--
==> 2021-03-11 12:51:17 <==
# cmd: /home/tim/miniconda3/bin/conda-env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envaz1ne897.yml --quiet --json
Nvm, I took a look at conda history and there I see it
From the logs when ran with --foreground I
I do not see any conda create
command.
How does clearml-agent create the conda environment?
Interesting: This command failes (with an error similar to the one I posted above) in conda version 4.7.12 but runs just fine in version 4.9.2: conda create --name test-pytorch python=3.8 cudatoolkit=11.1 -c conda-forge
Tried to install cudatoolkit==11.1 manually in this environemnt and got:
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package xz conflicts for:
python=3.8 -> xz[version='>=5.2.4,<5.3.0a0|>=5.2.4,<6.0a0|>=5.2.5,<5.3.0a0|>=5.2.5,<6.0a0']
Package libstdcxx-ng conflicts for:
python=3.8 -> libstdcxx-ng[version='>=7.3.0|>=7.5.0|>=9.3.0']
cudatoolkit=11.1 -> libstdcxx-ng[version='>=9.3.0']
Package libgcc-ng conflicts for:
cudatoolkit=11.1 -> libgcc-ng[version='>=9.3.0']
python=3.8 -> libgcc-ng[version='>=7.3.0|>=7.5.0|>=9.3.0']
Package __glibc conflicts for:
cudatoolkit=11.1 -> __glibc[version='>=2.17,<3.0.a0']
Package libffi conflicts for:
python=3.8 -> libffi[version='>=3.2.1,<3.3.0a0|>=3.2.1,<3.3a0|>=3.3,<3.4.0a0']
Package ncurses conflicts for:
python=3.8 -> ncurses[version='>=6.1,<6.3.0a0|>=6.1,<7.0a0|>=6.2,<6.3.0a0|>=6.2,<7.0a0']
Package zlib conflicts for:
python=3.8 -> zlib[version='>=1.2.11,<1.3.0a0']
Package python_abi conflicts for:
python=3.8 -> python_abi[version='*|3.8.*',build=*_cp38]
Package sqlite conflicts for:
python=3.8 -> sqlite[version='>=3.30.0,<4.0a0|>=3.30.1,<4.0a0|>=3.31.1,<4.0a0|>=3.32.3,<4.0a0|>=3.33.0,<4.0a0|>=3.34.0,<4.0a0']
Package bzip2 conflicts for:
python=3.8 -> bzip2[version='>=1.0.8,<2.0a0']
Package readline conflicts for:
python=3.8 -> readline[version='>=7.0,<8.0a0|>=8.0,<9.0a0']
Package openssl conflicts for:
python=3.8 -> openssl[version='>=1.1.1a,<1.1.2a|>=1.1.1d,<1.1.2a|>=1.1.1e,<1.1.2a|>=1.1.1f,<1.1.2a|>=1.1.1g,<1.1.2a|>=1.1.1h,<1.1.2a|>=1.1.1i,<1.1.2a|>=1.1.1j,<1.1.2a']
Package tk conflicts for:
python=3.8 -> tk[version='>=8.6.10,<8.7.0a0|>=8.6.8,<8.7.0a0|>=8.6.9,<8.7.0a0']
Package pip conflicts for:
python=3.8 -> pip
Package ld_impl_linux-64 conflicts for:
python=3.8 -> ld_impl_linux-64[version='>=2.34']The following specifications were found to be incompatible with your CUDA driver:
- cudatoolkit=11.1 -> __cuda[version='>=11.1']
Your installed CUDA driver is: 11.2
Now I get:
ollecting package metadata (repodata.json): done
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch==1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
The following specifications were found to be incompatible with each other:
Package cudatoolkit conflicts for:
cudatoolkit=11.0
conda env update -p .clearml/venvs-builds/3.8 ./environment.yml
with environment.yml
name: clearml
channels:
- pytorch
- anaconda
- conda-forge
- defaults
dependencies:
- pytorch==1.8.0
ca-certificates 2021.1.19 h06a4308_1
certifi 2020.12.5 py38h06a4308_0
cudatoolkit 11.0.221 h6bb024c_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses 6.2 he6710b0_1
openssl 1.1.1j h27cfd23_0
pip 20.0.2 py38_1 conda-forge
python 3.8.8 hdb3f193_4
readline 8.1 h27cfd23_0
setuptools 52.0.0 py38h06a4308_0
sqlite 3.33.0 h62c20be_0
tk 8.6.10 hbc83047_0
wheel 0.36.2 pyhd3eb1b0_0
xz 5.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
Uninstall the current clearml-agent and reinstall this wheel, I hacked it to have ==, let's see if that works
Maybe the ~= is breaking the conda "magic" version resolver
Hmm maybe this is the issue, :
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
This makes no sense, conda is saying pytorch=1.8 needs cudatoolkit <10.2/10.3 but actually it needs cudatoolkit 11.1
I tried to run the task with detect_with_conda_freeze: false
instead of true
and got
Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json
Pass
Conda: Trying to install requirements:
['pytorch~=1.8.0']
Executing Conda: /home/tim/miniconda3/condabin/conda env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envh7rq4qmc.yml --quiet --json
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
The following specifications were found to be incompatible with each other:
Package cudatoolkit conflicts for:
cudatoolkit=11.0
Conda: Installing requirements: step 2 - using pip:
['clearml==0.17.4', 'tensorboard==2.4.1', 'pytorch~=1.8.0']
Collecting tensorboard==2.4.1
Using cached tensorboard-2.4.1-py3-none-any.whl (10.6 MB)
ERROR: Could not find a version that satisfies the requirement pytorch~=1.8.0 (from -r /tmp/cached-reqsubuv0zrf.txt (line 3)) (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch~=1.8.0 (from -r /tmp/cached-reqsubuv0zrf.txt (line 3))
Command 'source /home/tim/miniconda3/etc/profile.d/conda.sh && conda activate /home/tim/.clearml/venvs-builds/3.8 && pip install -r /tmp/cached-reqsubuv0zrf.txt' returned non-zero exit status 1.
Also tried conda version 4.7.12. Same problem.
Thanks @<1523701868901961728:profile|ReassuredTiger98>
From the log this is what conda is installing, it should have worked
/tmp/conda_env1991w09m.yml:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
- jpeg~=9b.0
- kiwisolver~=1.3.1
- lame~=3.100
- lcms2~=2.11
- ld_impl_linux-64~=2.33.1
- libedit~=3.1.20191231
- libffi~=3.3
- libgcc-ng~=9.3.0
- libgfortran-ng~=7.3.0
- libiconv~=1.16
- libpng~=1.6.37
- libstdcxx-ng~=9.3.0
- libtiff~=4.1.0
- libuv~=1.41.0
- llvm-openmp~=11.0.1
- lz4-c~=1.9.3
- matplotlib-base~=3.3.4
- mkl~=2020.4
- mkl-service~=2.3.0
- mkl_fft~=1.3.0
- mkl_random~=1.2.0
- ncurses~=6.2
- nettle~=3.6
- networkx~=2.5
- ninja~=1.10.2
- numpy~=1.19.2
- numpy-base~=1.19.2
- olefile~=0.46
- openh264~=2.1.1
- openssl~=1.1.1j
- pyparsing~=2.4.7
- python~=3.7.10
- python-dateutil~=2.8.1
- python_abi~=3.7
- pytorch~=1.8.0
- pywavelets~=1.1.1
- pyyaml~=5.3.1
- readline~=8.1
- scikit-image~=0.17.2
- scipy~=1.6.1
- setuptools~=52.0.0
- six~=1.15.0
- sqlite~=3.33.0
- tifffile~=2020.10.1
- tk~=8.6.10
- toolz~=0.11.1
- torchaudio~=0.8.0
- torchvision~=0.9.0
- tornado~=6.1
- typing_extensions~=3.7.4.3
- wheel~=0.36.2
- xz~=5.2.5
- yaml~=0.2.5
- zlib~=1.2.11
- zstd~=1.4.9
btw: I also tested the clearml-agent running on a different machine and with python 3.8 and I get the same problems.