Do you know how I can verify that I don't have CUDA, or a broken CUDA installation, installed?
I don't think this is the case; it is quite specifically installing the CPU version.
BTW: after the agent fails it will not remove the venv, so you can get into it and check. From the log, it will be in: /home/tim/.clearml/venvs-builds/3.7
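One way to check the leftover environment is to ask its own interpreter which torch build it sees. This is a minimal sketch, not something the agent does itself: probe_torch is a hypothetical helper, and the bin/python location inside the venv folder is an assumption about the layout.

```python
import json
import subprocess
import sys

# Script run by the target interpreter. It reports instead of raising,
# so a missing or broken torch install still produces a result.
_PROBE = """
import json
try:
    import torch
    info = {'torch_version': torch.__version__,
            'built_with_cuda': torch.version.cuda,  # None on CPU-only builds
            'cuda_available': torch.cuda.is_available(),
            'error': None}
except Exception as exc:
    info = {'torch_version': None, 'built_with_cuda': None,
            'cuda_available': False, 'error': repr(exc)}
print(json.dumps(info))
"""


def probe_torch(python_exe=sys.executable):
    """Run the probe with the given interpreter and return the parsed dict."""
    result = subprocess.run(
        [python_exe, "-c", _PROBE],
        capture_output=True, text=True, check=True,
    )
    # The JSON report is the last line printed by the probe.
    return json.loads(result.stdout.strip().splitlines()[-1])
```

To inspect the environment mentioned above, something like probe_torch("/home/tim/.clearml/venvs-builds/3.7/bin/python") should work, assuming the interpreter lives under bin/. A built_with_cuda value of None would point to a CPU-only build.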
It asks the driver, or looks for the CUDA dll/so.
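To illustrate what "asking the driver" can look like, here is a rough sketch that loads the CUDA driver library with ctypes and calls cuDriverGetVersion. It is not the agent's actual detection code; the library names tried and the helper names are assumptions for illustration.

```python
import ctypes


def decode_cuda_version(raw):
    """CUDA encodes versions as 1000 * major + 10 * minor, e.g. 11010 -> (11, 1)."""
    return raw // 1000, (raw % 1000) // 10


def driver_cuda_version(lib_names=("libcuda.so.1", "libcuda.so", "nvcuda.dll")):
    """Return the (major, minor) CUDA version the driver reports, or None."""
    for name in lib_names:
        try:
            lib = ctypes.CDLL(name)
            version = ctypes.c_int(0)
            # cuDriverGetVersion returns 0 (CUDA_SUCCESS) on success.
            status = lib.cuDriverGetVersion(ctypes.byref(version))
        except (OSError, AttributeError):
            # Library not present on this machine, or symbol missing.
            continue
        if status == 0:
            return decode_cuda_version(version.value)
    return None
```

On a machine without an NVIDIA driver, driver_cuda_version() simply returns None.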
Okay. And 110 means 11.1 and not 11.0?
110 means 11.0. The odd thing is, it actually installed 11.1, and from the pytorch website this is exactly how they suggest installing with conda...
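The compact notation can be spelled out with a tiny helper. This sketch assumes the last digit is always the minor version and everything before it is the major version; expand_compact_cuda is a made-up name, not part of clearml-agent.

```python
def expand_compact_cuda(value):
    """Turn a compact value such as "110" or 110 into the dotted form "11.0"."""
    digits = str(value).strip()
    if not digits.isdigit() or len(digits) < 2:
        raise ValueError(f"not a compact CUDA version: {value!r}")
    # Assumption: last digit is the minor version, the rest is the major version.
    return f"{int(digits[:-1])}.{digits[-1]}"


for compact in ("110", "111", "102"):
    print(compact, "->", expand_compact_cuda(compact))
```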
Let me know if forcing the CUDA version changes anything
Quick question: where does clearml place the venv again? I want to take a look at it after the task has failed.
ca-certificates 2021.1.19 h06a4308_1
certifi 2020.12.5 py38h06a4308_0
cudatoolkit 11.0.221 h6bb024c_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses 6.2 he6710b0_1
openssl 1.1.1j h27cfd23_0
pip 20.0.2 py38_1 conda-forge
python 3.8.8 hdb3f193_4
readline 8.1 h27cfd23_0
setuptools 52.0.0 py38h06a4308_0
sqlite 3.33.0 h62c20be_0
tk 8.6.10 hbc83047_0
wheel 0.36.2 pyhd3eb1b0_0
xz 5.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
Thanks @<1523701868901961728:profile|ReassuredTiger98>
From the log, this is what conda is installing; it should have worked:
/tmp/conda_env1991w09m.yml:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
- jpeg~=9b.0
- kiwisolver~=1.3.1
- lame~=3.100
- lcms2~=2.11
- ld_impl_linux-64~=2.33.1
- libedit~=3.1.20191231
- libffi~=3.3
- libgcc-ng~=9.3.0
- libgfortran-ng~=7.3.0
- libiconv~=1.16
- libpng~=1.6.37
- libstdcxx-ng~=9.3.0
- libtiff~=4.1.0
- libuv~=1.41.0
- llvm-openmp~=11.0.1
- lz4-c~=1.9.3
- matplotlib-base~=3.3.4
- mkl~=2020.4
- mkl-service~=2.3.0
- mkl_fft~=1.3.0
- mkl_random~=1.2.0
- ncurses~=6.2
- nettle~=3.6
- networkx~=2.5
- ninja~=1.10.2
- numpy~=1.19.2
- numpy-base~=1.19.2
- olefile~=0.46
- openh264~=2.1.1
- openssl~=1.1.1j
- pyparsing~=2.4.7
- python~=3.7.10
- python-dateutil~=2.8.1
- python_abi~=3.7
- pytorch~=1.8.0
- pywavelets~=1.1.1
- pyyaml~=5.3.1
- readline~=8.1
- scikit-image~=0.17.2
- scipy~=1.6.1
- setuptools~=52.0.0
- six~=1.15.0
- sqlite~=3.33.0
- tifffile~=2020.10.1
- tk~=8.6.10
- toolz~=0.11.1
- torchaudio~=0.8.0
- torchvision~=0.9.0
- tornado~=6.1
- typing_extensions~=3.7.4.3
- wheel~=0.36.2
- xz~=5.2.5
- yaml~=0.2.5
- zlib~=1.2.11
- zstd~=1.4.9
Okay this seems correct:
pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0
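The build string is what tells a CUDA build apart from a CPU one, so it can be handy to pull it apart programmatically. A small sketch, assuming the name=version=build layout shown above; the CPU build string used in the comparison (py3.7_cpu_0) is an assumed example, not taken from this thread.

```python
import re


def parse_conda_spec(spec):
    """Split "name=version=build" into its three parts."""
    name, version, build = spec.split("=", 2)
    return name, version, build


def parse_pytorch_build(build):
    """Extract python / CUDA / cuDNN versions from a pytorch build string."""
    python = re.search(r"py(\d+\.\d+)", build)
    cuda = re.search(r"cuda(\d+(?:\.\d+)*)", build)
    cudnn = re.search(r"cudnn(\d+(?:\.\d+)*)", build)
    return {
        "python": python.group(1) if python else None,
        "cuda": cuda.group(1) if cuda else None,
        "cudnn": cudnn.group(1) if cudnn else None,
        # A "cpu" token in the build string marks a CPU-only build.
        "cpu_only": "cpu" in build.split("_"),
    }


name, version, build = parse_conda_spec("pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0")
print(name, version, parse_pytorch_build(build))
print(parse_pytorch_build("py3.7_cpu_0"))  # assumed CPU build string
```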
I can't seem to find what the diff between the two is.
Give me a second let me check if I can reproduce it somehow.
name: core
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- blas=1.0
- bzip2=1.0.8
- ca-certificates=2020.12.5
- certifi=2020.12.5
- cudatoolkit=11.1.1
- ffmpeg=4.3
- freetype=2.10.4
- gmp=6.2.1
- gnutls=3.6.13
- jpeg=9b
- lame=3.100
- lcms2=2.11
- ld_impl_linux-64=2.33.1
- libedit=3.1.20191231
- libffi=3.3
- libgcc-ng=9.3.0
- libiconv=1.16
- libpng=1.6.37
- libstdcxx-ng=9.3.0
- libtiff=4.1.0
- libuv=1.41.0
- llvm-openmp=11.0.1
- lz4-c=1.9.3
- mkl=2020.4
- mkl-service=2.3.0
- mkl_fft=1.3.0
- mkl_random=1.2.0
- ncurses=6.2
- nettle=3.6
- ninja=1.10.2
- numpy=1.19.2
- numpy-base=1.19.2
- olefile=0.46
- openh264=2.1.1
- openssl=1.1.1j
- pillow=8.1.2
- pip=21.0.1
- python=3.8.8
- python_abi=3.8
- pytorch=1.8.0
- readline=8.1
- setuptools=52.0.0
- six=1.15.0
- sqlite=3.33.0
- tk=8.6.10
- torchaudio=0.8.0
- torchvision=0.9.0
- typing_extensions=3.7.4.3
- wheel=0.36.2
- xz=5.2.5
- zlib=1.2.11
- zstd=1.4.9
- pip:
  - attrs==20.3.0
  - clearml==0.17.4
  - furl==2.1.0
  - humanfriendly==9.1
  - jsonschema==3.2.0
  - orderedmultidict==1.0.1
  - pathlib2==2.3.5
  - psutil==5.8.0
  - pyjwt==2.0.1
  - pyrsistent==0.17.3
  - pyyaml==5.4.1
  - requests-file==1.5.1
Can you ping me when it is updated in None so I can update my installation?
Yes, I think the difference is running conda install with arguments vs. conda install with an env file...
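To compare the two dependency lists pasted in this thread without eyeballing them, a quick diff helps. The sketch below only compares package names and version strings and ignores the operator (~= vs =); the two excerpts are a few lines copied from the listings above, and the helper names are hypothetical.

```python
import re

# Matches "- name~=version", "- name==version" and "- name=version".
_SPEC = re.compile(r"^-\s*([A-Za-z0-9_.\-]+?)\s*(~=|==|=)\s*(\S+)$")


def parse_dependencies(text):
    """Map package name -> version string for every dependency line."""
    deps = {}
    for line in text.splitlines():
        match = _SPEC.match(line.strip())
        if match:
            deps[match.group(1).lower()] = match.group(3)
    return deps


def diff_envs(left, right):
    """Return {name: (left_version, right_version)} for every mismatch."""
    names = sorted(set(left) | set(right))
    return {n: (left.get(n), right.get(n)) for n in names if left.get(n) != right.get(n)}


FIRST = """
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- python~=3.7.10
- pytorch~=1.8.0
"""
SECOND = """
- cudatoolkit=11.1.1
- pip=21.0.1
- python=3.8.8
- pytorch=1.8.0
"""

for name, versions in diff_envs(parse_dependencies(FIRST), parse_dependencies(SECOND)).items():
    print(name, versions)
```

A value of None on one side means the package appears in only one of the two lists.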
So to further debug I need to somehow access /tmp/conda_envaz1ne897.yml
Now I get:
Collecting package metadata (repodata.json): done
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch==1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
The following specifications were found to be incompatible with each other:
Package cudatoolkit conflicts for:
cudatoolkit=11.0
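The conflict can be read straight off the constraint: the solver only found pytorch 1.8.0 builds that accept cudatoolkit 10.1 or 10.2, and 11.0 falls outside both ranges. A simplified matcher makes that concrete. It is a sketch that handles only the comparison operators and the "," (and) / "|" (or) separators seen in the error, not conda's real MatchSpec logic.

```python
def _key(version):
    """Turn "10.2" into (10, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def _clause_ok(version, clause):
    """Check a single clause such as ">=10.1" or "<10.3"."""
    for op in (">=", "<=", "==", ">", "<"):
        if clause.startswith(op):
            v, bound = _key(version), _key(clause[len(op):])
            return {
                ">=": v >= bound, "<=": v <= bound, "==": v == bound,
                ">": v > bound, "<": v < bound,
            }[op]
    raise ValueError(f"unsupported clause: {clause!r}")


def satisfies(version, spec):
    """True if version matches any "|" alternative, each a "," joined AND."""
    return any(
        all(_clause_ok(version, clause) for clause in alternative.split(","))
        for alternative in spec.split("|")
    )


SPEC = ">=10.1,<10.2|>=10.2,<10.3"  # copied from the error message
for candidate in ("10.1", "10.2", "11.0"):
    print(candidate, satisfies(candidate, SPEC))
```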
Installed miniconda finally, now trying to run the task
Hi @<1523701868901961728:profile|ReassuredTiger98>
Could you send the full log? Also, what's the clearml-agent version?
Still shows CPU version when I run conda list