Okay this seems correct:
pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0
I can't seem to find what's the diff between the two.
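For comparing two package lines, the build string (the part after the second `=`) is usually where a CUDA build and a CPU build differ. A minimal sketch for pulling it apart; `describe_build` is a hypothetical helper, and the `py3.7_cpu_0` string is only an assumed example of what a CPU build might look like:

```python
import re


def describe_build(spec: str) -> dict:
    """Split a conda 'name=version=build' spec and guess the build variant."""
    name, version, build = spec.split("=", 2)
    # Look for a 'cudaX.Y' token inside the build string.
    cuda = re.search(r"cuda(\d+(?:\.\d+)*)", build)
    if cuda:
        variant = "cuda"
    elif "cpu" in build:
        variant = "cpu"
    else:
        variant = "unknown"
    return {
        "name": name,
        "version": version,
        "build": build,
        "variant": variant,
        "cuda": cuda.group(1) if cuda else None,
    }


print(describe_build("pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0"))
# Assumed example of a CPU build string, not taken from the log:
print(describe_build("pytorch=1.8.0=py3.7_cpu_0"))
```

Running `conda list pytorch` in both environments and feeding each line through this should make the difference obvious.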
Give me a second let me check if I can reproduce it somehow.
name: core
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- blas=1.0
- bzip2=1.0.8
- ca-certificates=2020.12.5
- certifi=2020.12.5
- cudatoolkit=11.1.1
- ffmpeg=4.3
- freetype=2.10.4
- gmp=6.2.1
- gnutls=3.6.13
- jpeg=9b
- lame=3.100
- lcms2=2.11
- ld_impl_linux-64=2.33.1
- libedit=3.1.20191231
- libffi=3.3
- libgcc-ng=9.3.0
- libiconv=1.16
- libpng=1.6.37
- libstdcxx-ng=9.3.0
- libtiff=4.1.0
- libuv=1.41.0
- llvm-openmp=11.0.1
- lz4-c=1.9.3
- mkl=2020.4
- mkl-service=2.3.0
- mkl_fft=1.3.0
- mkl_random=1.2.0
- ncurses=6.2
- nettle=3.6
- ninja=1.10.2
- numpy=1.19.2
- numpy-base=1.19.2
- olefile=0.46
- openh264=2.1.1
- openssl=1.1.1j
- pillow=8.1.2
- pip=21.0.1
- python=3.8.8
- python_abi=3.8
- pytorch=1.8.0
- readline=8.1
- setuptools=52.0.0
- six=1.15.0
- sqlite=3.33.0
- tk=8.6.10
- torchaudio=0.8.0
- torchvision=0.9.0
- typing_extensions=3.7.4.3
- wheel=0.36.2
- xz=5.2.5
- zlib=1.2.11
- zstd=1.4.9
- pip:
  - attrs==20.3.0
  - clearml==0.17.4
  - furl==2.1.0
  - humanfriendly==9.1
  - jsonschema==3.2.0
  - orderedmultidict==1.0.1
  - pathlib2==2.3.5
  - psutil==5.8.0
  - pyjwt==2.0.1
  - pyrsistent==0.17.3
  - pyyaml==5.4.1
  - requests-file==1.5.1
My driver says "CUDA Version: 11.2" (I am not even sure this is correct, since I do not remember installing CUDA on this machine, but idk) and there is no pytorch build for 11.2, so maybe it falls back to CPU?
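As far as I know, the "CUDA Version" printed by `nvidia-smi` is the highest CUDA runtime the driver supports, not an installed toolkit, so a driver reporting 11.2 should be fine with cudatoolkit 11.1. A rough sketch of that rule (simplified: it only compares major.minor and ignores any finer compatibility rules; `driver_supports` is a made-up helper):

```python
def parse_version(text: str) -> tuple:
    """Turn '11.1.1' into (11, 1, 1)."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_cuda: str, toolkit: str) -> bool:
    """True if the driver's max CUDA version is at least the toolkit's major.minor."""
    return parse_version(driver_cuda)[:2] >= parse_version(toolkit)[:2]


print(driver_supports("11.2", "11.1.1"))  # True
print(driver_supports("11.2", "11.3"))    # False
```

So the missing "pytorch for 11.2" build by itself would not explain a CPU fallback under this rule.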
Hmm, maybe this is the issue:
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
This makes no sense: conda is saying pytorch=1.8 needs cudatoolkit >=10.1,<10.3, but it actually needs cudatoolkit 11.1
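To read the spec in that error: `|` separates alternatives and `,` joins conditions that must all hold. A small sketch of that reading, written from scratch rather than using conda's own matcher (`satisfies` and `parse_version` are hypothetical helpers and only handle plain numeric versions):

```python
import operator

OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text: str, width: int = 3) -> tuple:
    """Turn '10.2' into (10, 2, 0) so tuples compare component-wise."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def clause_ok(version: tuple, clause: str) -> bool:
    """Check one condition such as '>=10.1'."""
    for symbol in (">=", "<=", "==", ">", "<"):  # two-character operators first
        if clause.startswith(symbol):
            return OPS[symbol](version, parse_version(clause[len(symbol):]))
    raise ValueError(f"unsupported clause: {clause!r}")


def satisfies(version: str, spec: str) -> bool:
    """'|' means OR between alternatives, ',' means AND within one."""
    v = parse_version(version)
    return any(
        all(clause_ok(v, c.strip()) for c in alternative.split(","))
        for alternative in spec.split("|")
    )


SPEC = ">=10.1,<10.2|>=10.2,<10.3"
print(satisfies("11.1.1", SPEC))   # False
print(satisfies("10.2.89", SPEC))  # True
```

So the pytorch candidates the solver was looking at only accept cudatoolkit 10.1 or 10.2, which conflicts with the cudatoolkit 11.1.1 pin.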
And how is
Summary - installed python packages:
conda:
....
generated?
Nvm, I took a look at conda history and there I see it
I do not have a global cuda install on this machine. Everything except for the driver is installed via conda.
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit==11.1.1
- pytorch==1.8.0
Gives CPU version
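One thing that differs from the working environment is the channel order: there `pytorch` is listed first, here `defaults` is. Assuming (not confirmed in this thread) that the order is what makes the solver pick a CPU build, a tiny sketch for moving one channel to the front of the list; `prioritize_channel` is a hypothetical helper:

```python
def prioritize_channel(channels, preferred="pytorch"):
    """Return a new list with `preferred` first, keeping the others in order.

    If `preferred` is not in the list it is added at the front.
    """
    rest = [c for c in channels if c != preferred]
    return [preferred] + rest


print(prioritize_channel(["defaults", "conda-forge", "pytorch"]))
# ['pytorch', 'defaults', 'conda-forge']
```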
I just wanna add: I can run this task on the same workstation with the same conda installation just fine.
Would it help you diagnose this problem if I ran conda env create --file=environment.yml and see whether it works?
So I just updated the env that clearml-agent created (and where pytorch cpu is installed) with my local environment.yml, and now the correct version is installed. So most probably the `/tmp/conda_envaz1ne897.yml` is the problem here
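A quick way to confirm which build ended up in an environment is to ask torch itself. A minimal sketch (the `torch_cuda_report` name is made up; it returns early if torch is not importable):

```python
def torch_cuda_report() -> dict:
    """Report whether the installed torch was built with CUDA and can see a GPU."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "torch": torch.__version__,
        # None for CPU-only builds, otherwise a string such as '11.1'
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


print(torch_cuda_report())
```

Running this with the python of the agent-created env and again with the local env should show the CPU build versus the CUDA build directly.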
btw: why is agent.package_manager an agent attribute? Imo it does not make sense, because conda can install pip packages but pip cannot install conda packages, which can lead to install failures, right?
Thanks! Tomorrow is great, I'll put the wheel here 🙂
But I do not have anything linked correctly since I rely on conda installing cuda/cudnn for me
From the log it installed: cudatoolkit==11.1.1
based on the CUDA it found on the host machine: agent.cuda_version = 110
But for some reason it installed pytorch from the conda "pytorch" repo without CUDA support.
okay, I'll make sure we order it correctly
@ReassuredTiger98 if you use the latest RC I sent and run with --debug, you will see the full /tmp/conda_envaz1ne897.yml content in the log
Here it is copied from your log, do you want to see if this one works:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
- jpeg~=9b.0
- kiwisolver~=1.3.1
- lame~=3.100
- lcms2~=2.11
- ld_impl_linux-64~=2.33.1
- libedit~=3.1.20191231
- libffi~=3.3
- libgcc-ng~=9.3.0
- libgfortran-ng~=7.3.0
- libiconv~=1.16
- libpng~=1.6.37
- libstdcxx-ng~=9.3.0
- libtiff~=4.1.0
- libuv~=1.41.0
- llvm-openmp~=11.0.1
- lz4-c~=1.9.3
- matplotlib-base~=3.3.4
- mkl~=2020.4
- mkl-service~=2.3.0
- mkl_fft~=1.3.0
- mkl_random~=1.2.0
- ncurses~=6.2
- nettle~=3.6
- networkx~=2.5
- ninja~=1.10.2
- numpy~=1.19.2
- numpy-base~=1.19.2
- olefile~=0.46
- openh264~=2.1.1
- openssl~=1.1.1j
- pyparsing~=2.4.7
- python~=3.7.10
- python-dateutil~=2.8.1
- python_abi~=3.7
- pytorch~=1.8.0
- pywavelets~=1.1.1
- pyyaml~=5.3.1
- readline~=8.1
- scikit-image~=0.17.2
- scipy~=1.6.1
- setuptools~=52.0.0
- six~=1.15.0
- sqlite~=3.33.0
- tifffile~=2020.10.1
- tk~=8.6.10
- toolz~=0.11.1
- torchaudio~=0.8.0
- torchvision~=0.9.0
- tornado~=6.1
- typing_extensions~=3.7.4.3
- wheel~=0.36.2
- xz~=5.2.5
- yaml~=0.2.5
- zlib~=1.2.11
- zstd~=1.4.9
Yes I think the difference is running conda install with arguments vs conda install with env file...
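To see exactly where the two environment files differ, the dependency lines can be compared entry by entry. A rough sketch with made-up helper names (`parse_deps`, `diff_deps`); it only understands simple `- name=version`, `- name==version` and `- name~=version` lines, and the sample lines are copied from the two files in this thread:

```python
import re

DEP_LINE = re.compile(r"\s*-\s*([A-Za-z0-9_.\-]+?)\s*(~=|==|=)\s*(\S+)")


def parse_deps(lines):
    """Map package name -> (operator, version) for each dependency line."""
    deps = {}
    for line in lines:
        match = DEP_LINE.match(line)
        if match:
            deps[match.group(1)] = (match.group(2), match.group(3))
    return deps


def diff_deps(left, right):
    """Return entries whose operator or version differ (None = missing)."""
    changes = {}
    for name in sorted(set(left) | set(right)):
        a, b = left.get(name), right.get(name)
        if a != b:
            changes[name] = (a, b)
    return changes


local_env = parse_deps(["- python=3.8.8", "- pytorch=1.8.0", "- pillow=8.1.2"])
agent_env = parse_deps(["- python~=3.7.10", "- pytorch~=1.8.0", "- scipy~=1.6.1"])

for name, (a, b) in diff_deps(local_env, agent_env).items():
    print(name, a, b)
```

On the full files this would list the python 3.8.8 vs 3.7.10 difference, the `=` vs `~=` pins, and the packages that appear in only one file.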
Perfect, will try it. fyi: The conda_channels that I used are from clearml-agent init
This is my environment installed from the env file. Training works just fine here: