One question: Does clearml resolve the CUDA Version from driver or conda?
I guess that has nothing to do with the diff version, right?
One more thing: The cuda_version that clearml finds automatically is wrong.
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=conda_forge
_openmp_mutex=4.5=1_llvm
absl-py=0.12.0=pypi_0
aiostream=0.4.2=pypi_0
attrs=20.3.0=pypi_0
blas=1.0=mkl
bzip2=1.0.8=h7b6447c_0
ca-certificates=2020.10.14=0
cached-property=1.5.2=pypi_0
cachetools=4.2.1=pypi_0
certifi=2020.6.20=py37_0
chardet=4.0.0=pypi_0
clearml=0.17.4=pypi_0
cloudpickle=1.6.0=py_0
cudatoolkit=11.1.1=h6406543_8
cycler=0.10.0=py37_0
cytoolz=0.11.0=py37h7b6447c_0
dask-core=2021.2.0=pyhd8ed1ab_0
decorator=4.4.2=py_0
dm-control=0.0.355168290=pypi_0
dm-env=1.4=pypi_0
dm-tree=0.1.5=pypi_0
ffmpeg=4.3=hf484d3e_0
freetype=2.10.4=h5ab3b9f_0
furl=2.1.0=pypi_0
future=0.18.2=pypi_0
glfw=2.1.0=pypi_0
gmp=6.2.1=h58526e2_0
gnutls=3.6.13=h85f3911_1
google-auth=1.27.1=pypi_0
google-auth-oauthlib=0.4.3=pypi_0
grpcio=1.36.1=pypi_0
gym=0.18.0=pypi_0
h5py=3.2.1=pypi_0
humanfriendly=9.1=pypi_0
idna=2.10=pypi_0
imageio=2.9.0=py_0
imageio-ffmpeg=0.4.3=pypi_0
importlib-metadata=3.7.2=pypi_0
jpeg=9b=habf39ab_1
jsonschema=3.2.0=pypi_0
kiwisolver=1.3.1=py37h2527ec5_1
labmaze=1.0.3=pypi_0
lame=3.100=h7b6447c_0
lcms2=2.11=h396b838_0
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20191231=h14c3975_1
libffi=3.3=he6710b0_2
libgcc-ng=9.3.0=h2828fa1_18
libgfortran-ng=7.3.0=hdf63c60_0
libgomp=9.3.0=h2828fa1_18
libiconv=1.16=h516909a_0
libpng=1.6.37=hbc83047_0
libstdcxx-ng=9.3.0=h6de172a_18
libtiff=4.1.0=h2733197_1
libuv=1.41.0=h7f98852_0
llvm-openmp=11.0.1=h4bd325d_0
lxml=4.6.2=pypi_0
lz4-c=1.9.3=h9c3ff4c_0
markdown=3.3.4=pypi_0
matplotlib-base=3.3.4=py37h0c9df89_0
mkl=2020.4=h726a3e6_304
mkl-service=2.3.0=py37h8f50634_2
mkl_fft=1.3.0=py37h902c9e0_1
mkl_random=1.2.0=py37h9fdb41a_1
moviepy=1.0.3=pypi_0
ncurses=6.2=he6710b0_1
nettle=3.6=he412f7d_0
networkx=2.5=py_0
ninja=1.10.2=h4bd325d_0
numpy=1.19.2=py37h54aff64_0
numpy-base=1.19.2=py37hfa32c7d_0
oauthlib=3.1.0=pypi_0
olefile=0.46=py37_0
openh264=2.1.1=h780b84a_0
openssl=1.1.1j=h7f98852_0
orderedmultidict=1.0.1=pypi_0
pathlib2=2.3.5=pypi_0
pillow=7.2.0=pypi_0
pip=21.0.1=pyhd8ed1ab_0
proglog=0.1.9=pypi_0
protobuf=3.15.5=pypi_0
psutil=5.8.0=pypi_0
pyasn1=0.4.8=pypi_0
pyasn1-modules=0.2.8=pypi_0
pybullet=3.0.9=pypi_0
pygame=2.0.1=pypi_0
pyglet=1.5.0=pypi_0
pyjwt=2.0.1=pypi_0
pyopengl=3.1.5=pypi_0
pyparsing=2.4.7=py_0
pyrsistent=0.17.3=pypi_0
python=3.7.10=hdb3f193_0
python-dateutil=2.8.1=py_0
python_abi=3.7=1_cp37m
pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0
pywavelets=1.1.1=py37h7b6447c_2
pyyaml=5.3.1=py37h7b6447c_1
readline=8.1=h27cfd23_0
requests=2.25.1=pypi_0
requests-file=1.5.1=pypi_0
requests-oauthlib=1.3.0=pypi_0
rsa=4.7.2=pypi_0
scikit-image=0.17.2=py37hdf5156a_0
scipy=1.6.1=py37h91f5cce_0
setuptools=52.0.0=py37h06a4308_0
six=1.15.0=py_0
sqlite=3.33.0=h62c20be_0
tensorboard=2.4.1=pypi_0
tensorboard-plugin-wit=1.8.0=pypi_0
tensorboardx=2.1=pypi_0
tifffile=2020.10.1=py37hdd07704_2
tk=8.6.10=hbc83047_0
toolz=0.11.1=py_0
torchaudio=0.8.0=py37
torchvision=0.9.0=py37_cu111
tornado=6.1=py37h5e8e339_1
tqdm=4.59.0=pypi_0
typing_extensions=3.7.4.3=py_0
urllib3=1.26.3=pypi_0
werkzeug=1.0.1=pypi_0
wheel=0.36.2=pyhd3deb0d_0
xz=5.2.5=h7b6447c_0
yaml=0.2.5=h7b6447c_0
zipp=3.4.1=pypi_0
zlib=1.2.11=h7b6447c_3
zstd=1.4.9=ha95c52a_0
ca-certificates 2021.1.19 h06a4308_1
certifi 2020.12.5 py38h06a4308_0
cudatoolkit 11.0.221 h6bb024c_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses 6.2 he6710b0_1
openssl 1.1.1j h27cfd23_0
pip 20.0.2 py38_1 conda-forge
python 3.8.8 hdb3f193_4
readline 8.1 h27cfd23_0
setuptools 52.0.0 py38h06a4308_0
sqlite 3.33.0 h62c20be_0
tk 8.6.10 hbc83047_0
wheel 0.36.2 pyhd3eb1b0_0
xz 5.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
Hmm, you are correct
Which means this is some conda issue, basically when installing from env file, conda is not resolving the correct pytorch version 😞
Not sure why... Could you try to upgrade conda?
Does clearml resolve the CUDA Version from driver or conda?
Actually it starts with the default CUDA based on the host driver, but when it installs the conda env it takes it from the "installed packages" (i.e. the one you used to execute the code in the first place)
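To make that order concrete, here is a minimal sketch of the "installed packages first, driver second" idea. This is not clearml-agent's actual code; the function names `cuda_from_conda_export` and `resolve_cuda_version` are made up for illustration, and it only assumes the package list looks like the `conda list` / `conda list --export` output pasted in this thread.

```python
from typing import Iterable, Optional


def cuda_from_conda_export(lines: Iterable[str]) -> Optional[str]:
    """Return the major.minor CUDA version of a cudatoolkit entry, if one is listed."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the export header comments
        # `conda list --export` uses name=version=build; plain `conda list` uses whitespace.
        parts = line.split("=") if "=" in line else line.split()
        if len(parts) >= 2 and parts[0] == "cudatoolkit":
            return ".".join(parts[1].split(".")[:2])
    return None


def resolve_cuda_version(driver_cuda: Optional[str],
                         package_lines: Iterable[str]) -> Optional[str]:
    """Hypothetical resolution order: package list first, driver default otherwise."""
    return cuda_from_conda_export(package_lines) or driver_cuda
```

With the first list above (`cudatoolkit=11.1.1=h6406543_8`) and a driver reporting 11.2, this sketch would pick 11.1; with no `cudatoolkit` entry it would fall back to the driver's 11.2.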
Regarding the link, I could not find the exact version, but this is close enough I guess:
None
Perfect, will try it. fyi: The conda_channels that I used are from clearml-agent init
But I do not have anything linked correctly since I rely on conda installing cuda/cudnn for me
Can you actually reproduce my problem when also using conda_freeze: true?
Would it help you diagnose this problem if I ran conda env create --file=environment.yml and saw whether it works?
I installed my local conda environment from an environment.yml without issues, so maybe clearml makes some changes that lead to conflicts, which finally lead to the cpu-version install.
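One quick way to check which flavor actually got installed is to look at the build string of the `pytorch` line in `conda list --export`. A rough sketch, assuming builds from the pytorch channel carry either `cuda<version>` or `cpu` in the build string (as in `py3.7_cuda11.1_cudnn8.0.5_0` above); the helper name `pytorch_build_flavor` is hypothetical:

```python
import re
from typing import Optional


def pytorch_build_flavor(export_line: str) -> Optional[str]:
    """Classify a `name=version=build` pytorch line as 'cuda<ver>', 'cpu', or None."""
    parts = export_line.strip().split("=")
    if len(parts) != 3 or parts[0] != "pytorch":
        return None  # not a pytorch entry in export format
    build = parts[2]
    cuda = re.search(r"cuda(\d+\.\d+)", build)
    if cuda:
        return "cuda" + cuda.group(1)
    if "cpu" in build:
        return "cpu"
    return None  # build string does not say either way
```

Running it on the `pytorch` line from the agent's environment and from the local environment would show whether the two really differ in CUDA vs cpu build.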
fyi: NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2
It asks the driver or finds the cuda dll/so
I mean the version which it bases the PyTorch installation on.
@<1523701868901961728:profile|ReassuredTiger98> what are you getting with:
nvidia-smi
And here:
ls -la /usr/local/
Hi @<1523701868901961728:profile|ReassuredTiger98>
Could you send the full log? Also, what's the clearml-agent version?
My driver says "CUDA Version: 11.2" (I am not even sure this is correct, since I do not remember installing code on this machine, but idk) and there is no pytorch for 11.2, so maybe it falls back to cpu?
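For context, the "CUDA Version" in the nvidia-smi header is the highest CUDA version the driver supports, not a toolkit installed on the machine, so seeing 11.2 without ever installing CUDA is expected. A small sketch for pulling that value out of the header text; the helper name `driver_cuda_version` is made up, and how clearml-agent actually queries the driver is not shown here:

```python
import re
from typing import Optional

# Matches e.g. "CUDA Version: 11.2" in the nvidia-smi header line.
_CUDA_RE = re.compile(r"CUDA Version:\s*(\d+)\.(\d+)")


def driver_cuda_version(nvidia_smi_output: str) -> Optional[str]:
    """Return 'major.minor' from nvidia-smi header text, or None if it is absent."""
    match = _CUDA_RE.search(nvidia_smi_output)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


header = "NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2"
print(driver_cuda_version(header))
```

If the package resolution is keyed on that value, a missing pytorch build for 11.2 could plausibly explain a cpu fallback, but that is a guess until the full log is available.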
Just a short update for today: I have not yet started a run with conda 4.7.12.
But one question: Actually conda cannot be at fault here, right? I can install pytorch just fine locally on the agent when I do not use clearml(-agent)
Do you know how I can get this version?