But I do not have anything linked correctly, since I rely on conda to install cuda/cudnn for me
I just ran the environment setup steps that clearml-agent performs, locally, but with my environment.yml instead of the one that clearml generates.
Can you actually reproduce my problem when also using conda_freeze: true?
My driver says "CUDA Version: 11.2" (I am not even sure this is correct, since I do not remember installing CUDA on this machine, but idk), and there is no pytorch for 11.2, so maybe it falls back to CPU?
For some reason it detects CUDA 11.1 (I assume this is what you have installed; the driver's CUDA version is the highest version it will support, not necessarily what you have installed)
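To make the distinction concrete, here is a minimal sketch of the check implied above: the version printed by nvidia-smi is treated as an upper bound, and any installed cudatoolkit at or below it is considered usable. The helper names `parse_version` and `driver_supports` are made up for illustration and are not part of clearml-agent; the rule is also deliberately conservative and ignores CUDA's minor-version compatibility.

```python
def parse_version(text):
    """Turn '11.2' or '11.0.221' into a (major, minor) tuple of ints."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def driver_supports(driver_max, toolkit):
    """True if the toolkit version does not exceed what the driver reports."""
    return parse_version(toolkit) <= parse_version(driver_max)


# Driver reports 11.2; conda installed cudatoolkit 11.1.1 or 11.0.221.
print(driver_supports("11.2", "11.1.1"))
print(driver_supports("11.2", "11.0.221"))
```

With a driver that reports 11.2, both toolkit versions that appear in this thread pass the check, so the driver itself should not be what forces a CPU build.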
ca-certificates 2021.1.19 h06a4308_1
certifi 2020.12.5 py38h06a4308_0
cudatoolkit 11.0.221 h6bb024c_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses 6.2 he6710b0_1
openssl 1.1.1j h27cfd23_0
pip 20.0.2 py38_1 conda-forge
python 3.8.8 hdb3f193_4
readline 8.1 h27cfd23_0
setuptools 52.0.0 py38h06a4308_0
sqlite 3.33.0 h62c20be_0
tk 8.6.10 hbc83047_0
wheel 0.36.2 pyhd3eb1b0_0
xz 5.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
Quick question: Where again does clearml place the venv? I wanna take a look into it after the task has failed
Yes I think the difference is running conda install with arguments vs conda install with env file...
name: core
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- blas=1.0
- bzip2=1.0.8
- ca-certificates=2020.12.5
- certifi=2020.12.5
- cudatoolkit=11.1.1
- ffmpeg=4.3
- freetype=2.10.4
- gmp=6.2.1
- gnutls=3.6.13
- jpeg=9b
- lame=3.100
- lcms2=2.11
- ld_impl_linux-64=2.33.1
- libedit=3.1.20191231
- libffi=3.3
- libgcc-ng=9.3.0
- libiconv=1.16
- libpng=1.6.37
- libstdcxx-ng=9.3.0
- libtiff=4.1.0
- libuv=1.41.0
- llvm-openmp=11.0.1
- lz4-c=1.9.3
- mkl=2020.4
- mkl-service=2.3.0
- mkl_fft=1.3.0
- mkl_random=1.2.0
- ncurses=6.2
- nettle=3.6
- ninja=1.10.2
- numpy=1.19.2
- numpy-base=1.19.2
- olefile=0.46
- openh264=2.1.1
- openssl=1.1.1j
- pillow=8.1.2
- pip=21.0.1
- python=3.8.8
- python_abi=3.8
- pytorch=1.8.0
- readline=8.1
- setuptools=52.0.0
- six=1.15.0
- sqlite=3.33.0
- tk=8.6.10
- torchaudio=0.8.0
- torchvision=0.9.0
- typing_extensions=3.7.4.3
- wheel=0.36.2
- xz=5.2.5
- zlib=1.2.11
- zstd=1.4.9
- pip:
  - attrs==20.3.0
  - clearml==0.17.4
  - furl==2.1.0
  - humanfriendly==9.1
  - jsonschema==3.2.0
  - orderedmultidict==1.0.1
  - pathlib2==2.3.5
  - psutil==5.8.0
  - pyjwt==2.0.1
  - pyrsistent==0.17.3
  - pyyaml==5.4.1
  - requests-file==1.5.1
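For comparing this file against what the agent ends up installing, a small sketch like the following pulls the conda pins out of a `dependencies` list. It assumes the YAML has already been loaded into plain Python objects; `conda_pins` is a hypothetical helper, not a clearml or conda function.

```python
def conda_pins(dependencies):
    """Map package name -> pinned version for the conda entries.

    Nested sections such as {"pip": [...]} are skipped, and a trailing
    build string ("name=version=build") is ignored.
    """
    pins = {}
    for entry in dependencies:
        if not isinstance(entry, str):
            continue  # e.g. the nested "pip:" section
        parts = entry.split("=")
        pins[parts[0]] = parts[1] if len(parts) > 1 else ""
    return pins


# A few entries copied from the environment file above.
deps = [
    "cudatoolkit=11.1.1",
    "pytorch=1.8.0",
    "python=3.8.8",
    {"pip": ["clearml==0.17.4"]},
]
pins = conda_pins(deps)
print(pins["cudatoolkit"], pins["pytorch"])
```

Running the same function over the agent-generated file would show directly whether the cudatoolkit pin differs (11.1.1 here versus 11.0.221 in the listing from the failed environment).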
It asks the driver, or finds the CUDA dll/so
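As a rough illustration of "asking the driver", the sketch below loads the CUDA driver library with ctypes and calls `cuDriverGetVersion`, which is part of the CUDA driver API. The library names tried and the function name `driver_cuda_version` are assumptions for this example; this is not necessarily how clearml-agent performs the lookup.

```python
import ctypes


def driver_cuda_version():
    """Return the raw integer from cuDriverGetVersion, or None if unavailable."""
    # Candidate library names are an assumption (Linux first, then Windows).
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue  # library not present under this name
        version = ctypes.c_int(0)
        # A return code of 0 (CUDA_SUCCESS) means the value was written.
        if lib.cuDriverGetVersion(ctypes.byref(version)) == 0:
            return version.value
    return None


raw = driver_cuda_version()
print(raw)
```

On a machine without an NVIDIA driver the function simply returns None; with a driver present it returns an integer that encodes the highest CUDA version the driver supports.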
I just started a task from this environment and it fails on the agent.
Okay found it 🙂 it returns 11020 instead of 112
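For context on the two numbers: the CUDA driver API encodes a version as 1000 * major + 10 * minor, so 11020 means 11.2, while 112 is the compact major-plus-minor form. A minimal sketch of the conversion, with hypothetical helper names, could look like this:

```python
def decode_cuda_version(raw):
    """Split a CUDA API version integer (1000*major + 10*minor) into parts."""
    major = raw // 1000
    minor = (raw % 1000) // 10
    return major, minor


def compact_cuda_version(raw):
    """Format the version as concatenated major and minor digits."""
    major, minor = decode_cuda_version(raw)
    return f"{major}{minor}"


print(decode_cuda_version(11020))   # (11, 2)
print(compact_cuda_version(11020))  # 112
```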
Or there should be an early error for trying to run conda based tasks on pip agents
No problem! I profit so much from clearml 🙂
One question: Does clearml resolve the CUDA Version from driver or conda?
I do not have a global cuda install on this machine. Everything except for the driver is installed via conda.
Thanks! Tomorrow is great, I'll put the wheel here 🙂
@ReassuredTiger98 what are you getting with:
nvidia-smi
And here:
ls -la /usr/local/
send me the conda freeze:
# Name Version Build Channel
...
It's always preferred to use conda_freeze: false. That said, if you do use conda_freeze: true, it should also freeze the cudatoolkit, so it should have worked.
BTW when you say it worked, is it version 0.17.2 or the hacked RC I sent?
So just a short update for today: I have not yet started a run with conda 4.7.12.
But one question: conda actually cannot be at fault here, right? I can install pytorch just fine locally on the agent when I do not use clearml(-agent)
btw: why is agent.package_manager an agent attribute? Imo it does not make sense, because conda can install pip packages but pip cannot install conda packages, which can lead to install failures, right?