Nvm, I took a look at conda history and there I see it
Okay this is very close to what the agent is building:
Could you start a new conda env, then install cudatoolkit=11.1, then run:
conda env update -p <conda_env_path_here> --file the_env_yaml.yml
Quick question: Where again does clearml place the venv? I wanna take a look into it after the task has failed
okay, I'll make sure we order it correctly
name: core
channels:
- pytorch
- anaconda
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- blas=1.0
- bzip2=1.0.8
- ca-certificates=2020.10.14
- certifi=2020.6.20
- cloudpickle=1.6.0
- cudatoolkit=11.1.1
- cycler=0.10.0
- cytoolz=0.11.0
- dask-core=2021.2.0
- decorator=4.4.2
- ffmpeg=4.3
- freetype=2.10.4
- gmp=6.2.1
- gnutls=3.6.13
- imageio=2.9.0
- jpeg=9b
- kiwisolver=1.3.1
- lame=3.100
- lcms2=2.11
- ld_impl_linux-64=2.33.1
- libedit=3.1.20191231
- libffi=3.3
- libgcc-ng=9.3.0
- libgfortran-ng=7.3.0
- libiconv=1.16
- libpng=1.6.37
- libstdcxx-ng=9.3.0
- libtiff=4.1.0
- libuv=1.41.0
- llvm-openmp=11.0.1
- lz4-c=1.9.3
- matplotlib-base=3.3.4
- mkl=2020.4
- mkl-service=2.3.0
- mkl_fft=1.3.0
- mkl_random=1.2.0
- ncurses=6.2
- nettle=3.6
- networkx=2.5
- ninja=1.10.2
- numpy=1.19.2
- numpy-base=1.19.2
- olefile=0.46
- openh264=2.1.1
- openssl=1.1.1j
- pip=21.0.1
- pyparsing=2.4.7
- python=3.7.10
- python-dateutil=2.8.1
- python_abi=3.7
- pytorch=1.8.0
- pywavelets=1.1.1
- readline=8.1
- scikit-image=0.17.2
- scipy=1.6.1
- setuptools=52.0.0
- six=1.15.0
- sqlite=3.33.0
- tifffile=2020.10.1
- tk=8.6.10
- toolz=0.11.1
- torchaudio=0.8.0
- torchvision=0.9.0
- tornado=6.1
- typing_extensions=3.7.4.3
- wheel=0.36.2
- xz=5.2.5
- yaml=0.2.5
- zlib=1.2.11
- zstd=1.4.9
- pip:
  - aiostream==0.4.2
  - attrs==20.3.0
  - clearml==0.17.4
  - dm-control==0.0.355168290
  - dm-env==1.4
  - furl==2.1.0
  - future==0.18.2
  - glfw==2.1.0
  - gym==0.18.0
  - humanfriendly==9.1
  - imageio-ffmpeg==0.4.3
  - jsonschema==3.2.0
  - labmaze==1.0.3
  - lxml==4.6.2
  - moviepy==1.0.3
  - orderedmultidict==1.0.1
  - pathlib2==2.3.5
  - pillow==7.2.0
  - proglog==0.1.9
  - psutil==5.8.0
  - pybullet==3.0.9
  - pygame==2.0.1
  - pyglet==1.5.0
  - pyjwt==2.0.1
  - pyrsistent==0.17.3
  - requests-file==1.5.1
  - tensorboard==2.4.1
  - tensorboardx==2.1
How does clearml-agent create the conda environment?
name: core
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- blas=1.0
- bzip2=1.0.8
- ca-certificates=2020.12.5
- certifi=2020.12.5
- cudatoolkit=11.1.1
- ffmpeg=4.3
- freetype=2.10.4
- gmp=6.2.1
- gnutls=3.6.13
- jpeg=9b
- lame=3.100
- lcms2=2.11
- ld_impl_linux-64=2.33.1
- libedit=3.1.20191231
- libffi=3.3
- libgcc-ng=9.3.0
- libiconv=1.16
- libpng=1.6.37
- libstdcxx-ng=9.3.0
- libtiff=4.1.0
- libuv=1.41.0
- llvm-openmp=11.0.1
- lz4-c=1.9.3
- mkl=2020.4
- mkl-service=2.3.0
- mkl_fft=1.3.0
- mkl_random=1.2.0
- ncurses=6.2
- nettle=3.6
- ninja=1.10.2
- numpy=1.19.2
- numpy-base=1.19.2
- olefile=0.46
- openh264=2.1.1
- openssl=1.1.1j
- pillow=8.1.2
- pip=21.0.1
- python=3.8.8
- python_abi=3.8
- pytorch=1.8.0
- readline=8.1
- setuptools=52.0.0
- six=1.15.0
- sqlite=3.33.0
- tk=8.6.10
- torchaudio=0.8.0
- torchvision=0.9.0
- typing_extensions=3.7.4.3
- wheel=0.36.2
- xz=5.2.5
- zlib=1.2.11
- zstd=1.4.9
- pip:
  - attrs==20.3.0
  - clearml==0.17.4
  - furl==2.1.0
  - humanfriendly==9.1
  - jsonschema==3.2.0
  - orderedmultidict==1.0.1
  - pathlib2==2.3.5
  - psutil==5.8.0
  - pyjwt==2.0.1
  - pyrsistent==0.17.3
  - pyyaml==5.4.1
  - requests-file==1.5.1
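To compare two environment files like the ones above without eyeballing them, something along these lines might help. It is only a rough sketch: `parse_dependencies` and `diff_envs` are names made up for this example, the parser only understands `name=version` and `name==version` entries, and the two sample strings are shortened stand-ins, not the full files.

```python
import re

# Matches list entries such as "- python=3.7.10" or "  - clearml==0.17.4".
# Entries without a version (channels, "- pip:") are skipped on purpose.
_DEP_RE = re.compile(r"^\s*-\s*([A-Za-z0-9_.\-]+)={1,2}([^=\s]+)")


def parse_dependencies(yaml_text):
    """Return {package_name: version} for every pinned entry in the text."""
    deps = {}
    for line in yaml_text.splitlines():
        match = _DEP_RE.match(line)
        if match:
            deps[match.group(1).lower()] = match.group(2)
    return deps


def diff_envs(env_a, env_b):
    """Return (only_in_a, only_in_b, changed) for two parsed environments."""
    only_a = sorted(set(env_a) - set(env_b))
    only_b = sorted(set(env_b) - set(env_a))
    changed = {
        name: (env_a[name], env_b[name])
        for name in sorted(set(env_a) & set(env_b))
        if env_a[name] != env_b[name]
    }
    return only_a, only_b, changed


# Shortened sample inputs for illustration only.
sample_a = """dependencies:
- python=3.7.10
- cudatoolkit=11.1.1
- scipy=1.6.1
- pip:
  - clearml==0.17.4
"""
sample_b = """dependencies:
- python=3.8.8
- cudatoolkit=11.1.1
- pip:
  - clearml==0.17.4
  - pyyaml==5.4.1
"""

only_a, only_b, changed = diff_envs(
    parse_dependencies(sample_a), parse_dependencies(sample_b)
)
print("only in A:", only_a)
print("only in B:", only_b)
print("different versions:", changed)
```

Reading the two real files from disk and passing their contents through the same functions should show at a glance whether `cudatoolkit` and `python` match.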
One question: Does clearml resolve the CUDA Version from driver or conda?
Do you know how I can make sure I do not have CUDA or a broken installation installed?
The problem is that clearml installs cudatoolkit=11.0 but cudatoolkit=11.1 is needed.
You suggested this fix earlier, but I am not sure why it didn't work then.
Hmm, could you test with clearml-agent 0.17.2, to make sure this actually solves the problem?
Hi @<1523701868901961728:profile|ReassuredTiger98> when you get to it...
please download the wheel, then install it with
pip3 install -U clearml_agent-0.17.3rc0-py3-none-any.whl
Then run the daemon with the additional --debug argument, basically:
clearml-agent --debug daemon --foreground ...
Once the agent is running please send the Task's log from your console 🙂
Okay found it 🙂 it returns 11020 instead of 112
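For context on that number: the CUDA driver API reports its version as a single integer, 1000 * major + 10 * minor, so 11020 means CUDA 11.2. A minimal sketch of turning that into the compact form is below. The function names are invented for this example and this is not the agent's actual code.

```python
def cuda_int_to_dotted(version_int):
    """Convert a driver-style integer (1000*major + 10*minor) to 'major.minor'."""
    major = version_int // 1000
    minor = (version_int % 1000) // 10
    return f"{major}.{minor}"


def cuda_int_to_compact(version_int):
    """Same conversion, but without the dot, e.g. 11020 -> '112'."""
    return cuda_int_to_dotted(version_int).replace(".", "")


print(cuda_int_to_dotted(11020))
print(cuda_int_to_compact(11020))
```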
I mean the version which it bases the PyTorch installation on.
I tried to run the task with detect_with_conda_freeze: false instead of true and got
Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json
Pass
Conda: Trying to install requirements:
['pytorch~=1.8.0']
Executing Conda: /home/tim/miniconda3/condabin/conda env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envh7rq4qmc.yml --quiet --json
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
The following specifications were found to be incompatible with each other:
Package cudatoolkit conflicts for:
cudatoolkit=11.0
Conda: Installing requirements: step 2 - using pip:
['clearml==0.17.4', 'tensorboard==2.4.1', 'pytorch~=1.8.0']
Collecting tensorboard==2.4.1
Using cached tensorboard-2.4.1-py3-none-any.whl (10.6 MB)
ERROR: Could not find a version that satisfies the requirement pytorch~=1.8.0 (from -r /tmp/cached-reqsubuv0zrf.txt (line 3)) (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch~=1.8.0 (from -r /tmp/cached-reqsubuv0zrf.txt (line 3))
Command 'source /home/tim/miniconda3/etc/profile.d/conda.sh && conda activate /home/tim/.clearml/venvs-builds/3.8 && pip install -r /tmp/cached-reqsubuv0zrf.txt' returned non-zero exit status 1.
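A note on the pip step in that log: on PyPI, PyTorch is published as `torch`, not `pytorch`, which is why pip only finds the placeholder versions 0.1.2 and 1.0.2. If a conda spec is handed to pip as a fallback, the name has to be translated first. The sketch below shows the idea only; `CONDA_TO_PIP` and `to_pip_requirement` are hypothetical names, not something taken from clearml-agent.

```python
import re

# Hypothetical lookup table; only the pytorch -> torch entry matters for the log above.
CONDA_TO_PIP = {"pytorch": "torch"}

# Split a requirement into its leading package name and whatever follows it.
_SPEC_RE = re.compile(r"^([A-Za-z0-9_.\-]+)(.*)$")


def to_pip_requirement(spec):
    """Rewrite a conda-style spec so pip can resolve it, keeping the version part."""
    match = _SPEC_RE.match(spec.strip())
    if not match:
        return spec
    name, rest = match.groups()
    return CONDA_TO_PIP.get(name.lower(), name) + rest


requirements = ["clearml==0.17.4", "tensorboard==2.4.1", "pytorch~=1.8.0"]
print([to_pip_requirement(r) for r in requirements])
```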
Does clearml resolve the CUDA Version from driver or conda?
Actually it starts with the default CUDA based on the host driver, but when it installs the conda env it takes it from the "installed packages" (i.e. the one you used to execute the code in the first place)
Regarding the link, I could not find the exact version, but this is close enough I guess:
None
From the logs, when run with --foreground, I do not see any conda create command.
The problem is that clearml installs cudatoolkit=11.0 but cudatoolkit=11.1 is needed. By setting agent.cuda_version=11.1 in clearml.conf it uses the correct version and installs fine. With version 11.0 conda will resolve conflicts by installing pytorch cpu-version.
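One way to check which variant actually got installed is to look at the build string in the output of `conda list --export`. A small sketch, assuming the `name=version=build` line format and assuming that CPU-only builds carry `cpu` in the build string (I believe they look like `py3.7_cpu_0`, but please verify on your side); `pytorch_build_kind` is a name made up for this example.

```python
import re


def pytorch_build_kind(export_line):
    """Classify a 'name=version=build' line as 'cuda X.Y', 'cpu' or 'unknown'."""
    parts = export_line.strip().split("=")
    if len(parts) != 3:
        return "unknown"
    build = parts[2]
    cuda = re.search(r"cuda(\d+\.\d+)", build)
    if cuda:
        return f"cuda {cuda.group(1)}"
    if "cpu" in build:
        # Assumption: CPU-only builds are marked with "cpu" in the build string.
        return "cpu"
    return "unknown"


print(pytorch_build_kind("pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0"))
```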
I guess that has nothing to do with the diff version, right?
Type "help", "copyright", "credits" or "license" for more information.
>>> from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
>>> get_driver_cuda_version()
'110'
@<1523701868901961728:profile|ReassuredTiger98> if you use the latest RC I sent and run with --debug, you will see the full /tmp/conda_envaz1ne897.yml content in the log
Here it is copied from your log, do you want to see if this one works:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
- jpeg~=9b.0
- kiwisolver~=1.3.1
- lame~=3.100
- lcms2~=2.11
- ld_impl_linux-64~=2.33.1
- libedit~=3.1.20191231
- libffi~=3.3
- libgcc-ng~=9.3.0
- libgfortran-ng~=7.3.0
- libiconv~=1.16
- libpng~=1.6.37
- libstdcxx-ng~=9.3.0
- libtiff~=4.1.0
- libuv~=1.41.0
- llvm-openmp~=11.0.1
- lz4-c~=1.9.3
- matplotlib-base~=3.3.4
- mkl~=2020.4
- mkl-service~=2.3.0
- mkl_fft~=1.3.0
- mkl_random~=1.2.0
- ncurses~=6.2
- nettle~=3.6
- networkx~=2.5
- ninja~=1.10.2
- numpy~=1.19.2
- numpy-base~=1.19.2
- olefile~=0.46
- openh264~=2.1.1
- openssl~=1.1.1j
- pyparsing~=2.4.7
- python~=3.7.10
- python-dateutil~=2.8.1
- python_abi~=3.7
- pytorch~=1.8.0
- pywavelets~=1.1.1
- pyyaml~=5.3.1
- readline~=8.1
- scikit-image~=0.17.2
- scipy~=1.6.1
- setuptools~=52.0.0
- six~=1.15.0
- sqlite~=3.33.0
- tifffile~=2020.10.1
- tk~=8.6.10
- toolz~=0.11.1
- torchaudio~=0.8.0
- torchvision~=0.9.0
- tornado~=6.1
- typing_extensions~=3.7.4.3
- wheel~=0.36.2
- xz~=5.2.5
- yaml~=0.2.5
- zlib~=1.2.11
- zstd~=1.4.9
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=conda_forge
_openmp_mutex=4.5=1_llvm
absl-py=0.12.0=pypi_0
aiostream=0.4.2=pypi_0
attrs=20.3.0=pypi_0
blas=1.0=mkl
bzip2=1.0.8=h7b6447c_0
ca-certificates=2020.10.14=0
cached-property=1.5.2=pypi_0
cachetools=4.2.1=pypi_0
certifi=2020.6.20=py37_0
chardet=4.0.0=pypi_0
clearml=0.17.4=pypi_0
cloudpickle=1.6.0=py_0
cudatoolkit=11.1.1=h6406543_8
cycler=0.10.0=py37_0
cytoolz=0.11.0=py37h7b6447c_0
dask-core=2021.2.0=pyhd8ed1ab_0
decorator=4.4.2=py_0
dm-control=0.0.355168290=pypi_0
dm-env=1.4=pypi_0
dm-tree=0.1.5=pypi_0
ffmpeg=4.3=hf484d3e_0
freetype=2.10.4=h5ab3b9f_0
furl=2.1.0=pypi_0
future=0.18.2=pypi_0
glfw=2.1.0=pypi_0
gmp=6.2.1=h58526e2_0
gnutls=3.6.13=h85f3911_1
google-auth=1.27.1=pypi_0
google-auth-oauthlib=0.4.3=pypi_0
grpcio=1.36.1=pypi_0
gym=0.18.0=pypi_0
h5py=3.2.1=pypi_0
humanfriendly=9.1=pypi_0
idna=2.10=pypi_0
imageio=2.9.0=py_0
imageio-ffmpeg=0.4.3=pypi_0
importlib-metadata=3.7.2=pypi_0
jpeg=9b=habf39ab_1
jsonschema=3.2.0=pypi_0
kiwisolver=1.3.1=py37h2527ec5_1
labmaze=1.0.3=pypi_0
lame=3.100=h7b6447c_0
lcms2=2.11=h396b838_0
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20191231=h14c3975_1
libffi=3.3=he6710b0_2
libgcc-ng=9.3.0=h2828fa1_18
libgfortran-ng=7.3.0=hdf63c60_0
libgomp=9.3.0=h2828fa1_18
libiconv=1.16=h516909a_0
libpng=1.6.37=hbc83047_0
libstdcxx-ng=9.3.0=h6de172a_18
libtiff=4.1.0=h2733197_1
libuv=1.41.0=h7f98852_0
llvm-openmp=11.0.1=h4bd325d_0
lxml=4.6.2=pypi_0
lz4-c=1.9.3=h9c3ff4c_0
markdown=3.3.4=pypi_0
matplotlib-base=3.3.4=py37h0c9df89_0
mkl=2020.4=h726a3e6_304
mkl-service=2.3.0=py37h8f50634_2
mkl_fft=1.3.0=py37h902c9e0_1
mkl_random=1.2.0=py37h9fdb41a_1
moviepy=1.0.3=pypi_0
ncurses=6.2=he6710b0_1
nettle=3.6=he412f7d_0
networkx=2.5=py_0
ninja=1.10.2=h4bd325d_0
numpy=1.19.2=py37h54aff64_0
numpy-base=1.19.2=py37hfa32c7d_0
oauthlib=3.1.0=pypi_0
olefile=0.46=py37_0
openh264=2.1.1=h780b84a_0
openssl=1.1.1j=h7f98852_0
orderedmultidict=1.0.1=pypi_0
pathlib2=2.3.5=pypi_0
pillow=7.2.0=pypi_0
pip=21.0.1=pyhd8ed1ab_0
proglog=0.1.9=pypi_0
protobuf=3.15.5=pypi_0
psutil=5.8.0=pypi_0
pyasn1=0.4.8=pypi_0
pyasn1-modules=0.2.8=pypi_0
pybullet=3.0.9=pypi_0
pygame=2.0.1=pypi_0
pyglet=1.5.0=pypi_0
pyjwt=2.0.1=pypi_0
pyopengl=3.1.5=pypi_0
pyparsing=2.4.7=py_0
pyrsistent=0.17.3=pypi_0
python=3.7.10=hdb3f193_0
python-dateutil=2.8.1=py_0
python_abi=3.7=1_cp37m
pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0
pywavelets=1.1.1=py37h7b6447c_2
pyyaml=5.3.1=py37h7b6447c_1
readline=8.1=h27cfd23_0
requests=2.25.1=pypi_0
requests-file=1.5.1=pypi_0
requests-oauthlib=1.3.0=pypi_0
rsa=4.7.2=pypi_0
scikit-image=0.17.2=py37hdf5156a_0
scipy=1.6.1=py37h91f5cce_0
setuptools=52.0.0=py37h06a4308_0
six=1.15.0=py_0
sqlite=3.33.0=h62c20be_0
tensorboard=2.4.1=pypi_0
tensorboard-plugin-wit=1.8.0=pypi_0
tensorboardx=2.1=pypi_0
tifffile=2020.10.1=py37hdd07704_2
tk=8.6.10=hbc83047_0
toolz=0.11.1=py_0
torchaudio=0.8.0=py37
torchvision=0.9.0=py37_cu111
tornado=6.1=py37h5e8e339_1
tqdm=4.59.0=pypi_0
typing_extensions=3.7.4.3=py_0
urllib3=1.26.3=pypi_0
werkzeug=1.0.1=pypi_0
wheel=0.36.2=pyhd3deb0d_0
xz=5.2.5=h7b6447c_0
yaml=0.2.5=h7b6447c_0
zipp=3.4.1=pypi_0
zlib=1.2.11=h7b6447c_3
zstd=1.4.9=ha95c52a_0