Answered
Hello! Since Today I Get

Hello!
Since today I get AssertionError: Torch not compiled with CUDA enabled for PyTorch 1.8.
Tasks that I submitted yesterday to the queue are also not working, even though they ran yesterday. PyTorch 1.7 based tasks work fine. Any idea what I could have done wrong?
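A quick way to confirm which build ended up in an environment is to ask PyTorch itself. The following is a minimal sketch, assuming it is run with the interpreter of the environment the agent created; `torch_cuda_report` is a hypothetical helper name, not part of ClearML or PyTorch.

```python
def torch_cuda_report():
    """Return a small dict describing the PyTorch build in this interpreter."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter at all.
        return {
            "installed": False,
            "version": None,
            "built_with_cuda": None,
            "cuda_available": None,
        }
    return {
        "installed": True,
        "version": torch.__version__,
        # torch.version.cuda is None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # True only if a CUDA build finds a usable driver and device.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(torch_cuda_report())
```

If `built_with_cuda` is `None`, a CPU-only build was installed, which would be consistent with the "Torch not compiled with CUDA enabled" assertion.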

  
  
Posted 3 years ago

Answers 161


But here is the funny thing:

channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0

Installs GPU

  
  
Posted 3 years ago

Hurray conda.
Notice it does include cudatoolkit, but conda ignores it:

cudatoolkit~=11.1.1

Can you test the same file, only search and replace ~= with ==?
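The search-and-replace can be scripted. This is only a sketch of the idea, assuming the requirements are available as a list of strings; `pin_exact` is a hypothetical helper name and is not what clearml-agent does internally.

```python
def pin_exact(specs):
    """Replace the compatible-release operator (~=) with an exact pin (==)."""
    return [spec.replace("~=", "==") for spec in specs]


if __name__ == "__main__":
    # Entries without "~=" are returned unchanged.
    print(pin_exact(["cudatoolkit~=11.1.1", "pytorch~=1.8.0", "numpy=1.19.2"]))
    # prints ['cudatoolkit==11.1.1', 'pytorch==1.8.0', 'numpy=1.19.2']
```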

  
  
Posted 3 years ago

Also tried conda version 4.7.12. Same problem.

  
  
Posted 3 years ago

name: core
channels:
  - pytorch
  - anaconda
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - blas=1.0
  - bzip2=1.0.8
  - ca-certificates=2020.10.14
  - certifi=2020.6.20
  - cloudpickle=1.6.0
  - cudatoolkit=11.1.1
  - cycler=0.10.0
  - cytoolz=0.11.0
  - dask-core=2021.2.0
  - decorator=4.4.2
  - ffmpeg=4.3
  - freetype=2.10.4
  - gmp=6.2.1
  - gnutls=3.6.13
  - imageio=2.9.0
  - jpeg=9b
  - kiwisolver=1.3.1
  - lame=3.100
  - lcms2=2.11
  - ld_impl_linux-64=2.33.1
  - libedit=3.1.20191231
  - libffi=3.3
  - libgcc-ng=9.3.0
  - libgfortran-ng=7.3.0
  - libiconv=1.16
  - libpng=1.6.37
  - libstdcxx-ng=9.3.0
  - libtiff=4.1.0
  - libuv=1.41.0
  - llvm-openmp=11.0.1
  - lz4-c=1.9.3
  - matplotlib-base=3.3.4
  - mkl=2020.4
  - mkl-service=2.3.0
  - mkl_fft=1.3.0
  - mkl_random=1.2.0
  - ncurses=6.2
  - nettle=3.6
  - networkx=2.5
  - ninja=1.10.2
  - numpy=1.19.2
  - numpy-base=1.19.2
  - olefile=0.46
  - openh264=2.1.1
  - openssl=1.1.1j
  - pip=21.0.1
  - pyparsing=2.4.7
  - python=3.7.10
  - python-dateutil=2.8.1
  - python_abi=3.7
  - pytorch=1.8.0
  - pywavelets=1.1.1
  - readline=8.1
  - scikit-image=0.17.2
  - scipy=1.6.1
  - setuptools=52.0.0
  - six=1.15.0
  - sqlite=3.33.0
  - tifffile=2020.10.1
  - tk=8.6.10
  - toolz=0.11.1
  - torchaudio=0.8.0
  - torchvision=0.9.0
  - tornado=6.1
  - typing_extensions=3.7.4.3
  - wheel=0.36.2
  - xz=5.2.5
  - yaml=0.2.5
  - zlib=1.2.11
  - zstd=1.4.9
  - pip:
    - aiostream==0.4.2
    - attrs==20.3.0
    - clearml==0.17.4
    - dm-control==0.0.355168290
    - dm-env==1.4
    - furl==2.1.0
    - future==0.18.2
    - glfw==2.1.0
    - gym==0.18.0
    - humanfriendly==9.1
    - imageio-ffmpeg==0.4.3
    - jsonschema==3.2.0
    - labmaze==1.0.3
    - lxml==4.6.2
    - moviepy==1.0.3
    - orderedmultidict==1.0.1
    - pathlib2==2.3.5
    - pillow==7.2.0
    - proglog==0.1.9
    - psutil==5.8.0
    - pybullet==3.0.9
    - pygame==2.0.1
    - pyglet==1.5.0
    - pyjwt==2.0.1
    - pyrsistent==0.17.3
    - requests-file==1.5.1
    - tensorboard==2.4.1
    - tensorboardx==2.1
  
  
Posted 3 years ago

By host you mean the machine on which the agent is running? How does clearml-agent find the cuda_version?

  
  
Posted 3 years ago

This is my environment, installed from the env file. Training works just fine here:

  
  
Posted 3 years ago

Sure, I'll try this.

  
  
Posted 3 years ago

And how is

Summary - installed python packages: 
conda:
....

generated?

  
  
Posted 3 years ago

It is now looking for conflicts.

  
  
Posted 3 years ago

Okay. And 110 means 11.1 and not 11.0?

110 means 11.0, the odd thing is, it actually installed 11.1, and from the pytorch website this is exactly how they suggest to install with conda...
Let me know if forcing the CUDA version changes anything
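To make the encoding concrete, here is a small sketch that turns the compact value into a dotted version. It assumes the last digit is the minor version and everything before it is the major version, which matches "110 means 11.0" above; `decode_cuda_version` is a hypothetical helper, not a clearml-agent function.

```python
def decode_cuda_version(value):
    """Convert a compact CUDA version such as "110" into "11.0".

    Assumption: the final digit is the minor version.
    """
    number = int(value)
    return f"{number // 10}.{number % 10}"


if __name__ == "__main__":
    for raw in ("110", "111", "102"):
        print(raw, "->", decode_cuda_version(raw))
```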

  
  
Posted 3 years ago

But I do not have anything linked correctly since I rely on conda installing cuda/cudnn for me

From the log it installed:
cudatoolkit==11.1.1
based on the CUDA it found on the host machine: agent.cuda_version = 110
But for some reason it installed the pytorch from the conda "pytorch" repo without the cuda support.

  
  
Posted 3 years ago

Uninstall the current clearml-agent and reinstall this wheel, I hacked it to have ==, let's see if that works

  
  
Posted 3 years ago

Perfect! I have to thank you for helping me, not the other way around!

  
  
Posted 3 years ago

Where again does clearml place the venv?

Usually ~/.clearml/venvs-builds/<python version>/
Multiple agents will be venvs-builds.1 and so on

  
  
Posted 3 years ago

Do you know how I can make sure I do not have CUDA or a broken installation installed?

I don't think this is the case, it is quite specifically installing the CPU version.
BTW: after the agent fails it will not remove the venv, so you can get into it and check, from the log it will be in: /home/tim/.clearml/venvs-builds/3.7
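When inspecting the leftover venv, the build string that conda list shows for pytorch usually tells the two variants apart. The sketch below assumes build strings of the form used by the "pytorch" conda channel at the time (for example py3.7_cpu_0 versus py3.7_cuda11.1_cudnn8.0.5_0); `classify_pytorch_build` is a hypothetical helper name.

```python
import re


def classify_pytorch_build(build):
    """Guess from a conda build string whether PyTorch is a CPU or CUDA build."""
    match = re.search(r"cuda(\d+(?:\.\d+)*)", build)
    if match:
        # The captured group is the CUDA version embedded in the build string.
        return "cuda " + match.group(1)
    if "cpu" in build:
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    print(classify_pytorch_build("py3.7_cuda11.1_cudnn8.0.5_0"))
    print(classify_pytorch_build("py3.7_cpu_0"))
```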

  
  
Posted 3 years ago

I can install pytorch just fine locally on the agent, when I do not use clearml(-agent)

My thinking is that the issue might be in the env file we are passing to conda; I can't find any other difference.
BTW:
@ReassuredTiger98 Can I send you a specific wheel with more debug prints to check (basically it will print the conda env YAML it is using)?

  
  
Posted 3 years ago

Let me check

  
  
Posted 3 years ago

I tried to run the task with detect_with_conda_freeze: false instead of true and got

Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json
Pass
Conda: Trying to install requirements:
['pytorch~=1.8.0']
Executing Conda: /home/tim/miniconda3/condabin/conda env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envh7rq4qmc.yml --quiet --json
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):

  - pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']

The following specifications were found to be incompatible with each other:



Package cudatoolkit conflicts for:
cudatoolkit=11.0
Conda: Installing requirements: step 2 - using pip:
['clearml==0.17.4', 'tensorboard==2.4.1', 'pytorch~=1.8.0']
Collecting tensorboard==2.4.1
  Using cached tensorboard-2.4.1-py3-none-any.whl (10.6 MB)
ERROR: Could not find a version that satisfies the requirement pytorch~=1.8.0 (from -r /tmp/cached-reqsubuv0zrf.txt (line 3)) (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch~=1.8.0 (from -r /tmp/cached-reqsubuv0zrf.txt (line 3))
Command 'source /home/tim/miniconda3/etc/profile.d/conda.sh && conda activate /home/tim/.clearml/venvs-builds/3.8 && pip install -r /tmp/cached-reqsubuv0zrf.txt' returned non-zero exit status 1.
  
  
Posted 3 years ago

Like this?

  
  
Posted 3 years ago

@ReassuredTiger98 it works on my machine 😞

  
  
Posted 3 years ago

Tried to install cudatoolkit==11.1 manually in this environment and got:

Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                         

UnsatisfiableError: The following specifications were found to be incompatible with each other:



Package xz conflicts for:
python=3.8 -> xz[version='>=5.2.4,<5.3.0a0|>=5.2.4,<6.0a0|>=5.2.5,<5.3.0a0|>=5.2.5,<6.0a0']
Package libstdcxx-ng conflicts for:
python=3.8 -> libstdcxx-ng[version='>=7.3.0|>=7.5.0|>=9.3.0']
cudatoolkit=11.1 -> libstdcxx-ng[version='>=9.3.0']
Package libgcc-ng conflicts for:
cudatoolkit=11.1 -> libgcc-ng[version='>=9.3.0']
python=3.8 -> libgcc-ng[version='>=7.3.0|>=7.5.0|>=9.3.0']
Package __glibc conflicts for:
cudatoolkit=11.1 -> __glibc[version='>=2.17,<3.0.a0']
Package libffi conflicts for:
python=3.8 -> libffi[version='>=3.2.1,<3.3.0a0|>=3.2.1,<3.3a0|>=3.3,<3.4.0a0']
Package ncurses conflicts for:
python=3.8 -> ncurses[version='>=6.1,<6.3.0a0|>=6.1,<7.0a0|>=6.2,<6.3.0a0|>=6.2,<7.0a0']
Package zlib conflicts for:
python=3.8 -> zlib[version='>=1.2.11,<1.3.0a0']
Package python_abi conflicts for:
python=3.8 -> python_abi[version='*|3.8.*',build=*_cp38]
Package sqlite conflicts for:
python=3.8 -> sqlite[version='>=3.30.0,<4.0a0|>=3.30.1,<4.0a0|>=3.31.1,<4.0a0|>=3.32.3,<4.0a0|>=3.33.0,<4.0a0|>=3.34.0,<4.0a0']
Package bzip2 conflicts for:
python=3.8 -> bzip2[version='>=1.0.8,<2.0a0']
Package readline conflicts for:
python=3.8 -> readline[version='>=7.0,<8.0a0|>=8.0,<9.0a0']
Package openssl conflicts for:
python=3.8 -> openssl[version='>=1.1.1a,<1.1.2a|>=1.1.1d,<1.1.2a|>=1.1.1e,<1.1.2a|>=1.1.1f,<1.1.2a|>=1.1.1g,<1.1.2a|>=1.1.1h,<1.1.2a|>=1.1.1i,<1.1.2a|>=1.1.1j,<1.1.2a']
Package tk conflicts for:
python=3.8 -> tk[version='>=8.6.10,<8.7.0a0|>=8.6.8,<8.7.0a0|>=8.6.9,<8.7.0a0']
Package pip conflicts for:
python=3.8 -> pip
Package ld_impl_linux-64 conflicts for:
python=3.8 -> ld_impl_linux-64[version='>=2.34']

The following specifications were found to be incompatible with your CUDA driver:

  - cudatoolkit=11.1 -> __cuda[version='>=11.1']

Your installed CUDA driver is: 11.2
  
  
Posted 3 years ago

I just started a task from this environment and it fails on the agent.

  
  
Posted 3 years ago

Sure, let's do that 🙂

  
  
Posted 3 years ago

Upgrade back?

  
  
Posted 3 years ago

Installs CPU

  
  
Posted 3 years ago

send me the conda freeze:

# Name                    Version                   Build  Channel
...
  
  
Posted 3 years ago

I just wanna add: I can run this task on the same workstation with the same conda installation just fine.

  
  
Posted 3 years ago

The problem is that clearml installs cudatoolkit=11.0 but cudatoolkit=11.1 is needed.
You suggested this fix earlier, but I am not sure why it didn't work then.

Hmm, could you test with clearml-agent 0.17.2, to make sure this actually solves the problem?

  
  
Posted 3 years ago

So I just updated the env that clearml-agent created (and where the CPU build of pytorch is installed) with my local environment.yml, and now the correct version is installed, so most probably `/tmp/conda_envaz1ne897.yml` is the problem here.

  
  
Posted 3 years ago

And then?

  
  
Posted 3 years ago