It's always preferred to use conda_freeze: false.
That said, if you do use conda_freeze: true, it should also freeze the cudatoolkit, so it should have worked.
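For reference, a minimal sketch of where that flag would live, assuming conda_freeze maps to the detect_with_conda_freeze setting in clearml.conf:
# clearml.conf on the machine that originally creates the task
sdk {
  development {
    # false: store pip-style requirements; true: store the full conda freeze
    detect_with_conda_freeze: false
  }
}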
BTW, when you say it worked, was that the 0.17.2 version or the hacked RC I sent?
drwxr-xr-x 10 root root 4096 Jul 31 2020 .
drwxr-xr-x 14 root root 4096 Jul 31 2020 ..
drwxr-xr-x 2 root root 4096 Feb 4 13:52 bin
drwxr-xr-x 2 root root 4096 Jul 31 2020 etc
drwxr-xr-x 2 root root 4096 Jul 31 2020 games
drwxr-xr-x 2 root root 4096 Jul 31 2020 include
drwxr-xr-x 4 root root 4096 Feb 3 13:40 lib
lrwxrwxrwx 1 root root 9 Dec 10 14:29 man -> share/man
drwxr-xr-x 2 root root 4096 Jul 31 2020 sbin
drwxr-xr-x 7 root root 4096 Jul 31 2020 share
drwxr-xr-x 2 root root 4096 Jul 31 2020 src
(This is why we recommend using pip, because it is stable and clearml-agent takes care of the pytorch/cuda versions.)
Do you know how I can make sure I do not have a global CUDA install or a broken installation on this machine?
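A few quick checks for that (just a sketch; nvcc and /usr/local/cuda only exist for system-wide installs):
# any system-wide CUDA toolkit? should print nothing on a conda-only machine
ls /usr/local | grep -i cuda
which nvcc || echo "no global nvcc"
# what the driver reports (independent of any toolkit install)
nvidia-smi
# what pytorch inside the active env actually sees
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"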
BTW: I also tested clearml-agent running on a different machine with Python 3.8, and I get the same problems.
conda env update -p .clearml/venvs-builds/3.8 -f ./environment.yml
with this environment.yml:
name: clearml
channels:
- pytorch
- anaconda
- conda-forge
- defaults
dependencies:
- pytorch==1.8.0
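After the update, one way to check which build actually landed in the agent's env (path taken from the command above):
# a cuda build shows a build string like py3.8_cuda11.1_cudnn8.0.5_0,
# a cpu build shows something like py3.8_cpu_0
conda list -p .clearml/venvs-builds/3.8 "pytorch|cudatoolkit"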
Hi @<1523701868901961728:profile|ReassuredTiger98>
Could you send the full log ? Also what's the clearml-agent version?
Quick question: where again does clearml place the venv? I want to take a look at it after the task has failed.
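For reference, the default location is ~/.clearml/venvs-builds (one sub-folder per Python version), unless agent.venvs_dir is overridden in clearml.conf:
ls ~/.clearml/venvs-builds/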
I just started a task from this environment and it fails on the agent.
fyi: NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2
What's the difference between the two env files?
Hi @<1523701868901961728:profile|ReassuredTiger98>
This should have worked; it seems like conda is not fetching the correct pytorch version (even though the conda env contains the CUDA version you specified).
Let's try something: reset the Task, then edit the "Installed packages" section and add:
cudatoolkit==11.1.1
Then try again.
Let's see what we get.
(The idea is that I think conda forgets it just installed cudatoolkit and assumes the env is CPU-only.)
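As a sketch, the edited "Installed packages" list would then contain something like (versions taken from the env file above):
cudatoolkit==11.1.1
pytorch==1.8.0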
Could you try to do:
CUDA_VERSION="11.1" clearml-agent ...
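For example, a hypothetical full invocation (queue name is just a placeholder):
CUDA_VERSION="11.1" clearml-agent daemon --queue default --foreground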
Mhhm, now conda env creation takes forever, since it probably resolves conflicts. At least that is what happened when I tried to manually install my environment.
I do not have a global CUDA install on this machine. Everything except for the driver is installed via conda.
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit==11.1.1
- pytorch==1.8.0
Gives CPU version
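One hedged idea worth trying: put the pytorch channel first and pin the CUDA build string, so conda cannot silently fall back to the cpu build (build-string glob assumed from the pytorch channel's naming scheme):
name: clearml
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.1
  - pytorch=1.8.0=*cuda11.1*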
I just want to add: I can run this task on the same workstation with the same conda installation just fine.
Would it help you diagnose this problem if I ran conda env create --file=environment.yml to see whether it works?
So I just updated the env that clearml-agent created (and where the CPU pytorch is installed) with my local environment.yml, and now the correct version is installed. So most probably the `/tmp/conda_envaz1ne897.yml` is the problem here.
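A quick way to confirm which build ends up in an env (torch.version.cuda is None for the cpu-only build):
python -c "import torch; print(torch.__version__, torch.version.cuda)"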