
Hi, in one of my agents with CUDA Version: 11.1 (from nvidia-smi), clearml-agent 0.17.1 detects version 100 (I can see from the experiment logs: agent.cuda_version = 100). Then it downloads wheels according to this wrong version:

... Package(s) not found: torch
Warning, could not locate PyTorch torch>=1.7 matching CUDA version 100, best candidate 1.0.0
Torch CUDA 92 download page found
Trying PyTorch CUDA version 92 support
Warning, could not locate PyTorch torch>=1.7 matching CUDA version 92, best candidate 1.0.0
Found PyTorch version torch>=1.7 matching CUDA version 92
Collecting torch==1.7.1+cu92
Downloading (577.3 MB)
Saved ./.clearml/pip-download-cache/cu100/torch-1.7.1+cu92-cp36-cp36m-linux_x86_64.whl
Successfully downloaded torch ...

I can probably fix that by hardcoding agent.cuda_version = 110 in clearml.conf, right? Is there something to fix in the agent?
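Something like this is what I have in mind for clearml.conf (just a sketch, I haven't verified the exact syntax):

```
# clearml.conf - sketch of the override discussed above (syntax not verified)
agent {
    # force the CUDA version the agent assumes, instead of the auto-detected 100
    cuda_version: 110
}
```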

  
  
Posted 3 years ago

Answers 30


AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent mentioned uses the output from nvcc (2) before checking the nvidia driver version (1)

  
  
Posted 3 years ago

I am still confused though - on the Get Started page of the PyTorch website, when choosing "conda" the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (cuda runtime)?

  
  
Posted 3 years ago

From my experience, I only installed cuda drivers on my machines. I didn't use conda to install torch or cudatoolkit, I just let clearml-agent download the torch wheel file and install it

  
  
Posted 3 years ago

ExcitedFish86 I have several machines with different cuda driver/runtime versions, that is why you might be confused, as I am referring to one or the other 🙂

  
  
Posted 3 years ago

and with this setup I can use the GPU without any problem, meaning that the wheel does contain the cuda runtime
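For what it's worth, a quick way to see what the wheel itself ships (standard torch attributes, shown here just as a sketch):

```
# Check what the installed torch wheel bundles vs. what the machine provides
import torch

print(torch.version.cuda)              # CUDA runtime version the wheel was built against, e.g. "9.2"
print(torch.backends.cudnn.version())  # cuDNN bundled with the wheel, e.g. 8005
print(torch.cuda.is_available())       # True if the installed NVIDIA driver can actually run it
```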

  
  
Posted 3 years ago

JitteryCoyote63 the agent.cuda_version (or the CUDA_VERSION env var) tells the agent which pytorch wheel to download. The CUDNN library can be included inside any wheel and it will work as long as the cuda / cudart exist on the system; for example, pytorch wheels include the cudnn they use. agent.cudnn_version should actually be deprecated, and is not actually used.

For future reference, the dependency order is:
1. Nvidia drivers
2. CUDA library and CUDA-runtime libraries (libcuda.so / libcudart.so)
3. CUDNN library

(1) & (2) are usually system installed (or docker installed), (3) can have multiple versions in different locations (i.e. inside python packages).
If you are using dockers you can control (2) as it will be part of the docker.
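If it helps, each layer can be queried directly from Python with ctypes - a rough sketch only, the exact sonames (libcuda.so.1 / libcudart.so ...) depend on what is installed on the machine:

```
# Sketch: query each layer of the stack (adjust sonames to what `locate libcudart` finds)
import ctypes

ver = ctypes.c_int()

libcuda = ctypes.CDLL("libcuda.so.1")                 # (1)/(2) installed by the NVIDIA driver
libcuda.cuDriverGetVersion(ctypes.byref(ver))
print("max CUDA supported by driver:", ver.value)     # e.g. 11010 -> 11.1

libcudart = ctypes.CDLL("libcudart.so")               # (2) CUDA runtime (system / toolkit / docker)
libcudart.cudaRuntimeGetVersion(ctypes.byref(ver))
print("cudart (runtime) version:", ver.value)         # e.g. 10010 -> 10.1

import torch                                          # (3) cuDNN bundled inside the wheel
print("cudnn version:", torch.backends.cudnn.version())
```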

  
  
Posted 3 years ago

ok yea now I see it

  
  
Posted 3 years ago

libcudart

  
  
Posted 3 years ago

yes

  
  
Posted 3 years ago

Ok, but when nvcc is not available, the agent uses the output from nvidia-smi, right? On one of my machines, nvcc is not installed and in the experiment logs of the agent running there, agent.cuda = is the version shown with nvidia-smi

Already added to the next agent's version 😉

  
  
Posted 3 years ago

cudnn isn't cuda, it's a separate library.
are you running in docker or on bare metal? you should have cuda installed at /usr/local/cuda-<>

  
  
Posted 3 years ago

I am running on bare metal, and cuda seems to be installed at /usr/lib/x86_64-linux-gnu/libcuda.so.460.39

  
  
Posted 3 years ago

I also did run sudo apt install nvidia-cuda-toolkit

  
  
Posted 3 years ago

Ok, this I cannot locate

  
  
Posted 3 years ago

yes - what happens in the case of installation with pip wheel files?

  
  
Posted 3 years ago

just to be clear, multiple CUDA runtime versions can coexist on a single machine, and the only thing that points to which one you are using when running an application is the library search path (which can be set either with LD_LIBRARY_PATH, or, preferably, by creating a file under /etc/ld.so.conf.d/ which contains the path to your cuda directory and executing ldconfig)
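A quick way to see which copy the loader would resolve right now (sketch only - ctypes.util.find_library reads the ldconfig cache, so it may not reflect an LD_LIBRARY_PATH override on every platform):

```
# Sketch: which libcudart would be picked up at the moment?
import os
import ctypes.util

print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH"))   # explicit override, if any
print("resolved cudart:", ctypes.util.find_library("cudart"))  # e.g. "libcudart.so.10.1", or None
```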

  
  
Posted 3 years ago

and the agent says agent.cudnn_version = 0

  
  
Posted 3 years ago

Interesting idea! (I assume for reporting only, not configuration)

Yes, for reporting only - also to understand which version the agent used to choose the torch wheel it downloads

regarding the cuda check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from the nvidia-smi interface, worth checking though ...

Ok, but when nvcc is not available, the agent uses the output from nvidia-smi, right? On one of my machines, nvcc is not installed and in the experiment logs of the agent running there, agent.cuda = is the version shown with nvidia-smi

  
  
Posted 3 years ago

try:
sudo updatedb
locate libcudart

  
  
Posted 3 years ago

conda sets up cuda, I think

  
  
Posted 3 years ago

But I can do:
```
$ python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.backends.cudnn.version()
8005
```

  
  
Posted 3 years ago

can you initialize a tensor on the GPU?
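e.g. something along these lines (sketch):

```
# Sketch: allocating a tensor directly on the GPU exercises the driver + runtime for real
import torch

x = torch.ones(3, 3, device="cuda")
print(x.device, x.sum().item())
```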

  
  
Posted 3 years ago

JitteryCoyote63 I still don't understand which CUDA version you are actually using on your machine

  
  
Posted 3 years ago

now I can do nvcc --version and I get
Cuda compilation tools, release 10.1, V10.1.243

  
  
Posted 3 years ago

note that the CUDA version reported by the driver was only added to the nvidia-smi output fairly recently

  
  
Posted 3 years ago

this is the cuda driver api, you need libcudart.so (the cuda runtime)

  
  
Posted 3 years ago

thanks for clarifying! Maybe this could be made explicit in the agent logs of the experiments, with something like the following?
agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...

  
  
Posted 3 years ago

agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...

Interesting idea! (I assume for reporting only, not configuration)

... the agent mentioned uses the output from nvcc (2) ...

The dependencies I shared are not how the agent works, but how Nvidia CUDA works 🙂
regarding the cuda check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from the nvidia-smi interface, worth checking though ...
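(If someone wants to experiment, newer drivers do print a "CUDA Version: X.Y" field in the nvidia-smi header, so a rough sketch of parsing it could look like this - not how the agent currently works:)

```
# Sketch: parse the "CUDA Version" field that newer drivers add to the nvidia-smi header
import re
import subprocess

out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", out)
print(match.group(1) if match else "not reported (older driver)")
```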

  
  
Posted 3 years ago

so you don't have cuda installed 🙂

  
  
Posted 3 years ago

because I cannot locate libcudart or because cudnn_version = 0?

  
  
Posted 3 years ago