I am using pip as a package manager, but I start the trains-agent inside a conda env 😄
Do you need to control the cuda drivers?
No worries, cudatoolkit is not part of it. "trains-agent" will create a new clean venv for every experiment, and by default it will not inherit the system packages.
So basically I think you are "stuck" with the cuda drivers you have on the system
Yes I agree, but I get a strange error when using dataloaders:
RuntimeError: [enforce fail at context_gpu.cu:323] error == cudaSuccess. 3 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:323: initialization error
only when I use num_workers > 0
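Roughly this kind of setup (a minimal sketch with a made-up dataset and shapes, not my actual code): the single-process loader runs fine, and the same loop fails as soon as worker processes are spawned.
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset, just to illustrate the shape of the problem
dataset = TensorDataset(torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,)))

# num_workers=0 runs fine; with num_workers > 0 the worker processes
# hit the "context_gpu.cu ... initialization error" shown above
loader = DataLoader(dataset, batch_size=16, num_workers=4)

for batch, labels in loader:
    batch = batch.cuda(non_blocking=True)
```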
What happens is a different error, but it was so weird that I thought it was related to the version installed
You can set torch to be installed last:
post_packages: ["horovod", "torch"]
This will make sure the "trains-agent" version (the one you specified in the "installed packages") is installed last.
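For reference, a sketch of where that would live in trains.conf (based on the default config shipped with trains-agent; double-check the section names against your version):
```
agent {
    package_manager {
        type: pip,
        # anything listed here is (re)installed after all other requirements,
        # so the torch pinned in "installed packages" ends up on top
        post_packages: ["horovod", "torch"]
    }
}
```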
Ooh, that's cool! I could place torch==1.3.1 there
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
Wait, are you using conda as the package manager?
EDIT: meaning conda configured as the package manager in trains.conf
Alright, I am starting to get a better picture of this puzzle
You can switch to docker-mode for better control over the cuda drivers, or use conda and specify the cudatoolkit (this feature will be part of the next RC; meanwhile it will install the cudatoolkit based on the global cuda_version).
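Something along these lines (the conf key mirrors the one in the agent log further down; the daemon flags and docker image are just an illustration, not a prescription):
```
# trains.conf: let the agent build envs with conda instead of pip
agent.package_manager.type = conda

# or start the agent in docker-mode so cuda/cudnn come from the container
trains-agent daemon --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```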
Not really: I just need to find the one that is compatible with torch==1.3.1
JitteryCoyote63 I think this only holds for the conda distribution.
(Actually quite interesting, I wonder what happens if you already installed cudatoolkit...)
My apologies, let me rephrase:
if you are using pip as the package manager and not running in docker-mode, trains-agent
cannot touch the cuda/cudnn drivers (actually the .so libraries).
If you want to verify, you can check with echo $LD_LIBRARY_PATH
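e.g. something like this (typical locations, adjust the paths to your machine):
```
echo $LD_LIBRARY_PATH
# the system-level libraries the agent does not touch in pip mode
ls /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudnn.so* 2>/dev/null
```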
From the answers I saw on the internet, it is most likely related to a mismatch of the cuda/cudnn versions
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the cuda/cudnn libraries required by torch
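A quick way to double-check that from inside the venv the agent built (these are standard torch attributes):
```python
import torch

print(torch.version.cuda)              # cuda runtime the wheel was built with, e.g. '10.1'
print(torch.backends.cudnn.version())  # cudnn version bundled with the wheel
print(torch.cuda.is_available())       # sanity check that the driver/GPU are visible
```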
What probably happens is that torch is installed first by "trains-agent", then the other packages are installed, and if they require a different version pip automatically replaces it.
That's why I suspected trains was installing a different version than the one I expected
I now have a different question: when installing torch from wheel files, am I guaranteed to get the corresponding cuda library and cudnn together, right?
BTW: there is a fix to the priority thing:
https://github.com/allegroai/trains-agent/blob/216b3e21790659467007957d26172698fd74e075/docs/trains.conf#L77
(obviously, if you have dependencies they will be installed first, and then the correct torch will be installed over the previous version)
If you have cuda 10.2, then torch 1.3.1 from the cu101 build should work
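If I remember correctly the default 1.3.1 wheel on PyPI is already the cu101 build, so a plain pin should be enough; the explicit suffixed variants live on the pytorch wheel index:
```
pip install torch==1.3.1
# other cuda variants, e.g.:
# pip install torch==1.3.1+cu92 -f https://download.pytorch.org/whl/torch_stable.html
```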
Oh, I see. I think we are now touching a very important point:
I thought that torch wheels already included the cuda/cudnn libraries, so you don't need to care about the system cuda/cudnn version because eventually only the cuda/cudnn libraries extracted from the torch wheel are used. Is this correct? If not, does that mean one should use conda to install the correct cudatoolkit?
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the cuda drivers on the system
agent.package_manager.type = pip
...
Using base prefix '/home/machine1/miniconda3/envs/py36'
New python executable in /home/machine1/.trains/venvs-builds/3.6/bin/python3.6
Also creating executable in /home/machine1/.trains/venvs-builds/3.6/bin/python
Installing setuptools, pip, wheel...