My apologies, let me rephrase:
if you are using pip as the package manager and not running in docker-mode, trains-agent
cannot touch the cuda/cudnn drivers (actually the .so libraries).
If you want to verify, you can check echo $LD_LIBRARY_PATH
if you have cuda 10.2, then torch 1.3.1 built for cu101 should work
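For example (a minimal sketch, assuming torch is already importable inside the agent's venv), you can check which cuda/cudnn build the installed wheel actually carries:
import torch
print(torch.__version__)                # e.g. 1.3.1
print(torch.version.cuda)               # cuda version the wheel was built against, e.g. 10.1
print(torch.backends.cudnn.version())   # cudnn version used by the wheel
print(torch.cuda.is_available())        # False usually hints at a driver/runtime mismatch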
I am using pip as a package manager, but I start the trains-agent inside a conda env 😄
Oh I see, I think we are now touching a very important point:
I thought that torch wheels already included the cuda/cudnn libraries, so you don't need to care about the system cuda/cudnn version because eventually only the cuda/cudnn libraries extracted from the torch wheels are used. Is this correct? If not, does that mean one should use conda to install the correct cudatoolkit (cuda/cudnn)?
BTW: there is a fix to the priority thing:
https://github.com/allegroai/trains-agent/blob/216b3e21790659467007957d26172698fd74e075/docs/trains.conf#L77
Do you need to control the cuda drivers ?
What probably happens is that torch is first installed via "trains-agent", then the other packages are installed; if they require a different version, pip automatically replaces it.
I now have a different question: when installing torch from wheel files, I am guaranteed to have the corresponding cuda library and cudnn bundled together, right?
agent.package_manager.type = pip
...
Using base prefix '/home/machine1/miniconda3/envs/py36'
New python executable in /home/machine1/.trains/venvs-builds/3.6/bin/python3.6
Also creating executable in /home/machine1/.trains/venvs-builds/3.6/bin/python
Installing setuptools, pip, wheel...
No worries, cudatoolkit is not part of it. "trains-agent" will create a new clean venv for every experiment, and by default it will not inherit the system packages.
So basically I think you are "stuck" with the cuda drivers you have on the system
Yes I agree, but I get a strange error when using dataloaders: RuntimeError: [enforce fail at context_gpu.cu:323] error == cudaSuccess. 3 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:323: initialization error
only when I use num_workers > 0
What happens is a different error, but it was so weird that I thought it was related to the installed version
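For context, a minimal sketch of the failing pattern (hypothetical repro, the dataset and worker count are made up):
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
loader = DataLoader(ds, num_workers=4)   # the error only shows up with num_workers > 0
for batch in loader:                     # worker processes are started here
    pass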
You can set torch to be installed last:
post_packages: ["horovod", "torch"]
Which will make sure the "trains-agent" version (the one you specified in the "installed packages") will be installed last.
hoo that's cool! I could place torch==1.3.1 there
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
Wait, are you using conda as package manager ?
EDIT: meaning configured in trains.conf as package manager
alright I am starting to get a better picture of this puzzle
That's why I suspected trains was installing a different version than the one I expected
You can switch to docker-mode for better control over cuda drivers, or use conda and specify cudatoolkit (this feature will be part of the next RC; meanwhile it will install the cudatoolkit based on the global cuda_version).
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the cuda drivers on the system
(obviously if you have dependencies, they will be installed before, and then the correct torch will be installed over the previous version)
Not really: I just need to find the one that is compatible with torch==1.3.1
JitteryCoyote63 I think this only holds for the conda distribution.
(Actually quite interesting, I wonder what happens if you already installed cudatoolkit...)
From the answers I saw on the internet, it is most likely related to a mismatch of cuda/cudnn versions
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the cuda/cudnn libraries required by torch
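As a quick sanity check (just a sketch, the path layout is assumed from a typical torch wheel install), you can list the shared libraries that ship inside the wheel itself:
import os
import torch

lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
# the wheel's bundled .so files live here, independent of any system cuda install
print(sorted(os.listdir(lib_dir)))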