I am using pip as a package manager, but I start the trains-agent inside a conda env 😄
Do you need to control the cuda drivers?
No worries, cudatoolkit is not part of it. "trains-agent" will create a new clean venv for every experiment, and by default it will not inherit the system packages.
So basically I think you are "stuck" with the cuda drivers you have on the system
Yes I agree, but I get a strange error when using dataloaders:
RuntimeError: [enforce fail at context_gpu.cu:323] error == cudaSuccess. 3 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:323: initialization error
only when I use num_workers > 0
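Roughly this kind of setup (a minimal sketch with a made-up dataset and shapes, not my actual code): the single-process loader runs fine, and the same loop fails as soon as worker processes are spawned.
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset, just to illustrate the shape of the problem
dataset = TensorDataset(torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,)))

# num_workers=0 runs fine; with num_workers > 0 the worker processes
# hit the "context_gpu.cu ... initialization error" shown above
loader = DataLoader(dataset, batch_size=16, num_workers=4)

for batch, labels in loader:
    batch = batch.cuda(non_blocking=True)
```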
What happens is a different error, but it was so weird that I thought it was related to the version installed
You can set torch to be installed last:
post_packages: ["horovod", "torch"]
This will make sure the "trains-agent" version (the one you specified in the "installed packages") is installed last.
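For reference, a sketch of where that would live in trains.conf (based on the default config shipped with trains-agent; double-check the section names against your version):
```
agent {
    package_manager {
        type: pip,
        # anything listed here is (re)installed after all other requirements,
        # so the torch pinned in "installed packages" ends up on top
        post_packages: ["horovod", "torch"]
    }
}
```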
Ooh, that's cool! I could place torch==1.3.1 there
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
Wait, are you using conda as the package manager?
EDIT: meaning conda configured as the package manager in trains.conf
Alright, I am starting to get a better picture of this puzzle
You can switch to docker-mode for better control over the cuda drivers, or use conda and specify the cudatoolkit (this feature will be part of the next RC; meanwhile it will install the cudatoolkit based on the global cuda_version).
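Something along these lines (the conf key mirrors the one in the agent log further down; the daemon flags and docker image are just an illustration, not a prescription):
```
# trains.conf: let the agent build envs with conda instead of pip
agent.package_manager.type = conda

# or start the agent in docker-mode so cuda/cudnn come from the container
trains-agent daemon --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```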
Not really: I just need to find the one that is compatible with torch==1.3.1
JitteryCoyote63 I think this only holds for the conda distribution.
(Actually quite interesting, I wonder what happens if you already installed cudatoolkit...)
My apologies, let me rephrase:
if you are using pip as the package manager and not running in docker-mode, trains-agent
cannot touch the cuda/cudnn drivers (actually the .so libraries).
If you want to verify, you can check with echo $LD_LIBRARY_PATH
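e.g. something like this (typical locations, adjust the paths to your machine):
```
echo $LD_LIBRARY_PATH
# the system-level libraries the agent does not touch in pip mode
ls /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudnn.so* 2>/dev/null
```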
From the answers I saw on the internet, it is most likely related to a mismatch of the cuda/cudnn versions
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the cuda/cudnn libraries required by torch
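A quick way to double-check that from inside the venv the agent built (these are standard torch attributes):
```python
import torch

print(torch.version.cuda)              # cuda runtime the wheel was built with, e.g. '10.1'
print(torch.backends.cudnn.version())  # cudnn version bundled with the wheel
print(torch.cuda.is_available())       # sanity check that the driver/GPU are visible
```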
What probably happens is that torch is installed first by "trains-agent", then the other packages are installed, and if they require a different version pip automatically replaces it.
That's why I suspected trains was installing a different version than the one I expected
I now have a different question: when installing torch from wheel files, am I guaranteed to get the corresponding cuda library and cudnn together, right?
BTW: there is a fix to the priority thing:
https://github.com/allegroai/trains-agent/blob/216b3e21790659467007957d26172698fd74e075/docs/trains.conf#L77
(obviously, if you have dependencies they will be installed first, and then the correct torch will be installed over the previous version)
If you have cuda 10.2, then torch 1.3.1 from the cu101 build should work
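If I remember correctly the default 1.3.1 wheel on PyPI is already the cu101 build, so a plain pin should be enough; the explicit suffixed variants live on the pytorch wheel index:
```
pip install torch==1.3.1
# other cuda variants, e.g.:
# pip install torch==1.3.1+cu92 -f https://download.pytorch.org/whl/torch_stable.html
```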
Oh, I see. I think we are now touching a very important point:
I thought that torch wheels already included the cuda/cudnn libraries, so you don't need to care about the system cuda/cudnn version because eventually only the cuda/cudnn libraries extracted from the torch wheel are used. Is this correct? If not, does that mean one should use conda to install the correct cudatoolkit?
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the cuda drivers on the system
agent.package_manager.type = pip
...
Using base prefix '/home/machine1/miniconda3/envs/py36'
New python executable in /home/machine1/.trains/venvs-builds/3.6/bin/python3.6
Also creating executable in /home/machine1/.trains/venvs-builds/3.6/bin/python
Installing setuptools, pip, wheel...