when I launch this:
(trains-agent) lv-beast@lv-beast:~/dev/MachineLearning/scripts/cmd_launcer$ docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi
it worked, so maybe it's an issue with how trains passes the device to the docker run command?
maybe it's possible to overcome this by setting NVIDIA_VISIBLE_DEVICES somehow, and then using --gpus all?
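A note on why the direct call may behave differently: the error text later in this thread reads like a CSV parse error, which suggests docker treats the --gpus value as comma-separated fields. The sketch below uses Python's shlex and csv modules as stand-ins for the shell and for docker's parser (an assumption, docker's own parser may differ in details) to show what the inner double quotes change. The helper name csv_fields is made up for illustration.

```python
import csv
import shlex

# The command as typed in the shell.
typed = "docker run --gpus '\"device=0,1\"' nvidia/cuda:10.1-base nvidia-smi"

# The shell strips the outer single quotes; the inner double quotes survive.
argv = shlex.split(typed)
gpus_value = argv[argv.index("--gpus") + 1]
print(gpus_value)  # "device=0,1"

def csv_fields(value):
    """Split a value the way a generic CSV reader would (stand-in for docker)."""
    return next(csv.reader([value]))

print(csv_fields(gpus_value))    # ['device=0,1']
print(csv_fields("device=0,1"))  # ['device=0', '1']
```

With the double quotes the comma stays inside one field; without them the value splits into two fields, which may be what breaks the multi-GPU case.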
I did, and it set up the docker with python 3.6 (I think because the agent.default_python parameter is 3.6 by default)
is it possible to change this parameter when I create the experiment? (I want to work with python 3.7)
yes, I specifically want python 3.7, I will try to get another docker with python 3.7 somehow
thanks AgitatedDove14, I will try to use docker with pip as the package manager and see if it solves my issues
I use this docker: nvidia/cuda:10.0-runtime-ubuntu18.04. I'm a docker noob so far, so I will try to search. I assumed it installed python3.6 because it appears in the trains.conf
do you know if it just comes with python3.6?
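One way to check which interpreter an image actually provides is to run a short script inside it. A minimal sketch, where the required version (3, 7) is taken from the messages above and the helper name meets_requirement is made up for illustration:

```python
import sys

REQUIRED = (3, 7)  # the version wanted in this thread

def meets_requirement(version_info, required=REQUIRED):
    """Return True if the interpreter's (major, minor) is at least `required`."""
    return tuple(version_info[:2]) >= tuple(required)

found = "{}.{}".format(*sys.version_info[:2])
status = "ok" if meets_requirement(sys.version_info) else "too old"
print("python", found, status)
```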
The cudatoolkit version inside the experiment is 10.1, and trains tries to work with 10.2, probably for the same reason that 10.2 is displayed in nvidia-smi
Didn't use it so far, but I will start 🙂
Is it something that I can configure from the call to task.init? (my goal is that I won't be required to change it manually)
thanks for the help!
I tried now:
trains-agent daemon --gpus "0,1" --queue dual_gpu --docker --foreground
but I get the same error when I execute train
is the docker flow better supported than the conda one? is there a guide for the configuration that dockers require?
this is the error
Running Docker:
Executing: ('docker', 'run', '-t', '--gpus', 'device=0,1', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu0,1', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/root/.gitconfig', '-v', '/tmp/.trains_agent.li48l7ii.cfg:/root/trains.conf', '-v', '/tmp/trains_agent.ssh.uv6dxcw7:/root/.ssh', '-v', '/home/lv-beast/.trains/apt-cache.2:/var/cache/apt/archives', '-v', '/home/lv-beast/.trains/pip-cache:/root/.cache/pip', '-v', '/...
Hi TimelyPenguin76
you are right, it says cuda version 10.2 (even though I installed only cuda 10.1, weird)
do you know why it's 10.2?
and do you know why trains relies on that? (instead of looking in the python environment of the executed script?)
when my system was "clean" I installed cuda 10.1 (I never installed cuda 10.2), hope I'm not mistaken
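For context, the "CUDA Version" field in the nvidia-smi header reports the highest CUDA version the installed driver supports, not the toolkit installed on the machine or in the Python environment, so seeing 10.2 with only a 10.1 toolkit installed is expected. A small sketch that pulls that field out of the header; the sample header text is illustrative (not copied from this thread) and the helper name driver_cuda_version is hypothetical:

```python
import re

def driver_cuda_version(nvidia_smi_text):
    """Return the 'CUDA Version' field from nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None

# Illustrative header line; real text would come from running nvidia-smi.
sample = "| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |"

toolkit = (10, 1)  # the cudatoolkit version reported inside the experiment
driver_max = driver_cuda_version(sample)
print("driver supports up to", driver_max, "toolkit is", toolkit)
print("toolkit usable with this driver:", toolkit <= driver_max)
```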
got it thanks!
Is it possible to use different dockers (containing different cuda versions) in different experiments?
or do I have to open different queues for that? (or something like that)
I can give it a shot (I'm using conda now). What is the overhead of moving to dockers, given that I don't have hands-on docker experience?
WackyRabbit7 thanks for the suggestions
the first suggestion (without the quotes) gets the same result.
the second produces
invalid argument "device="device=0,1"" for "--gpus" flag: parse error on line 1, column 7: bare " in non-quoted-field
(this is the command it executes)
Executing: ('docker', 'run', '-t', '--gpus', 'device="device=0,1"', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu"device=0,1"', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/roo...
yes, when I run docker itself
docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi
it works, but when I do it with trains as WackyRabbit7 suggested (with the same quotes):
trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground
it gives this error:
invalid argument "device="device=0,1"" for "--gpus" flag: parse error on line 1, column 7: bare " in non-quoted-field
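Judging only by the two Executing lines in this thread, the agent seems to prepend device= to whatever is passed to --gpus, which would explain the doubled prefix in the error. A sketch of that assumed behavior and of one possible alternative; neither function is actual trains-agent code, and both names are hypothetical:

```python
def assumed_agent_gpus_arg(gpus):
    """Mimic the behavior inferred from the logs: prefix the raw value."""
    return "device=" + gpus

def quoted_gpus_arg(gpus):
    """Possible alternative: wrap the whole value in literal double quotes,
    matching the string that worked when docker was called directly."""
    return '"device={}"'.format(gpus)

print(assumed_agent_gpus_arg("0,1"))           # device=0,1
print(assumed_agent_gpus_arg('"device=0,1"'))  # device="device=0,1"
print(quoted_gpus_arg("0,1"))                  # "device=0,1"
```

If this reading is right, no quoting on the trains-agent command line can produce the value docker accepted directly, so the change would have to happen where the agent builds the docker command.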
you are right, I have only 2 gpus right now, so basically I can launch with --gpus all and it will work
but I want to create the scripts for long-term use (deploying on larger machines with more gpus)
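For scripts that should outlive the current two-GPU machine, the daemon command can be generated from a list of GPU indices instead of being hard-coded. A minimal sketch using only the flags that appear in this thread; the function name daemon_command is a placeholder:

```python
import shlex

def daemon_command(gpu_ids, queue):
    """Build the trains-agent daemon argument list for the given GPUs."""
    if not gpu_ids:
        raise ValueError("at least one GPU index is required")
    gpus = ",".join(str(i) for i in gpu_ids)
    return ["trains-agent", "daemon", "--gpus", gpus,
            "--queue", queue, "--docker", "--foreground"]

cmd = daemon_command([0, 1], "dual_gpu")
line = " ".join(shlex.quote(part) for part in cmd)
print(line)
# trains-agent daemon --gpus 0,1 --queue dual_gpu --docker --foreground
```

This only builds the command; it does not address the --gpus parsing error discussed in this thread.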
docker:
Client: Docker Engine - Community
Version: 19.03.6
API version: 1.40
Go version: go1.12.16
Git commit: 369ce74a3c
Built: Thu Feb 13 01:27:49 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
V...