I use this docker image: nvidia/cuda:10.0-runtime-ubuntu18.04. I'm a docker noob so far, so I will try to search. I assumed it installed python 3.6 because that's what appears in the trains.conf
do you know if it just comes with python 3.6?
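(a quick check I could run to see what the base image actually ships with, assuming bash is available inside it:)
docker run --rm nvidia/cuda:10.0-runtime-ubuntu18.04 bash -c "python3 --version || echo 'no python3 in the base image'"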
yes, when I run docker by itself:
docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi
it works, but when I do it with trains like WackyRabbit7 suggested (with the same quotes):
trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground
it gives this error:
invalid argument "device="device=0,1"" for "--gpus" flag: parse error on line 1, column 7: bare " in non-quoted-field
this is the error
Running Docker:
Executing: ('docker', 'run', '-t', '--gpus', 'device=0,1', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu0,1', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/root/.gitconfig', '-v', '/tmp/.trains_agent.li48l7ii.cfg:/root/trains.conf', '-v', '/tmp/trains_agent.ssh.uv6dxcw7:/root/.ssh', '-v', '/home/lv-beast/.trains/apt-cache.2:/var/cache/apt/archives', '-v', '/home/lv-beast/.trains/pip-cache:/root/.cache/pip', '-v', '/...
I did, and it set up the docker with python 3.6 (I think because the agent.default_python parameter is 3.6 by default)
is it possible to change this parameter when I create the experiment? (I want to work with python 3.7)
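(if it turns out to be per agent rather than per experiment, I guess I could just edit trains.conf on the agent machine, assuming agent.default_python is the right key, something like:)
agent {
    default_python: 3.7
}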
when my system was "clean" I installed cuda 10.1 (I never installed cuda 10.2), hope I'm not mistaken
thanks for the help!
I just tried:
trains-agent daemon --gpus "0,1" --queue dual_gpu --docker --foreground
but I get the same error when I execute the training
you are right, I have only 2 gpus right now, so basically I can launch with --gpus all and it will work
but I want to create the scripts for long-term use (to deploy on larger machines with more gpus)
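(for example, on a 4-gpu machine I imagine the launch script would just start one agent per gpu pair, each in its own shell, roughly:)
trains-agent daemon --gpus "0,1" --queue dual_gpu --docker --foreground
trains-agent daemon --gpus "2,3" --queue dual_gpu --docker --foreground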
docker version:
Client: Docker Engine - Community
 Version:        19.03.6
 API version:    1.40
 Go version:     go1.12.16
 Git commit:     369ce74a3c
 Built:          Thu Feb 13 01:27:49 2020
 OS/Arch:        linux/amd64
 Experimental:   false

Server: Docker Engine - Community
 Engine:
  V...
maybe it's possible to overcome this by setting NVIDIA_VISIBLE_DEVICES somehow, and then use --gpus all?
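(if I understand the nvidia runtime correctly, plain docker supports something along these lines, so maybe the agent could do the equivalent:)
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1 nvidia/cuda:10.1-base nvidia-smi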
WackyRabbit7 thanks for the suggestions
the first suggestion (without the quotes) gives the same result.
the second produces:
invalid argument "device="device=0,1"" for "--gpus" flag: parse error on line 1, column 7: bare " in non-quoted-field
(this is the command it executed:)
Executing: ('docker', 'run', '-t', '--gpus', 'device="device=0,1"', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu"device=0,1"', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/roo...
when I launch this:
(trains-agent) lv-beast@lv-beast:~/dev/MachineLearning/scripts/cmd_launcer$ docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi
it worked, so maybe it's an issue with how trains passes the device to the docker run command?
The version of the cudatoolkit inside the experiment is 10.1, but trains tries to work with 10.2, probably for the same reason it is displayed in nvidia-smi
I can give it a shot (I'm using conda now). What is the overhead of moving to dockers, given that I don't have hands-on docker experience?
got it thanks!
Is it possible to use different dockers (containing different cuda versions) in different experiments?
or do I have to open different queues for that? (or something like that)
Is it something that I can configure from the call to task.init? (my goal is to avoid having to change it manually)
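(something like this is what I have in mind, assuming the trains SDK exposes Task.set_base_docker for choosing the image per task; the project/task names below are just placeholders:)
from trains import Task

# create the task as usual
task = Task.init(project_name="examples", task_name="cuda-10.1-experiment")
# assumption: this tells the agent which docker image to run this specific task in
task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04")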
yes, I specifically want python 3.7, I will try to get another docker with python 3.7 somehow
Didn't use it so far, but I will start 🙂
thanks AgitatedDove14 , I will try to use docker with pip as the package manager and see if it solves my issues
is the flow using dockers more supported than conda? is there a guide regarding the configuration required for dockers?
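(from what I understand, switching the agent to pip should just be a trains.conf setting, something like this, assuming agent.package_manager.type is the relevant key:)
agent {
    package_manager {
        type: pip
    }
}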
Hi TimelyPenguin76
you are right, it shows cuda version 10.2 (even though I only installed cuda 10.1, weird)
do you know why it's 10.2?
and do you know why trains relies on that? (instead of looking at the python environment of the executed script?)
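(if the 10.2 comes from what the driver reports in nvidia-smi rather than from the toolkit I installed, maybe I can just pin it in trains.conf, assuming agent.cuda_version is the key that overrides the detected version:)
agent {
    cuda_version: 10.1
}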