this is the error
Running Docker:
Executing: ('docker', 'run', '-t', '--gpus', 'device=0,1', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu0,1', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/root/.gitconfig', '-v', '/tmp/.trains_agent.li48l7ii.cfg:/root/trains.conf', '-v', '/tmp/trains_agent.ssh.uv6dxcw7:/root/.ssh', '-v', '/home/lv-beast/.trains/apt-cache.2:/var/cache/apt/archives', '-v', '/home/lv-beast/.trains/pip-cache:/root/.cache/pip', '-v', '/home/lv-beast/.trains/pip-download-cache:/root/.trains/pip-download-cache', '-v', '/home/lv-beast/.trains/cache:/trains_agent_cache', '-v', '/home/lv-beast/.trains/vcs-cache.2:/root/.trains/vcs-cache', '--rm', 'nvidia/cuda:10.0-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U trains-agent==0.16.2rc0 ; cp /root/trains.conf /root/default_trains.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m trains_agent execute --disable-monitoring --id 42f368a906d5447f91447ce78897ca0f')
docker: Error response from daemon: cannot set both Count and DeviceIDs on device request.
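A note on where this error likely comes from: the Docker CLI reads the --gpus value as a comma-separated (CSV) record, so an unquoted device=0,1 splits into two fields, device=0 and a bare 1, and the bare field is read as a GPU count. The Python sketch below only mimics that behaviour to illustrate it. parse_gpus_value is a hypothetical helper, not Docker or trains-agent code, and the real parser (written in Go) differs in detail.

```python
import csv


def parse_gpus_value(value):
    """Roughly mimic CSV-style parsing of a --gpus value.

    Hypothetical helper for illustration only; not Docker's actual code.
    """
    fields = next(csv.reader([value]))
    request = {"count": None, "device_ids": []}
    for field in fields:
        if '"' in field:
            # Simplification: any quote left over after CSV parsing is treated
            # as a stray quote in an unquoted field, which Go's reader rejects.
            raise ValueError('bare " in non-quoted field: ' + field)
        key, sep, val = field.partition("=")
        if not sep:
            # A field without "=" is treated as a GPU count.
            request["count"] = key
        elif key == "device":
            request["device_ids"] = val.split(",")
    return request


# Unquoted: the comma splits the value, so a device ID and a count are both set.
print(parse_gpus_value('device=0,1'))
# Quoted as a single CSV field: only device IDs are set.
print(parse_gpus_value('"device=0,1"'))
```

Under this reading, the first call sets both a count and a device list, which matches the "cannot set both Count and DeviceIDs" complaint, while the quoted form keeps 0,1 together as device IDs.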
Can you share the exception for --gpus "0,1"?
yes, when I run docker itself
docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi
it works, but when I run it with trains as WackyRabbit7 suggested (with the same quotes):
trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground
it gives this error:
invalid argument "device="device=0,1"" for "--gpus" flag: parse error on line 1, column 7: bare " in non-quoted-field
So running docker with '"device=0,1"' works? We will check that
when I launch this:
(trains-agent) lv-beast@lv-beast:~/dev/MachineLearning/scripts/cmd_launcer$ docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi
it worked, so maybe it's an issue with how trains passes the device to the docker run command?
WackyRabbit7 thanks for the suggestions
the first suggestion (without the quotes) gets the same result.
the second produces
invalid argument "device="device=0,1"" for "--gpus" flag: parse error on line 1, column 7: bare " in non-quoted-field
(this produces the following execute command)
Executing: ('docker', 'run', '-t', '--gpus', 'device="device=0,1"', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu"device=0,1"', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/root/.gitconfig', '-v', '/tmp/.trains_agent.qqiy2k_0.cfg:/root/trains.conf', '-v', '/tmp/trains_agent.ssh.5_ywyle_:/root/.ssh', '-v', '/home/lv-beast/.trains/apt-cache.2:/var/cache/apt/archives', '-v', '/home/lv-beast/.trains/pip-cache:/root/.cache/pip', '-v', '/home/lv-beast/.trains/pip-download-cache:/root/.trains/pip-download-cache', '-v', '/home/lv-beast/.trains/cache:/trains_agent_cache', '-v', '/home/lv-beast/.trains/vcs-cache.2:/root/.trains/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U trains-agent==0.16.2rc0 ; cp /root/trains.conf /root/default_trains.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m trains_agent execute --disable-monitoring --id 7b0abf01c13a4284ac51a06b9589b691')
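To see why the doubly quoted form ends up as device="device=0,1": the shell strips only the outer single quotes, so trains-agent receives the literal string "device=0,1" (double quotes included), and judging by the Executing line above it then adds a device= prefix of its own. A minimal Python sketch of those two steps follows. build_gpus_flag is a hypothetical stand-in inferred from the log output, not the actual trains-agent code.

```python
import shlex


def build_gpus_flag(gpus_arg):
    # Hypothetical stand-in: prefix "device=" the way the log output suggests.
    return "device=" + gpus_arg


command = """trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground"""
argv = shlex.split(command)  # split the way a POSIX shell would
gpus_arg = argv[argv.index("--gpus") + 1]

print(gpus_arg)                   # the double quotes survive shell parsing
print(build_gpus_flag(gpus_arg))  # same value as in the Executing line
print(build_gpus_flag("0,1"))     # a plain 0,1 yields the unquoted form
```

If this guess about the prefix is right, neither quoting style can produce the '"device=0,1"' value that works with plain docker run, which would point at how the flag is assembled rather than at the shell quoting.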
I assume trains passes it as is, so I think the quoting I mentioned might work
TimelyPenguin76, in standard Docker the quoting you mentioned is wrong, since the whole argument is passed as is, hence the tricky double quotation I posted above
Did you get the same error message? What do you have in the error under 'device=XXX'?
What about this? Do you get the same as the first one (device=0,1), or with quotes (device="0,1")?
You should try trains-agent daemon --gpus device=0,1 --queue dual_gpu --docker --foreground
and if it doesn't work try quoting trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground
you are right, I have only 2 gpus right now, so basically I can launch --gpus all and it will work
but I want to create the scripts for longer use (deploy on larger machines with more gpus)
docker:
Client: Docker Engine - Community
 Version:           19.03.6
 API version:       1.40
 Go version:        go1.12.16
 Git commit:        369ce74a3c
 Built:             Thu Feb 13 01:27:49 2020
 OS/Arch:           linux/amd64
 Experimental:      false
Server: Docker Engine - Community
 Engine:
  Version:          19.03.6
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.16
  Git commit:       369ce74a3c
  Built:            Thu Feb 13 01:26:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.12
  GitCommit:        35bd7a5f69c13e1563af8a93431411cd9ecf5021
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
os version:
lv-beast@lv-beast:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
Driver Version: 440.100
You can try running the trains-agent with --gpus all if you have two GPUs on your machine.
The flag --gpus all assigns all available GPUs to the Docker container.
Did you get the same error message? What do you have in the error under 'device=XXX'?
What OS, Docker version, and NVIDIA driver do you have on the machine?
maybe it's possible to overcome this by setting NVIDIA_VISIBLE_DEVICES somehow, and then use --gpus all?
thanks for the help!
I tried now:
trains-agent daemon --gpus "0,1" --queue dual_gpu --docker --foreground
but I get the same error when I execute the training
Hi RattySeagull0,
Can you try quoting the GPU numbers, like --gpus "0,1"?
Looks like the same issue as https://github.com/allegroai/trains-agent/issues/35