Hi everyone, I am trying to use docker mode for trains-agent, but it seems that it has a problem with the use of multiple GPUs. This is my trains-agent command: trains-agent daemon --gpus 0,1 --queue dual_gpu --docker --foreground and it gets the error: Doc

Answered

Hi everyone,
I am trying to use docker mode for trains-agent, but it seems to have a problem with multiple GPUs.

This is my trains-agent command: trains-agent daemon --gpus 0,1 --queue dual_gpu --docker --foreground
and it fails with the error:
docker: Error response from daemon: cannot set both Count and DeviceIDs on device request.

This is the executed command:
Running Task 9a27c51ed4e547a8b59240c962007308 inside docker: nvidia/cuda:10.1-runtime-ubuntu18.04
Running Docker:
Executing: ('docker', 'run', '-t', '--gpus', 'device=0,1', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu0,1', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/root/.gitconfig', '-v', '/tmp/.trains_agent.qqls6c0w.cfg:/root/trains.conf', '-v', '/tmp/trains_agent.ssh.yokigdt6:/root/.ssh', '-v', '/home/lv-beast/.trains/apt-cache.2:/var/cache/apt/archives', '-v', '/home/lv-beast/.trains/pip-cache:/root/.cache/pip', '-v', '/home/lv-beast/.trains/pip-download-cache:/root/.trains/pip-download-cache', '-v', '/home/lv-beast/.trains/cache:/trains_agent_cache', '-v', '/home/lv-beast/.trains/vcs-cache.2:/root/.trains/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U trains-agent==0.16.2rc0 ; cp /root/trains.conf /root/default_trains.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m trains_agent execute --disable-monitoring --id 9a27c51ed4e547a8b59240c962007308')

docker: Error response from daemon: cannot set both Count and DeviceIDs on

If someone is familiar with this topic, I would be happy for help 🙂
(TimelyPenguin76 helped me last time)

  				
Posted 4 years ago by RattySeagull0
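A possible explanation for the error in the question, sketched in Python. It assumes (this is not confirmed anywhere in the thread) that the docker CLI reads the --gpus value as one CSV record, treats a bare field as a GPU count, and treats device=... as a list of device IDs. Under that assumption, device=0,1 splits at the comma into device=0 and 1, so both a count and device IDs end up in the request. The function name and the dictionary keys are made up for illustration.

```python
import csv


def parse_gpus_value(value: str) -> dict:
    """Simplified, assumed model of how a --gpus value is interpreted."""
    # The value is read as a single CSV record: commas separate fields
    # unless the whole field is wrapped in double quotes.
    fields = next(csv.reader([value]))
    request = {}
    for field in fields:
        key, sep, val = field.partition("=")
        if not sep:
            # A bare field such as "1" or "all" is taken as a GPU count.
            request["count"] = -1 if key == "all" else int(key)
        elif key == "device":
            request["device_ids"] = val.split(",")
        elif key == "count":
            request["count"] = -1 if val == "all" else int(val)
        else:
            request[key] = val
    return request


# Unquoted: the comma splits the value, so a count AND device IDs are set.
print(parse_gpus_value("device=0,1"))    # {'device_ids': ['0'], 'count': 1}
# Wrapped in literal double quotes: one field, two device IDs, no count.
print(parse_gpus_value('"device=0,1"'))  # {'device_ids': ['0', '1']}
```

If this model is right, the value has to reach docker with literal double quotes around it, which is what the '"device=0,1"' form used later in the thread does.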


Answers 17

This is the error:
Running Docker:

Executing: ('docker', 'run', '-t', '--gpus', 'device=0,1', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu0,1', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/root/.gitconfig', '-v', '/tmp/.trains_agent.li48l7ii.cfg:/root/trains.conf', '-v', '/tmp/trains_agent.ssh.uv6dxcw7:/root/.ssh', '-v', '/home/lv-beast/.trains/apt-cache.2:/var/cache/apt/archives', '-v', '/home/lv-beast/.trains/pip-cache:/root/.cache/pip', '-v', '/home/lv-beast/.trains/pip-download-cache:/root/.trains/pip-download-cache', '-v', '/home/lv-beast/.trains/cache:/trains_agent_cache', '-v', '/home/lv-beast/.trains/vcs-cache.2:/root/.trains/vcs-cache', '--rm', 'nvidia/cuda:10.0-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U trains-agent==0.16.2rc0 ; cp /root/trains.conf /root/default_trains.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m trains_agent execute --disable-monitoring --id 42f368a906d5447f91447ce78897ca0f')

docker: Error response from daemon: cannot set both Count and DeviceIDs on device request.

  				
Posted 4 years ago by RattySeagull0

Can you share the exception for --gpus "0,1"?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Yes. When I run docker directly:
docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi

it works, but when I run trains-agent as WackyRabbit7 suggested (with the same quotes):
trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground

it gives this error:
invalid argument "device="device=0,1"" for "--gpus" flag: parse error on line 1, column 7: bare " in non-quoted-field

  				
Posted 4 years ago by RattySeagull0
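The parse error above quotes the value as device="device=0,1", which suggests that the agent puts device= in front of whatever is given to --gpus. A minimal Python sketch of that reading follows; both helper names are hypothetical and this is not the agent's actual code. It also shows where the first quote lands, which happens to be offset 7 when counting from zero, the same number as the reported column.

```python
def agent_gpus_flag(user_value: str) -> str:
    # Hypothetical helper: mirrors the "device=" prefix seen in the error text.
    return "device=" + user_value


def first_bare_quote(flag_value: str) -> int:
    """Return the 0-based offset of a double quote inside an unquoted value.

    Returns -1 when the value starts with a quote (a properly quoted field)
    or contains no quote at all. Simplified: only one field is considered.
    """
    if flag_value.startswith('"'):
        return -1
    return flag_value.find('"')


quoted = agent_gpus_flag('"device=0,1"')
print(quoted)                    # device="device=0,1"
print(first_bare_quote(quoted))  # 7

plain = agent_gpus_flag("0,1")
print(plain)                     # device=0,1
print(first_bare_quote(plain))   # -1
```

Read this way, quoting on the trains-agent command line cannot help: the quotes end up in the middle of the value instead of around all of it.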

So running docker with '"device=0,1"' works? We will check that.

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

When I launch this:
(trains-agent) lv-beast@lv-beast:~/dev/MachineLearning/scripts/cmd_launcer$ docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi
it works, so maybe it's an issue with how trains passes the device to the docker run command?

  				
Posted 4 years ago by RattySeagull0
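To illustrate the point about how the device is passed: when a program starts docker with an argument list (as the agent's "Executing: (...)" log suggests), there is no shell to strip quotes, so the double quotes have to be part of the argument itself. The sketch below is a hypothetical helper, not how trains-agent builds its command. It uses shlex.join (Python 3.8+) only to print the equivalent shell command.

```python
import shlex


def docker_gpu_command(gpu_ids, image, command):
    """Build a docker argv list that selects specific GPUs (illustrative only)."""
    # The double quotes are literal characters inside the argument, so the
    # whole "device=0,1" text reaches docker as a single quoted value.
    device_value = '"device={}"'.format(",".join(str(i) for i in gpu_ids))
    return ["docker", "run", "--gpus", device_value, image, *command]


argv = docker_gpu_command([0, 1], "nvidia/cuda:10.1-base", ["nvidia-smi"])
print(argv[3])           # "device=0,1"
print(shlex.join(argv))  # docker run --gpus '"device=0,1"' nvidia/cuda:10.1-base nvidia-smi
```

The printed shell form is the same command that worked when typed by hand, which is consistent with the idea that only the quoting differs.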

🤦 🤔 💁‍♂️

  				
Posted 4 years ago by WackyRabbit7

WackyRabbit7, thanks for the suggestions.
The first suggestion (without the quotes) gives the same result.
The second produces:
invalid argument "device="device=0,1"" for "--gpus" flag: parse error on line 1, column 7: bare " in non-quoted-field
(this is the executed command it produces)
Executing: ('docker', 'run', '-t', '--gpus', 'device="device=0,1"', '-e', 'TRAINS_WORKER_ID=lv-beast:gpu"device=0,1"', '-v', '/home/lv-beast/.git-credentials:/root/.git-credentials', '-v', '/home/lv-beast/.gitconfig:/root/.gitconfig', '-v', '/tmp/.trains_agent.qqiy2k_0.cfg:/root/trains.conf', '-v', '/tmp/trains_agent.ssh.5_ywyle_:/root/.ssh', '-v', '/home/lv-beast/.trains/apt-cache.2:/var/cache/apt/archives', '-v', '/home/lv-beast/.trains/pip-cache:/root/.cache/pip', '-v', '/home/lv-beast/.trains/pip-download-cache:/root/.trains/pip-download-cache', '-v', '/home/lv-beast/.trains/cache:/trains_agent_cache', '-v', '/home/lv-beast/.trains/vcs-cache.2:/root/.trains/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U trains-agent==0.16.2rc0 ; cp /root/trains.conf /root/default_trains.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m trains_agent execute --disable-monitoring --id 7b0abf01c13a4284ac51a06b9589b691')

  				
Posted 4 years ago by RattySeagull0

Sounds right, thanks 🙂

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

I assume trains passes it as-is, so I think the quoting I mentioned might work.

  				
Posted 4 years ago by WackyRabbit7

TimelyPenguin76, in standard docker the quoting you mentioned is wrong, since the whole argument is passed as-is. Hence the tricky double quoting I posted above.

  				
Posted 4 years ago by WackyRabbit7

Did you get the same error message? What do you have in the error under 'device=XXX'?

What about this? Do you get the same as the first one (device=0,1), or with quotes (device="0,1")?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

You should try trains-agent daemon --gpus device=0,1 --queue dual_gpu --docker --foreground, and if that doesn't work, try quoting it: trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground

  				
Posted 4 years ago by WackyRabbit7

You are right. I have only 2 GPUs right now, so basically I can launch with --gpus all and it will work,
but I want to create the scripts for longer-term use (deploying on larger machines with more GPUs).

docker:
Client: Docker Engine - Community
Version: 19.03.6
API version: 1.40
Go version: go1.12.16
Git commit: 369ce74a3c
Built: Thu Feb 13 01:27:49 2020
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 19.03.6
API version: 1.40 (minimum version 1.12)
Go version: go1.12.16
Git commit: 369ce74a3c
Built: Thu Feb 13 01:26:21 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.12
GitCommit: 35bd7a5f69c13e1563af8a93431411cd9ecf5021
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683

os version:
lv-beast@lv-beast:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"

Driver Version: 440.100

  				
Posted 4 years ago by RattySeagull0
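For the longer-term scripts mentioned above, one possible shape is a small generator that prints one daemon command per group of GPUs. This is only a sketch: the function name is made up, it reuses the exact flags from the command in the question, and it assumes the agent version in use accepts a multi-GPU --gpus value in docker mode, which is the very thing failing in this thread.

```python
def daemon_commands(total_gpus, gpus_per_worker, queue):
    """Return one trains-agent daemon command per full group of GPUs."""
    commands = []
    # Step through GPU indices in groups; a trailing partial group is skipped.
    for start in range(0, total_gpus - gpus_per_worker + 1, gpus_per_worker):
        ids = ",".join(str(i) for i in range(start, start + gpus_per_worker))
        commands.append(
            f"trains-agent daemon --gpus {ids} --queue {queue} --docker --foreground"
        )
    return commands


for cmd in daemon_commands(total_gpus=4, gpus_per_worker=2, queue="dual_gpu"):
    print(cmd)
# trains-agent daemon --gpus 0,1 --queue dual_gpu --docker --foreground
# trains-agent daemon --gpus 2,3 --queue dual_gpu --docker --foreground
```

Each command runs in the foreground, so each would need its own terminal or process supervisor.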

You can try running trains-agent with --gpus all if you have two GPUs on your machine.

The flag --gpus all assigns all available GPUs to the docker container.

Did you get the same error message? What do you have in the error under 'device=XXX'?

Which OS, docker version, and Nvidia driver version do you have on the machine?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Maybe it's possible to work around this by setting NVIDIA_VISIBLE_DEVICES somehow and then using --gpus all?

  				
Posted 4 years ago by RattySeagull0

Thanks for the help!
I have now tried:
trains-agent daemon --gpus "0,1" --queue dual_gpu --docker --foreground

but I get the same error when I execute the training.

  				
Posted 4 years ago by RattySeagull0

Hi RattySeagull0,

Can you try quoting the GPU numbers, like --gpus "0,1"?
It looks like the same issue as https://github.com/allegroai/trains-agent/issues/35

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)
