Hi, I Need Your Help Setting Up A Trains Agent Running In Docker. I Have A Python Script Calling Wget As A System Command Which Runs Fine On My Dev Engine. When Cloning The Experiment And Scheduling It Into The Services Queue I Get An Error That The Call

Answered

Hi,
I need your help setting up a trains agent running in docker.
I have a Python script calling wget as a system command, which runs fine on my dev engine.
When cloning the experiment and scheduling it into the services queue I get an error that the call failed:
sh: 1: wget: not found
I entered the docker container and installed wget but the result is still the same.
So it seems that inside docker, when running a queued experiment, another docker image is being used to run the experiment in, and this one is lacking wget.
I thought about creating my own docker image based on an official docker image from allegroai, adding wget and overwriting the entrypoint script to create a new queue to listen on, but I'm not sure
which official image I should use since there are multiple provided by allegroai like

Unfortunately the Dockerfiles have not been published on Docker Hub.
On GitHub I found one agent Dockerfile in a subfolder:
https://github.com/allegroai/trains-agent/tree/master/docker
Maybe this is the one to be used.
The question basically is: in which container is an experiment being launched, and do I need to provide an enhanced docker image, or is it configurable to add additional software packages?
Could I install the needed Ubuntu package from within the script I'm running? I expect the user running it will not have sufficient privileges to do so.
So,
a) How do I create docker images usable as a trains-agent that creates and listens on additional queues? Can I just copy my trains.conf to the workdir inside the docker image?
b) Is there a way to add additional software packages to the environment an experiment will run in?
thanks
Wasili

  				
Posted 5 years ago by WickedGoat98


Answers 23

Hi WickedGoat98
A few background notions:
Docker containers do not store their state, so if you install something inside a container, it is gone the moment you leave, and the next time you start the same image you start from the same initial setup. (This is a great feature of Docker.)
It seems the docker image you are using is missing wget. You could build a new image (see the Docker website for more details on how to use a Dockerfile).
The way trains-agent works with docker is that it installs everything you need inside the container. If, for example, you always want to have wget installed (or actually use it), you can tell trains-agent to run a specific set of bash commands when it sets up the container. See here: https://github.com/allegroai/trains-agent/blob/216b3e21790659467007957d26172698fd74e075/trains_agent/backend_api/config/default/agent.conf#L147

  				
Posted 5 years ago by AgitatedDove14

Okay, so basically set a template for the pod, specifying the docker image. Make sure you pass the correct trains-server configuration (i.e. api/web/file server addresses and credentials), and select the queue name the agent will listen to.

container image / details
https://hub.docker.com/r/allegroai/trains-agent

https://github.com/allegroai/trains-agent/tree/master/docker/agent

Full environment variable list to pass can be found here:
https://github.com/allegroai/trains-server/blob/953124aa37dcf497297ca8fa62f0e6ba405cc83b/docker-compose.yml#L120

  				
Posted 4 years ago by AgitatedDove14

WickedGoat98 Basically you have two options:
a) Build a docker image with wget installed, then in the UI specify this image as "Base Docker Image".
b) Configure the trains.conf file on the machine running the trains-agent, with the above script. This will cause trains-agent to install wget on any container it is running, so it is available for you to use (saving you the trouble of building your own container).
With either of these two, by the time your code is executed, wget is installed and you will be able to call it with an os.system call.
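Whichever option you pick, it can help to have the script fail with a clear message when wget is missing, instead of the bare "sh: 1: wget: not found". A minimal sketch, assuming the script is free to use subprocess in place of os.system; the helper names require_tool and download are made up for illustration and are not part of trains:

```python
import shutil
import subprocess


def require_tool(name: str) -> str:
    """Return the full path of an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        # Hypothetical error text; it only points at the two options above.
        raise RuntimeError(
            f"{name!r} not found on PATH; install it in the base docker image "
            "or via agent.docker_preprocess_bash_script"
        )
    return path


def download(url: str, dest: str) -> None:
    """Download url to dest with wget, raising on a non-zero exit code."""
    wget = require_tool("wget")
    # check=True turns a failed wget call into a CalledProcessError
    subprocess.run([wget, "-O", dest, url], check=True)
```

Calling download(...) inside the experiment then either works or stops with a message that says which tool is missing in the container.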
What do you think?

  				
Posted 5 years ago by AgitatedDove14

One last thing: make sure you spin up the pod container in privileged mode, because the trains-agent docker will spin up a sibling docker for your actual experiment.

  				
Posted 4 years ago by AgitatedDove14

AgitatedDove14 I'm not sure how to make use of such a config or where to add it.
Is it to be added to the docker image when building my own, or can I set it in the Web GUI as a property of the experiment I cloned? Or should it be added to the original script, and if so, what type of variable is 'agent'?

  				
Posted 5 years ago by WickedGoat98

Yes

  				
Posted 4 years ago by WickedGoat98

AgitatedDove14 I tried editing the ~/trains.conf on the system where I start the dockerized trains server & agent, but without success.
I tried adding the script you provided inside the api and sdk scopes as well as outside everything; the result is still the same, wget is missing :(
api{ ... <here> } sdk{ ... <here> } <and here>
I'm quite sure I need to edit the trains file inside a docker container, but this will be part of the image, and even if I were able to change it, it is not the solution I'm looking for.
Might it be possible to place a trains.conf in the mapped local folder containing the filesystem and mongodb data etc., e.g. /opt/trains, as https://allegro.ai/docs/deploying_trains/trains_server_linux_mac/ proposes?

update:
I tried to add a trains.conf in /opt/trains/conf
with the content

agent.docker_preprocess_bash_script = [ "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean", "apt-get update", "apt-get install -y wget", "echo \"we have wget\"", ]
inside and outside the api{} scope, without success 😞

  				
Posted 5 years ago by WickedGoat98

AgitatedDove14 I still do not understand how I can deploy the trains-agent docker image to my trains-server installation so that the 'default' queue will be handled.
Once I can do this, it should not be a big deal to add additional workers for more queues.
I found a template for k8s, but as I'm quite new to Kubernetes I don't know how to use it.
As I use Rancher I'm even able to edit the trains-agent deployment. I added an additional command to handle the default queue as well, but it does not seem to do so.
/bin/sh -c apt-get update ; apt-get install -y curl python3-pip git; curl -sSL | sh ; python3 -m pip install -U pip ; python3 -m pip install trains-agent ; TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker nvidia/cuda --force-current-version ; TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon default --docker nvidia/cuda --force-current-version
I know that even if it worked, it would be overwritten the next time I upgrade trains with helm.

Can you tell me how to get a trains-agent running as a worker on a specific queue?

  				
Posted 4 years ago by WickedGoat98

For example:
agent.docker_preprocess_bash_script = [ "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean", "apt-get update", "apt-get install -y wget", "echo \"we have wget\"", ]

  				
Posted 5 years ago by AgitatedDove14

Thanks, I will try to update the trains.conf on the weekend.

  				
Posted 5 years ago by WickedGoat98

I have been able to make use of
image: allegroai/trains-agent:latest
in the docker-compose.yml file 🎉
Now I will focus on getting it working on Rancher.
Stay tuned.

  				
Posted 4 years ago by WickedGoat98

WickedGoat98 sorry, I missed the thread...

that the trains.conf has to be located on the node running the trains-agent.

Correct 🙂
The easiest way to check is to see if you can curl to the ip:port from inside the docker container.
If that fails, it is probably the wrong IP.
The IP you need to use is the IP of the machine running the docker-compose (not the IP of the docker container inside that machine).
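If curl is not available inside the container, the same reachability check can be sketched in Python. This is only a sketch: can_connect is a hypothetical helper, and a successful TCP connection shows that something is listening on that ip:port, not that it is actually the trains API server.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False
```

For example, call can_connect with the IP of the machine running the docker-compose and port 8008 (the API port from the localhost:8008 address mentioned in this thread); a False result points to a wrong IP or a blocked port.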
Make sense ?

  				
Posted 5 years ago by AgitatedDove14

trains-agent should be deployed to GPU instances, not to the trains-server.
The purpose of trains-agent is to let you send jobs to a GPU instance (at least in most cases).
The "trains-server" is a control plane, basically telling the agent what to run (by storing the execution queue and tasks). Make sense?

  				
Posted 4 years ago by AgitatedDove14

👍

  				
Posted 4 years ago by WickedGoat98

I think I understand now that the trains.conf has to be located on the node running the trains-agent.
When starting an additional trains-agent that is not instantiated by docker-compose, so it is not part of the same network, I get problems finding the api_server. localhost:8008 for sure will not be it. I identified the IP of the server running in docker with docker inspect ... and edited ~/trains.conf to use it, but unfortunately it still cannot find the apiserver 😞
(py38) wgo@NVidia-power:~/dev/allegro.ai$ docker inspect 3c20d2c2fe6e | grep -niE 'apiserver|IPAddress'
154: "TRAINS_API_HOST= ",
206: "SecondaryIPAddresses": null,
212: "IPAddress": "",
227: "IPAddress": "192.168.208.7",
(py38) wgo@NVidia-power:~/dev/allegro.ai$ trains-agent daemon --services-mode --detached --queue test --create-queue --docker ubuntu:18.04 --foreground
^C(py38) wgo@NVidia-power:~/dev/allegro.ai$ trains-agent daemon --services-mode --detached --queue test --create-queue --docker ubuntu:18.04 --foreground

trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server ?

(py38) wgo@NVidia-power:~/dev/allegro.ai$

  				
Posted 5 years ago by WickedGoat98

AgitatedDove14 regarding the credentials, will I need to take them out of my trains.conf, or might it be common practise to create a user for such pods instantiating additional workers listening on queues?

  				
Posted 4 years ago by WickedGoat98

you want to use k8s ?

  				
Posted 4 years ago by AgitatedDove14

Thanks a lot. I will let you know if I managed it :)

  				
Posted 4 years ago by WickedGoat98

AgitatedDove14 today I managed to run what I couldn't a month ago :)
I didn't correctly understand what you wrote me back then.
The issue I had was that wget was missing in the trains-agent image, so I was not able to run a system call of wget.
Now I managed to do so, based on the input you gave me, by adding
agent.docker_preprocess_bash_script = [...]
to my trains.conf, and it worked out of the box 🙂
Basically this issue was the reason why I started learning how to create a Kubernetes cluster, run Trains in it, ...
I thought I needed to create a docker image that already includes the wget package and services a queue...
But with this agent config option that is not mandatory.
Nevertheless I will keep working toward being able to run my own trains-agent servicing my own queues, since I guess it might be needed in the future ;)

  				
Posted 4 years ago by WickedGoat98

WickedGoat98
Put the agent.docker_preprocess_bash_script in the root of the file (i.e. you can just add the entire thing at the top of the trains.conf)

Might it be possible that I can place a trains.conf in the mapped local folder containing the filesystem and mongodb data etc e.g.

I'm assuming you are referring to the trains-agent services; if this is the case, sure you can.
Edit your docker-compose.yml, and under line https://github.com/allegroai/trains-server/blob/b93591ec322662156eab1ef90cf8151b81149488/docker-compose.yml#L142 add:
- /opt/trains/trains.conf:/root/trains.conf
Now you can edit the trains.conf on the host machine at /opt/trains/trains.conf

  				
Posted 5 years ago by AgitatedDove14

Nice!!!

  				
Posted 4 years ago by AgitatedDove14

AgitatedDove14 ok, but how to deploy a trains-agent?

  				
Posted 4 years ago by WickedGoat98

WickedGoat98

for such pods instantiating additional workers listening on queues

I would recommend creating a "devops" user and sharing its credentials across all agents. Sounds good?

EDIT:
There is no limit on the number of users on the system, so log in as a new one and create credentials in the "profile" page :)

  				
Posted 4 years ago by AgitatedDove14
