I have been able to make use of
image: allegroai/trains-agent:latest
in the docker-compose.yml file 🎉
Now I will focus on getting it working on Rancher.
Stay tuned!
WickedGoat98
for such pods instantiating additional workers listening on queues
I would recommend creating a "devops" user and sharing its credentials across all agents. Sounds good?
EDIT:
There is no limit on the number of users on the system, so log in as a new one and create credentials in the "profile" page :)
AgitatedDove14 regarding the credentials, will I need to take them out of my trains.conf, or might it be common practice to create a user for such pods instantiating additional workers listening on queues?
Thanks a lot. I will let you know if I managed it :)
One last thing: make sure you spin up the pod container in privileged mode, because the trains-agent docker will spin a sibling docker for your actual experiment.
Okay, so basically set a template for the pod, specifying the docker image. Make sure you pass the correct trains-server configuration (i.e. api/web/file server addresses and credentials), and select the queue name the agent will listen to.
container image / details
https://hub.docker.com/r/allegroai/trains-agent
https://github.com/allegroai/trains-agent/tree/master/docker/agent
Full environment variable list to pass can be found here:
https://github.com/allegroai/trains-server/blob/953124aa37dcf497297ca8fa62f0e6ba405cc83b/docker-compose.yml#L120
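Put together, a pod template along those lines might look roughly like this. This is only a sketch: the env-var names are taken from the docker-compose file linked above and should be double-checked there, and all addresses and keys are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trains-agent
spec:
  containers:
    - name: trains-agent
      image: allegroai/trains-agent:latest
      # privileged mode: the agent spins a sibling docker for the experiment
      securityContext:
        privileged: true
      env:
        - name: TRAINS_API_HOST
          value: "http://<server-ip>:8008"
        - name: TRAINS_WEB_HOST
          value: "http://<server-ip>:8080"
        - name: TRAINS_FILES_HOST
          value: "http://<server-ip>:8081"
        - name: TRAINS_API_ACCESS_KEY
          value: "<devops-user-access-key>"
        - name: TRAINS_API_SECRET_KEY
          value: "<devops-user-secret-key>"
```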
AgitatedDove14 ok, but how to deploy a trains-agent?
trains-agent should be deployed to GPU instances, not the trains-server.
The trains-agent's purpose is to let you send jobs to a GPU instance (in most cases, at least).
The trains-server is the control plane, basically telling the agent what to run (by storing the execution queues and tasks). Make sense?
AgitatedDove14 I still do not understand how I can deploy the trains-agent docker image to my trains-server installation so that the 'default' queue will be handled.
Once I can do this, it should not be a big thing to add additional workers for more queues.
I found a template for k8s but as I'm quite new to Kubernetes I don't know how to use it.
As I use Rancher I'm even able to edit the trains-agent deployment. I added an additional command to handle the default queue as well, but it seems not to do so:
/bin/sh -c apt-get update ; apt-get install -y curl python3-pip git; curl -sSL | sh ; python3 -m pip install -U pip ; python3 -m pip install trains-agent ; TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker nvidia/cuda --force-current-version ; TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon default --docker nvidia/cuda --force-current-version
I know that even if it worked, it would be overwritten the next time I upgrade trains via helm.
Can you tell me how to get a trains-agent running as a worker on a specific queue?
AgitatedDove14 today I managed to run what I couldn't a month before :)
I didn't correctly understand what you wrote me back then.
The issue I had was that wget was missing in the trains-agent image, so I was not able to run a system call of wget.
Now I managed to do so, based on the input you gave me, by adding the agent.docker_preprocess_bash_script = [...]
in my trains.conf, and it worked out of the box 🙂
Basically this issue was the reason why I started learning how to create a Kubernetes cluster, run Trains in it, ...
I thought I needed to create a docker image that already includes the wget package in order to serve a queue...
But this is not mandatory thanks to the agent's config option.
Nevertheless I will keep working toward running my own trains-agent services on their own queues, since I guess it will be needed in future ;)
WickedGoat98 sorry, I missed the thread...
that the trains.conf has to be located on the node running the trains-agent.
Correct 🙂
The easiest way to check is to see if you can curl the ip:port from inside the docker.
If it fails, it is probably the wrong IP.
The IP you need to use is the IP of the machine running the docker-compose (not the IP of the docker container inside that machine).
Make sense ?
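As a quick supplement to the curl check, the same reachability test can be scripted. A minimal sketch, assuming nothing beyond the standard library (the helper name `can_reach` is mine, and the host/port below are placeholders for the machine running docker-compose):

```python
import socket

def can_reach(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# replace with the IP of the machine running docker-compose;
# 8008 is the default apiserver port from the trains docker-compose
print(can_reach("127.0.0.1", 8008))
```

If this prints False, the agent's api_server setting points at the wrong address, which matches the Connection Error seen below.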
I think I understand now, that the trains.conf has to be located on the node running the trains-agent.
When starting an additional trains-agent not instantiated by docker-compose, so it is not part of the same network, I get problems finding the api_server. localhost:8008 for sure will not work. I identified the IP of the server running in docker with docker inspect ... and edited ~/trains.conf using it, but unfortunately it still cannot find the apiserver 😞
(py38) wgo@NVidia-power:~/dev/allegro.ai$ docker inspect 3c20d2c2fe6e | grep -niE 'apiserver|IPAddress'
154: "TRAINS_API_HOST= ",
206: "SecondaryIPAddresses": null,
212: "IPAddress": "",
227: "IPAddress": "192.168.208.7",
(py38) wgo@NVidia-power:~/dev/allegro.ai$ trains-agent daemon --services-mode --detached --queue test --create-queue --docker ubuntu:18.04 --foreground
^C(py38) wgo@NVidia-power:~/dev/allegro.ai$ trains-agent daemon --services-mode --detached --queue test --create-queue --docker ubuntu:18.04 --foreground
trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server ?
(py38) wgo@NVidia-power:~/dev/allegro.ai$
WickedGoat98
Put the agent.docker_preprocess_bash_script
in the root of the file (i.e. you can just add the entire thing at the top of the trains.conf)
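Concretely, "root of the file" means the entry sits outside the api{} and sdk{} sections. With the script quoted later in this thread, the top of trains.conf would look like this sketch:

```
# top of trains.conf -- root level, NOT inside api { } or sdk { }
agent.docker_preprocess_bash_script = [
    "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
    "apt-get update",
    "apt-get install -y wget",
    "echo \"we have wget\"",
]

api {
    # existing api settings stay as they are
}
```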
Might it be possible that I can place a trains.conf in the mapped local folder containing the filesystem and mongodb data etc e.g.
I'm assuming you are referring to the trains-agent services; if this is the case, sure you can.
Edit your docker-compose.yml, and under line https://github.com/allegroai/trains-server/blob/b93591ec322662156eab1ef90cf8151b81149488/docker-compose.yml#L142 add:
- /opt/trains/trains.conf:/root/trains.conf
Now you can edit the trains.conf on the host machine at /opt/trains/trains.conf
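In context, the service's volume list would then look roughly like this. This is a sketch: the surrounding keys are abbreviated and the service section should match what is actually at that line of your docker-compose.yml:

```yaml
  agent-services:
    # ...existing image/environment settings...
    volumes:
      # existing mounts stay as they are, plus the new line:
      - /opt/trains/trains.conf:/root/trains.conf
```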
AgitatedDove14 I tried editing the ~/trains.conf on the system where I start the dockerized trains server & agent, but without success.
I tried to add the script you provided inside the api and sdk scopes as well as outside everything; the result is still the same, wget is missing :(
api { ... <here> }
sdk { ... <here> }
<and here>
I'm quite sure I would need to edit the trains file inside a docker container, but even if I were able to change it there, that is not the solution I'm looking for.
Might it be possible that I can place a trains.conf in the mapped local folder containing the filesystem and mongodb data etc e.g. /opt/trains as the https://allegro.ai/docs/deploying_trains/trains_server_linux_mac/ proposes?
update:
I tried to add a trains.conf in /opt/trains/conf
with the content
agent.docker_preprocess_bash_script = [
    "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
    "apt-get update",
    "apt-get install -y wget",
    "echo \"we have wget\"",
]
inside and outside the api{} scope, without success 😞
Thanks, I will try over the weekend to update the trains.conf.
WickedGoat98 Basically you have two options:
1. Build a docker image with wget installed, then in the UI specify this image as the "Base Docker Image".
2. Configure the trains.conf file on the machine running the trains-agent with the above script. This will cause trains-agent to install wget in any container it runs, so it is available for you to use (saving you the trouble of building your own container).
With either of these two, by the time your code is executed, wget is installed and you will be able to call it with an os.system call.
What do you think?
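Once either option is in place, the experiment code can call wget directly via os.system. A minimal sketch (the helper name `have` and the messages are mine; the URL in the comment is a placeholder for whatever the task actually needs):

```python
import os
import shutil

def have(cmd):
    """True if `cmd` is available on PATH inside the running container."""
    return shutil.which(cmd) is not None

if have("wget"):
    # with wget installed by either option, a plain system call works, e.g.
    # os.system("wget -q https://example.com/data.bin -O /tmp/data.bin")
    print("wget available")
else:
    print("wget missing -- check the base image / preprocess script")
```

Failing fast like this makes it obvious whether the preprocess script actually ran, instead of getting a cryptic non-zero exit code from os.system.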
AgitatedDove14 not sure how to make use of such config / where to add it
Is it to be added to the docker image when generating my own, or can I set it in the Web GUI as a property of the experiment I cloned, or shall it be added in the original script? But then, what kind of variable is 'agent'?
For example:
agent.docker_preprocess_bash_script = [
    "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
    "apt-get update",
    "apt-get install -y wget",
    "echo \"we have wget\"",
]
Hi WickedGoat98
A few background notions:
1. Dockers do not store their state, so if you install something inside a docker, the moment you leave it is gone, and the next time you start the same docker you start from the same initial setup. (This is a great feature of dockers.)
2. It seems the docker you are using is missing wget. You could build a new docker (see the Docker website for more details on how to use a Dockerfile).
3. The way trains-agent works in dockers is that it installs everything you need inside the docker. If, for example, you always want to have wget, or maybe even use it, you can tell trains-agent to run a specific set of bash commands when it sets up the docker. See here:
https://github.com/allegroai/trains-agent/blob/216b3e21790659467007957d26172698fd74e075/trains_agent/backend_api/config/default/agent.conf#L147