Hi Martin,
you are right. The Trains-agent is running with option cpu-only
(py38) wgo@NVidia-power:~/dev/catwalk$ docker ps
CONTAINER ID   IMAGE                                                 COMMAND                  CREATED      STATUS      PORTS                                           NAMES
b99d5103a43c   allegroai/trains-agent-services:latest                "/usr/agent/entrypoi…"   2 days ago   Up 2 days                                                   trains-agent-services
16d20b75acf9   allegroai/trains:latest                               "/opt/trains/wrapper…"   2 days ago   Up 2 days   8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp   trains-webserver
205af33b09b1   allegroai/trains:latest                               "/opt/trains/wrapper…"   2 days ago   Up 2 days   0.0.0.0:8008->8008/tcp, 8080-8081/tcp           trains-apiserver
695f57cd5b16   allegroai/trains:latest                               "/opt/trains/wrapper…"   2 days ago   Up 2 days   8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp      trains-fileserver
9e85517ec9f7   redis:5.0                                             "docker-entrypoint.s…"   2 days ago   Up 2 days   0.0.0.0:6379->6379/tcp                          trains-redis
9719ab098a42   docker.elastic.co/elasticsearch/elasticsearch:7.6.2   "/usr/local/bin/dock…"   2 days ago   Up 2 days   0.0.0.0:9200->9200/tcp, 9300/tcp                trains-elastic
17f250415e92   mongo:3.6.5                                           "docker-entrypoint.s…"   2 days ago   Up 2 days   0.0.0.0:27017->27017/tcp                        trains-mongo
(py38) wgo@NVidia-power:~/dev/catwalk$ docker exec -it b99d5103a43c bash
root@b99d5103a43c:/usr/agent# ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /bin/sh /usr/agent/entrypoint.sh
   11 ?        Sl    18:39 /usr/bin/python3 /usr/local/bin/trains-agent daemon --services-mode --queue services --create-queue --docker ubuntu:18.04 --cpu-only
   17 pts/0    Ss     0:00 bash
   31 pts/0    R+     0:00 ps ax
root@b99d5103a43c:/usr/agent#
I followed the instructions on https://allegro.ai/docs/deploying_trains/trains_server_linux_mac/ running it in docker.
Unfortunately I can't find any info on how to configure the container.
- How can I enable TensorBoard and have the graphs stored in Trains?
Another point: in the Workers & Queues view, the GPU usage is not being reported.
It should be reported. If it is not, maybe you are running the trains-agent in CPU mode? (Try adding --gpus.)
Regarding the clean-up service: do I need to run this as a cron job, or does the trains-server support some kind of add-ons folder that I need to copy the script to?
I ran a local (not dockerized) trains-agent:
trains-agent daemon --queue training --create-queue --foreground
which enabled me to see the GPU load on the corresponding view 🙂
Now I got another issue.
It seems that when an experiment is cloned, a virtual environment is created with all the modules that were identified as being used, and the experiment runs inside this environment.
Am I right?
Is this the case only for clones?
In my Python code I'm trying to read a pandas table which I stored in Parquet format. Unfortunately, when running the clone (with a changed parameter) I get an exception caused by a missing package:
` raise ImportError(
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
- Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
- Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet. `
I also ran into this on my development system when I started using the Parquet format (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet). Pandas needs a backend installed that can handle the Parquet format. What I'm using locally is fastparquet (https://fastparquet.readthedocs.io/en/latest/install.html), which is loaded on demand by pandas. So I haven't added an `import fastparquet` explicitly in the code (I will do this soon to see if it resolves the exception).
But I wonder why the exception is raised only on cloned experiments.
While writing this, I think I understand it now. Running a script locally uses whatever has been installed locally, and by instantiating a task the streams are redirected, configurations are analyzed and stored, ...
When experiments are cloned, they are reconstructed from this information and run in an isolated environment. If needed packages have not been identified as such, they are missing ...
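To illustrate why a package can go undetected: pandas picks its Parquet backend at call time, so a script that never imports pyarrow or fastparquet by name gives an import-based scan nothing to find. A minimal sketch, assuming the package analysis is driven by the imports in the script (the helper name `available_parquet_engines` is made up for this example):

```python
import importlib.util

# The two backends named in the pandas error message, in the order they are listed
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES):
    """Return the candidate backends that are importable in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print("Parquet backends found:", ", ".join(engines))
else:
    print("No Parquet backend installed; pandas.read_parquet would raise ImportError")
```

Running this inside the agent's environment shows whether a backend made it into the reconstructed virtual environment. An explicit `import fastparquet` (or `import pyarrow`) at the top of the script makes the dependency visible by name.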
Well, really cool stuff, this Trains product 👍
Looking forward to diving deeper into it.
After adding the
import fastparquet
statement to the code, the reconstruction of a clone is working:
` Summary - installed python packages:
...
- fastparquet==0.4.1
...
Environment setup completed successfully
Starting Task Execution:
...
modeller.py: error: the following arguments are required: --algorithm `
Unfortunately it raises the next issue.
If the script being used expects to get parameters via the command line (which Trains identifies and stores as experiment parameters when argparse is used), it fails to start 😞
I'm sure you have a solution for this.
I could add an option enabling Trains to provide the parameters after command-line parsing, but how are the parameters fed to the script?
are the trained models stored ...
MongoDB will store URL links; the upload itself is controlled via the "output_uri" argument to the Task.
If None is provided, Trains logs the locally stored model (i.e. a link to where you stored your model). If you provide one, Trains will automatically upload the model (into a new subfolder) and store the link to that subfolder.
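A minimal sketch of passing "output_uri", assuming the `trains` package is installed and configured. The project name, task name and the shared-folder path are hypothetical placeholders, and the wrapper function `start_tracked_run` exists only for this example:

```python
def start_tracked_run(output_uri=None):
    """Create a Trains task; with output_uri=None only the local model path is logged."""
    # Imported lazily so the sketch can be read or imported without trains installed
    from trains import Task

    return Task.init(
        project_name="examples",         # hypothetical project name
        task_name="model upload demo",   # hypothetical task name
        output_uri=output_uri,           # destination for automatic model upload
    )


# Usage (hypothetical shared folder as the upload destination):
# task = start_tracked_run(output_uri="/mnt/shared/models")
```

With a destination set, models saved by the framework should be uploaded under a new subfolder of that location, as described above.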
- How can I enable TensorBoard and have the graphs stored in Trains?
Basically, if you call Task.init, everything you report to TensorBoard is automatically logged by Trains as well (obviously you still have the TB files locally).
ok thanks, will need to run some tests later
WickedGoat98 the mechanism of cloning and parameter overriding works only when the trains-agent is launching the experiment. Think of it this way:
Manual execution: trains sends data to server
Automatic (trains-agent) execution: trains pulls data from the server
This applies to argparse as well as to connect and connect_configuration.
The trains code itself acts differently when it is executed from the 'trains-agent' context.
Does that help clear things up?
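The push/pull distinction can be pictured with a toy model. This is not Trains code: the function, the dict standing in for the server, and the task IDs are all invented for illustration.

```python
def resolve_parameters(script_defaults, server_store, task_id, launched_by_agent):
    """Toy model of the two directions; this is not how Trains is implemented."""
    if launched_by_agent:
        # Agent run: values stored on the server override what the script parsed
        merged = dict(script_defaults)
        merged.update(server_store.get(task_id, {}))
        return merged
    # Manual run: the script's values are recorded on the server as-is
    server_store[task_id] = dict(script_defaults)
    return dict(script_defaults)


server = {}
# 1) Manual run: parameters are sent to the "server"
resolve_parameters({"epochs": 10}, server, "task-1", launched_by_agent=False)
# 2) Clone in the UI and edit a value (simulated by copying and changing the entry)
server["task-2"] = dict(server["task-1"], epochs=50)
# 3) Agent runs the clone: the script still says 10, the server value is used
params = resolve_parameters({"epochs": 10}, server, "task-2", launched_by_agent=True)
print(params)
```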
Sorry, but I don't understand how the cloned experiment is provided with parameters.
A task which is cloned by Trains might get its parameters via task.set_parameters(dict).
These parameters come from some magic analysis of the argparse usage in the script.
AgitatedDove14 when is the call to set_parameter(...) performed? Is the argparse call somehow redirected so that it receives the data from Trains instead of getting it via sys.argv or wherever argparse gets it from? If so, why is my cloned experiment reporting missing mandatory arguments?
Starting Task Execution:
TRAINS results page:
usage: modeller.py [-h] [-v VERBOSE] [-s MONGODB_SERVER] [-a ASSET] [-d DATABASE] [-f FEATURE_COLLECTION] [-t TARGET_COLLECTION] [-c NR_CORES] [-m MODEL_ROOT] --algorithm ALGORITHM [ALGORITHM ...] [--use_trains] [--epochs EPOCHS] [--tracing]
modeller.py: error: the following arguments are required: --algorithm
WickedGoat98
The trains-agent-services docker is always CPU-only; the idea is to put long-lasting services there (like the auto cleanup, Slack integration, HPO, etc.).
To spin up an agent with GPU on any machine (regardless of where the trains-server is), you can check the trains-agent readme:
https://github.com/allegroai/trains-agent#running-the-trains-agent
Hi WickedGoat98
but is there also a way to delete them, or wipe complete projects?
https://github.com/allegroai/trains/issues/16
Auto cleanup service here:
https://github.com/allegroai/trains/blob/master/examples/services/cleanup/cleanup_service.py
btw: at https://allegro.ai/docs/task.html#task.Task.enqueue the link to the 'Use Case Examples' is broken
Well, I managed to clone an experiment and adapt its parameters on the trains-server via the browser.
If argparse is used, no parameter may be defined as required. Instead, the script has to check after parsing and terminate itself if something mandatory is missing.
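A minimal sketch of that workaround using only argparse. The option names `--algorithm` and `--epochs` come from the usage message earlier in the thread; the `--epochs` type and default, the helper names, and the example values are assumptions made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="modeller.py")
    # Deliberately not required=True: the mandatory check happens after parsing
    parser.add_argument("--algorithm", nargs="+", default=None)
    parser.add_argument("--epochs", type=int, default=10)  # assumed type and default
    return parser


def parse_and_validate(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Manual check; parser.error() prints the usage text and exits with status 2
    if not args.algorithm:
        parser.error("the following arguments are required: --algorithm")
    return args


# Usage with hypothetical algorithm names
args = parse_and_validate(["--algorithm", "svm", "rf"])
print(args.algorithm, args.epochs)
```

Because the check runs after `parse_args` returns, parsing itself no longer fails on an empty command line, which is the situation reported above for the cloned experiment.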
Doing so worked fine for me 😁 at least for this part of work. Now fastparquet and missing packages are failing again...
Another question I have: are the trained models stored (I guess they are stored) in MongoDB or in the file system, and which format is used?