Answered

Hi, I run the Trains server in a Docker container and have started making use of tasks ...
My tests are shown on the Projects dashboard, which is really cool.
What I haven't found so far is a way to clean up the system from the tests I did. I'm able to archive the experiments, but is there also a way to delete them, or to wipe complete projects?
Another point: in the Workers & Queues view, GPU usage is not reported; only CPU usage is displayed. Do I need to configure the Docker image somehow to make the GPU load visible as well? On a shell I can see significant GPU usage with nvtop while running experiments, but nothing (not even 0 load) in the workers view.

  				
Posted 4 years ago by WickedGoat98

ok will read it later

  				
Posted 4 years ago by WickedGoat98

I ran a local (not dockerized) trains-agent:
trains-agent daemon --queue training --create-queue --foreground
which enabled me to see the GPU load in the corresponding view 🙂

Now I have another issue.
It seems that when cloning an experiment, a virtual environment is created with all the modules that were identified as being used, and the experiment runs inside this environment.
Am I right?
Is this the case only for clones?

In my Python code I'm trying to read a pandas table which I stored in parquet format. Unfortunately, when running the clone (with a changed parameter) I get an exception caused by a missing package:

```
    raise ImportError(
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:

Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
```
I also had this on my development system when I started using the parquet format (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet). Pandas needs a backend installed that can handle the parquet format. What I'm using locally is fastparquet (https://fastparquet.readthedocs.io/en/latest/install.html), which is loaded on demand by pandas. So I haven't added an `import fastparquet` explicitly in the code (I will do this soon to see if it resolves the exception).
But I wondered why the exception is raised only in cloned experiments.
While writing this, I think I understand it now. Running a script locally uses whatever has been installed locally, and by instantiating a task the streams are redirected, configurations are analyzed and stored, ...
When experiments are cloned, they are reconstructed from this information and run in an isolated environment. If required packages have not been identified as such, they are missing ...

Well, really cool stuff, this Trains product 👍
Looking forward to diving deeper into it
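
The failure mode described above (an optional backend that pandas loads on demand, so the package analysis never sees it) can be checked for up front. The following is a minimal sketch, not part of Trains or pandas; the helper name `available_parquet_engines` is hypothetical. It only asks Python whether the two backends named in the error message are importable in the current environment.

```python
import importlib.util


def available_parquet_engines(candidates=("pyarrow", "fastparquet")):
    """Return the candidate module names that can be imported here.

    Hypothetical helper for illustration: it checks importability only
    and does not verify that the installed version suits pandas.
    """
    return [
        name
        for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    engines = available_parquet_engines()
    if engines:
        print("parquet backends found:", ", ".join(engines))
    else:
        print("no parquet backend installed in this environment")
```

If the list comes back empty in the environment built for the clone, an explicit `import fastparquet` (or `import pyarrow`) in the script is one way to make the dependency visible to the package detection.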

  				
Posted 4 years ago by WickedGoat98

After adding the
import fastparquet
statement to the code, the reconstruction of a clone is working:
```
Summary - installed python packages:
...

fastparquet==0.4.1
...
Environment setup completed successfully
Starting Task Execution:
...
modeller.py: error: the following arguments are required: --algorithm
```
Unfortunately it raises the next issue.
If the script expects to get parameters via the command line (which in Trains experiments are identified and stored as parameters when using argparse), it fails to start 😞
I'm sure you have a solution for this.
I could add an option enabling Trains to provide the parameters after command-line parsing, but how are the parameters passed to the script?

  				
Posted 4 years ago by WickedGoat98

WickedGoat98
The trains-agent-services docker is always CPU-only; the idea is to put long-lasting services there (like the auto cleanup, Slack integration, HPO, etc.).
To spin up an agent with GPU on any machine (regardless of where the trains-server is), you can check the trains-agent readme.
https://github.com/allegroai/trains-agent#running-the-trains-agent

  				
Posted 4 years ago by AgitatedDove14

How can I enable TensorBoard and have the graphs stored in Trains?

  				
Posted 4 years ago by WickedGoat98

Regarding the clean-up service, do I need to run this as a cron job, or does the Trains server support some kind of add-on mechanism where I need to copy the script to?

  				
Posted 4 years ago by WickedGoat98

Sorry, but I don't understand how the cloned experiment is provided with parameters.
A task which is cloned by Trains might get its parameters via task.set_parameters(dict);
these parameters come from some magic analysis of the argparse used in the script.
AgitatedDove14, when is the call to set_parameters(...) performed? Is the argparse call somehow redirected so that it receives the data from Trains instead of getting it via sys.argv or wherever argparse gets it from? If so, why is my cloned experiment reporting missing mandatory arguments?
Starting Task Execution:
TRAINS results page:
usage: modeller.py [-h] [-v VERBOSE] [-s MONGODB_SERVER] [-a ASSET] [-d DATABASE] [-f FEATURE_COLLECTION] [-t TARGET_COLLECTION] [-c NR_CORES] [-m MODEL_ROOT] --algorithm ALGORITHM [ALGORITHM ...] [--use_trains] [--epochs EPOCHS] [--tracing]
modeller.py: error: the following arguments are required: --algorithm

  				
Posted 4 years ago by WickedGoat98

Well, I managed to clone an experiment and adapt its parameters on the Trains server via the browser.
If argparse is used, no parameter may be defined as required. Instead, the script has to check after parsing whether something mandatory is missing and terminate if so.
Doing so worked fine for me 😁 at least for this part of the work. Now fastparquet and missing packages are failing again...
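
The workaround described above (no `required=True`, validate after parsing) could look roughly like this. It is a minimal sketch using only the standard library; the option names `--algorithm` and `--epochs` are taken from the usage message earlier in the thread, while the function names and the default of 10 epochs are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Parser with no arguments marked as required at parse time."""
    parser = argparse.ArgumentParser(prog="modeller.py")
    # Previously "required"; now optional with a None default so that
    # parsing an empty command line does not abort immediately.
    parser.add_argument("--algorithm", nargs="+", default=None)
    # The default value is a placeholder, not taken from the thread.
    parser.add_argument("--epochs", type=int, default=10)
    return parser


def parse_and_validate(argv=None):
    """Parse first, then enforce mandatory values in the script itself."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.algorithm:
        # parser.error prints the usage text and exits with status 2.
        parser.error("the following arguments are required: --algorithm")
    return args


if __name__ == "__main__":
    print(parse_and_validate())
```

The check runs after `parse_args`, so it sees whatever values the parser ended up with, regardless of whether they came from the command line or were supplied some other way.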

  				
Posted 4 years ago by WickedGoat98

Another question I have: are the trained models stored in MongoDB or in the file system (I guess they are stored somewhere), and which format is used?

  				
Posted 4 years ago by WickedGoat98

Hi WickedGoat98

but is there also a way to delete them, or wipe complete projects?

https://github.com/allegroai/trains/issues/16

Auto cleanup service here:
https://github.com/allegroai/trains/blob/master/examples/services/cleanup/cleanup_service.py

  				
Posted 4 years ago by AgitatedDove14

btw: at https://allegro.ai/docs/task.html#task.Task.enqueue the link to the 'Use Case Examples' is broken

  				
Posted 4 years ago by WickedGoat98

Hi Martin,
you are right. The trains-agent is running with the --cpu-only option:
(py38) wgo@NVidia-power:~/dev/catwalk$ docker ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
b99d5103a43c  allegroai/trains-agent-services:latest  "/usr/agent/entrypoi…"  2 days ago  Up 2 days    trains-agent-services
16d20b75acf9  allegroai/trains:latest  "/opt/trains/wrapper…"  2 days ago  Up 2 days  8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp  trains-webserver
205af33b09b1  allegroai/trains:latest  "/opt/trains/wrapper…"  2 days ago  Up 2 days  0.0.0.0:8008->8008/tcp, 8080-8081/tcp  trains-apiserver
695f57cd5b16  allegroai/trains:latest  "/opt/trains/wrapper…"  2 days ago  Up 2 days  8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp  trains-fileserver
9e85517ec9f7  redis:5.0  "docker-entrypoint.s…"  2 days ago  Up 2 days  0.0.0.0:6379->6379/tcp  trains-redis
9719ab098a42  docker.elastic.co/elasticsearch/elasticsearch:7.6.2  "/usr/local/bin/dock…"  2 days ago  Up 2 days  0.0.0.0:9200->9200/tcp, 9300/tcp  trains-elastic
17f250415e92  mongo:3.6.5  "docker-entrypoint.s…"  2 days ago  Up 2 days  0.0.0.0:27017->27017/tcp  trains-mongo
(py38) wgo@NVidia-power:~/dev/catwalk$ docker exec -it b99d5103a43c bash
root@b99d5103a43c:/usr/agent# ps ax
PID  TTY  STAT  TIME  COMMAND
1  ?  Ss  0:00  /bin/sh /usr/agent/entrypoint.sh
11  ?  Sl  18:39  /usr/bin/python3 /usr/local/bin/trains-agent daemon --services-mode --queue services --create-queue --docker ubuntu:18.04 --cpu-only
17  pts/0  Ss  0:00  bash
31  pts/0  R+  0:00  ps ax
root@b99d5103a43c:/usr/agent#
I followed the instructions at https://allegro.ai/docs/deploying_trains/trains_server_linux_mac/ for running it in Docker.
Unfortunately I can't find any info on how to configure the container.

  				
Posted 4 years ago by WickedGoat98

ok thanks, will need to run some tests later

  				
Posted 4 years ago by WickedGoat98

Another point I see is, that in the workers & queues view the GPU usage is not been reported

It should be reported. If it is not, maybe you are running the trains-agent in CPU mode? (Try adding --gpus.)

  				
Posted 4 years ago by AgitatedDove14

models been trained stored ...

MongoDB stores URL links; the upload itself is controlled via the "output_uri" argument to the Task.
If None is provided, Trains logs the locally stored model (i.e. a link to where you stored your model). If you provide one, Trains will automatically upload the model (into a new subfolder) and store the link to that subfolder.

how can I enable the tensorboard and have the graphs been stored in trains?

Basically, if you call Task.init, all your TB output is automatically also logged by Trains (obviously you still have the TB files locally).
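
As a rough illustration of the two points above, the sketch below shows one way the `output_uri` argument might be passed to `Task.init`. The helper names `build_task_kwargs` and `start_task` are hypothetical, as are any project, task, or destination values passed in; only `Task.init` and its `project_name`, `task_name`, and `output_uri` arguments come from the `trains` package, which must be installed and configured for `start_task` to work.

```python
def build_task_kwargs(project_name, task_name, output_uri=None):
    """Assemble keyword arguments for Task.init (hypothetical helper).

    output_uri is only included when given, so that leaving it out
    keeps the default behaviour of logging a link to the local model.
    """
    kwargs = {"project_name": project_name, "task_name": task_name}
    if output_uri is not None:
        kwargs["output_uri"] = output_uri
    return kwargs


def start_task(project_name, task_name, output_uri=None):
    """Create the task; needs the trains package and a reachable server."""
    # Imported lazily so this module can be loaded without trains.
    from trains import Task

    return Task.init(**build_task_kwargs(project_name, task_name, output_uri))
```

Calling `start_task(...)` near the top of the training script, before any TensorBoard writer is created, is the arrangement the answer above appears to describe.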

  				
Posted 4 years ago by AgitatedDove14

WickedGoat98 the mechanism of cloning and parameter overriding only works when the trains-agent launches the experiment. Think of it this way:
Manual execution: trains sends data to the server
Automatic (trains-agent) execution: trains pulls data from the server
This applies to argparse as well as to connect and connect_configuration.
The trains code itself acts differently when it is executed from the 'trains-agent' context.
Does that help clear things up?
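
The push/pull distinction above can be pictured with a toy model. This is not Trains code and does not reflect its internals; the function `resolve_parameters` and the parameter values used with it are invented purely to illustrate the direction of data flow in the two modes.

```python
def resolve_parameters(script_values, server_values, launched_by_agent):
    """Toy model of the two execution modes (not the Trains implementation).

    Returns a tuple: (values the script runs with, values on the server).
    """
    if launched_by_agent:
        # Automatic execution: values stored on the server (for example,
        # edited in the web UI after cloning) override the script's own.
        effective = dict(script_values)
        effective.update(server_values)
        return effective, dict(server_values)
    # Manual execution: the script's values are used as they are and
    # are sent to the server for later cloning.
    return dict(script_values), dict(script_values)


if __name__ == "__main__":
    local = {"algorithm": None, "epochs": 10}
    edited = {"algorithm": ["svm"], "epochs": 50}
    print(resolve_parameters(local, {}, launched_by_agent=False))
    print(resolve_parameters(local, edited, launched_by_agent=True))
```

In this picture, a cloned experiment only receives its edited parameters when an agent runs it, which matches the behaviour reported earlier in the thread.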

  				
Posted 4 years ago by AgitatedDove14

thanks Martin

  				
Posted 4 years ago by WickedGoat98
