I'm quite new to Kubernetes. What I have found is that the ports I expected are used:
` root@vmd62521:~# kubectl get services -n trains
NAME                    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
mongo-service           ClusterIP   10.43.99.44    <none>        27017/TCP      25h
webserver-service       NodePort    10.43.49.21    <none>        80:30080/TCP   25h
redis                   ClusterIP   10.43.62.222   <none>        6379/TCP       25h
elasticsearch-service Clust...
the ports, I'm not sure about
api_server and web_server look ok
(py38) wgo@NVidia-power:~/dev/Trains/trains$ curl
{"meta":{"id":"bb5cd73435fb4127b9509ce3a771e95b","trx":"bb5cd73435fb4127b9509ce3a771e95b","endpoint":{"name":"","requested_version":1.0,"actual_version":null},"result_code":400,"result_spath /","error_stack":null},"data":{}}
(py38) wgo@NVidia-power:~/dev/Trains/trains$ curl
`
<!doctype html>
<html lang="en">
<head> <meta charset="utf-8"> <title>trains</title> <base href="/"> <meta name="vie...
also the webserver pod's log contains entries
redis, mongo and elasticsearch also look ok
the apiserver pod reports quite a lot
AgitatedDove14 I don't know why, but now it works
runfile('/home/wgo/dev/Trains/trains/examples/reporting/text_reporting.py', wdir='/home/wgo/dev/Trains/trains/examples/reporting')
TRAINS Task: overwriting (reusing) task id=b31459aa2d414ea7b5aaa8c467ee6ad3
This is standard error test
2020-12-12 11:51:44.841 | INFO | __main__:report_logs:26 - That's it, beautiful and simple logging! (using ANSI colors)
TRAINS results page:
`
reporting text logs
This is standard output test
hello, th...
AgitatedDove14 regarding the credentials, will I need to take them from my trains.conf, or is it common practice to create a dedicated user for such pods that instantiate additional workers listening on queues?
I have been able to make use of
image: allegroai/trains-agent:latest
in the docker-compose.yml file 🎉
now I will focus on getting it working on Rancher
stay tuned
but first I need to understand how parameters are processed. See my last question in my earlier thread: https://app.slack.com/client/TT9ATQXJ5/CTK20V944/thread/CTK20V944-1603740766.425000
I think I understand now that the trains.conf has to be located on the node running the trains-agent.
When I start an additional trains-agent that was not instantiated by docker-compose, and so is not part of the same network, I have problems finding the api_server. localhost:8008 will certainly not work. I identified the IP of the server running in Docker with docker inspect ... and edited ~/trains.conf to use it, but unfortunately it still cannot find the apiserver 😞
` (py38) wgo@NVidi...
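Before editing ~/trains.conf again, it may help to check whether the api_server port is reachable at all from the machine running the agent. The following is only a minimal sketch using the Python standard library; the function name `can_reach` is made up for this example, and the host passed in is a placeholder for whatever address `docker inspect` reported. Port 8008 is the one mentioned above.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


# Placeholder host: replace "localhost" with the address of the api_server.
print("api_server reachable:", can_reach("localhost", 8008))
```

A `False` here points to a network problem (wrong IP, port not published, firewall) rather than to the contents of trains.conf. A `True` only shows that something accepts TCP connections on that port, not that it is the Trains api_server.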
AgitatedDove14 today I managed to run what I couldn't a month ago :)
I hadn't correctly understood what you wrote to me back then.
The issue was that wget was missing from the trains-agent image, so I could not make a system call to wget.
Now I managed to do so, based on the input you gave me, by adding
agent.docker_preprocess_bash_script = [...]
to my trains.conf, and it worked out of the box 🙂
Basically this issue was the reason why I started learning how to create a Kube...
AgitatedDove14 not sure how to make use of such a config / where to add it
Should it be added to the Docker image when building my own, or can I set it in the Web GUI as a property of the experiment I cloned? Or should it be added to the original script, and if so, what type of variable is 'agent'?
Thanks, I will try to update the trains.conf on the weekend
regarding the clean-up service, do I need to run this as a cron job, or does the trains server support some kind of add-ons where I need to copy the script to?
another question I have: are the trained models stored (I guess they are) in MongoDB or in the file system, and which format is used?
ok thanks, will need to run some tests later
I ran a local (not dockerized) trains-agent
trains-agent daemon --queue training --create-queue --foreground
which enabled me to see the GPU load on the corresponding view 🙂
Now I got another issue.
It seems that when cloning an experiment, a virtual environment is created with all the modules identified as being used. The experiment runs inside this environment.
Am I right?
Is this the case only for clones?
In my Python code I'm trying to read a pandas table which I stored i...
Sorry, but I don't understand how the cloned experiment is provided with parameters.
A task which has been cloned by Trains might get its parameters via task.set_parameters(dict)
these parameters come from some magic analysis of the argparse used in the script.
AgitatedDove14 when is the call to set_parameters(...) performed? Is the argparse call somehow redirected so that it receives the data from Trains instead of getting them via sys.argv or wherever argparse is gettin...
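To make the question concrete: with plain argparse, values can come from a stored dict instead of sys.argv by overriding the parser defaults. The sketch below only illustrates that idea with the standard library; it is not a description of how Trains does it internally. The `--epochs` argument, the default values and the helper names are made up for the example; `--algorithm` is the argument of modeller.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="modeller.py")
    parser.add_argument("--algorithm", default="linear")    # hypothetical default
    parser.add_argument("--epochs", type=int, default=10)   # hypothetical argument
    return parser


def parse_with_overrides(argv, stored_params):
    """Parse argv, using stored_params as the new defaults."""
    parser = build_parser()
    # set_defaults replaces the defaults declared in add_argument;
    # values given explicitly on the command line still take precedence.
    parser.set_defaults(**stored_params)
    return parser.parse_args(argv)


# Empty command line: every value comes from the stored dict.
args = parse_with_overrides([], {"algorithm": "forest", "epochs": 3})
print(args.algorithm, args.epochs)
```

With this approach the script itself stays unchanged; only the source of the default values differs between a manual run and a run driven by stored parameters.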
after adding the
import fastparquet
statement to the code, the reconstruction of a clone is working
` Summary - installed python packages:
...
- fastparquet==0.4.1
...
Environment setup completed successfully
Starting Task Execution:
...
modeller.py: error: the following arguments are required: --algorithm `
unfortunately it raises the next issue.
If the script being used expects to get parameters via the command line (which in Trains experiments are identified and stored as parameters when using...
well I managed to clone an experiment and adapt its parameters on the trains server via the browser.
If argparse is used, no parameter may be defined as required. Instead, the script has to check the parameters after parsing and terminate itself if something mandatory is missing.
Doing so worked fine for me 😁 at least for this part of work. Now fastparquet and missing packages are failing again...
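The argparse workaround described above could look roughly like this. It is a minimal sketch, not the actual modeller.py: only the `--algorithm` argument comes from the error message, the function names and the exit code 2 are assumptions for illustration.

```python
import argparse
import sys


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="modeller.py")
    # Not marked required=True, so parsing succeeds even with an empty command line.
    parser.add_argument("--algorithm", default=None)
    return parser.parse_args(argv)


def missing_arguments(args):
    """Return the names of mandatory arguments that were not provided."""
    mandatory = ("algorithm",)
    return [name for name in mandatory if getattr(args, name) is None]


def main(argv=None) -> int:
    args = parse_args(argv)
    missing = missing_arguments(args)
    if missing:
        # The script decides to stop, instead of argparse failing during parsing.
        print("missing mandatory arguments: " + ", ".join(missing), file=sys.stderr)
        return 2
    print("running with algorithm", args.algorithm)
    return 0


# Demonstration: the first call reports the missing argument, the second one runs.
main([])
main(["--algorithm", "svm"])
```

In a real script the return value of `main()` would typically be passed to `sys.exit`.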
the log of the fileserver pod seems quite empty
` root@vmd62521:~# kubectl logs fileserver-6f49b74556-2m4n2 -n trains --all-containers
- Serving Flask app "fileserver" (lazy loading)
- Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
- Debug mode: off
root@vmd62521:~#
same for the agentservice
root@vmd62521:~# kubectl logs agentservices-56655788b6-rnbk4 apiserver-7d9cd59844-dfd5s -n train...
AgitatedDove14 I still do not understand how I can deploy the trains-agent docker image to my trains-server installation so that the 'default' queue will be handled.
Once I can do this, it should not be a big thing to add additional workers for more queues.
I found a template for k8s but as I'm quite new to Kubernetes I don't know how to use it.
As I use Rancher I'm able to even edit the trains-agent deployment. I added an additional command to handle the default queue as well, but it seems not ...
Thanks a lot. I will let you know if I managed it :)
AgitatedDove14 ok, but how to deploy a trains-agent?