The picture seemed to be missing.
Sorry, I tried but can't upload the picture here, so I added a link to it: https://drive.google.com/file/d/1HYYKDOY09hnE-DeCTPdZXpKy7537g5Ka/view?usp=sharing
Cool
I'm already impressed by what Trains does with just 2 lines of code
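(For context, the two lines are presumably the usual Trains bootstrap; the project and task names here are just placeholders:)
```python
from trains import Task

# these two lines are all the instrumentation Trains needs to start tracking a run
task = Task.init(project_name='my project', task_name='my experiment')
```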
I think I understand now that the trains.conf has to be located on the node running the trains-agent.
When starting an additional trains-agent that was not instantiated by docker-compose, so it is not part of the same network, I get problems finding the api_server. localhost:8008 for sure will not work. I identified the IP of the server running in docker with docker inspect ... and edited ~/trains.conf using it, but unfortunately it still cannot find the apiserver
```
(py38) wgo@NVidi...
```
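In case it helps anyone else, this is roughly the shape of the api section in ~/trains.conf that has to point at the server instead of localhost (10.1.2.3 is only a placeholder for the real server address):
```
api {
    # replace the placeholder with the address of the machine running trains-server
    api_server: http://10.1.2.3:8008
    web_server: http://10.1.2.3:8080
    files_server: http://10.1.2.3:8081
}
```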
AgitatedDove14 I still do not understand how I can deploy the trains-agent docker image to my trains-server installation so that the 'default' queue will be handled.
Once I can do this, it should not be a big thing to add additional workers for more queues.
I found a template for k8s, but as I'm quite new to Kubernetes I don't know how to use it.
As I use Rancher, I'm able to even edit the trains-agent deployment. I added an additional command to handle the default queue as well, but it seems not ...
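My understanding is that, outside of k8s, the same thing can be expressed as a plain daemon command like the sketch below (the docker image name is just an assumption):
```
trains-agent daemon --queue default --docker nvidia/cuda --foreground
```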
AgitatedDove14 today I managed to run what I couldn't a month before :)
I didn't understand correctly what you wrote me that time.
The issue I had was that wget was missing in the trains-agent image, so I was not able to make a system call to wget.
Now I managed to do so, based on the input you gave me, by adding agent.docker_preprocess_bash_script = [...]
to my trains.conf, and it worked out of the box
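For reference, my section looks roughly like this (the key name is the one from your hint; the exact apt-get lines are just what I'd expect to need for wget):
```
agent {
    docker_preprocess_bash_script = [
        "apt-get update",
        "apt-get install -y wget",
    ]
}
```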
Basically this issue was the reason why I started learning how to create a Kube...
regarding the clean-up service, do I need to run it as a cron job, or does the trains server support some kind of add-on location where I need to copy the script?
another question I have: are the trained models stored (I guess they are stored) in MongoDB or in the file system, and which format is used?
- how can I enable TensorBoard and have the graphs stored in Trains?
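From what I read in the docs, Trains hooks the TensorBoard writers automatically once Task.init() is called, so a sketch like this should be enough (using PyTorch's SummaryWriter here is my assumption):
```python
from trains import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name='examples', task_name='tensorboard test')

# anything written through SummaryWriter should show up under the task's scalars in Trains
writer = SummaryWriter('runs/demo')
for step in range(100):
    writer.add_scalar('train/loss', 1.0 / (step + 1), step)
writer.close()
```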
ok thanks, will need to run some tests later
I ran a local (not dockerized) trains-agent: `trains-agent daemon --queue training --create-queue --foreground`
which enabled me to see the GPU load in the corresponding view
Now I got another issue.
It seems that when cloning an experiment, a virtual environment is created with all the modules identified as being used, and the experiment runs inside this environment.
Am I right?
Is this the case only for clones?
In my Python code I'm trying to read a pandas table which I stored i...
Sorry, but I don't understand how the cloned experiment is provided with parameters.
A task that has been cloned by Trains might get its parameters via task.set_parameters(dict);
these parameters come from some magic analysis of the argparse usage in the script.
AgitatedDove14 when is the call to set_parameter(...) performed? Is the argparse call somehow redirected so that it receives the data from Trains instead of getting it via sys.argv or wherever argparse is gettin...
well, I managed to clone an experiment and adapt its parameters on the trains server via the browser.
If argparse is used, no parameter may be defined as required. Instead the script itself has to check, after parsing the parameters, whether something mandatory is missing and terminate.
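In code, the pattern I ended up with looks roughly like this sketch (--algorithm is just the one argument from the error further down; the real script has more):
```python
import argparse
from trains import Task

# Task.init() hooks argparse, so on a cloned run the values edited in the
# web UI are injected instead of whatever is on sys.argv
task = Task.init(project_name='examples', task_name='modeller')

parser = argparse.ArgumentParser()
# deliberately not required=True, otherwise the clone aborts before
# Trains can inject the parameter
parser.add_argument('--algorithm', default=None)
args = parser.parse_args()

# enforce the mandatory parameter manually after parsing
if args.algorithm is None:
    parser.error('the following arguments are required: --algorithm')
```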
Doing so worked fine for me, at least for this part of the work. Now fastparquet and missing packages are failing again...
api_server and web_server look ok:
```
(py38) wgo@NVidia-power:~/dev/Trains/trains$ curl
{"meta":{"id":"bb5cd73435fb4127b9509ce3a771e95b","trx":"bb5cd73435fb4127b9509ce3a771e95b","endpoint":{"name":"","requested_version":1.0,"actual_version":null},"result_code":400,"result_subcode":0,"result_msg":"Invalid request path /","error_stack":null},"data":{}}
(py38) wgo@NVidia-power:~/dev/Trains/trains$ curl
<!doctype html>
<html lang="en">
<head> <meta charset="utf-8"> <title>trains</title> <base href="/"> <meta name="vie...
```
btw: at https://allegro.ai/docs/task.html#task.Task.enqueue the link to the 'Use Case Examples' is broken
After adding the `import fastparquet` statement to the code, the reconstruction of a clone is working
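i.e. the change was literally just the import; a minimal sketch (the read_parquet call is only to illustrate where fastparquet actually gets used):
```python
import fastparquet  # explicit import so Trains records fastparquet as a required package
import pandas as pd

df = pd.read_parquet('table.parquet', engine='fastparquet')
```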
```
Summary - installed python packages:
...
- fastparquet==0.4.1
...
Environment setup completed successfully
Starting Task Execution:
...
modeller.py: error: the following arguments are required: --algorithm
```
Unfortunately it raises the next issue.
If the script being used expects to get parameters via the command line (which in Trains experiments are identified and stored as parameters when using...
AgitatedDove14 unfortunately all attempts to get any response from the web UI failed
```
(py38) wgo@NVidia-power:~$ ping 10.43.138.186
PING 10.43.138.186 (10.43.138.186) 56(84) bytes of data.
^C
--- 10.43.138.186 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3062ms
(py38) wgo@NVidia-power:~$ curl http://10.43.97.217:30080
^C
(py38) wgo@NVidia-power:~$ curl http://10.43.138.186
^C
(py38) wgo@NVidia-power:~$ curl http://10.43.138.186...
```
also the webserver pod's log contains entries
AgitatedDove14 the index `astype(str)` did the magic, thanks
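For the record, the fix was just this (the frame is a stand-in for my real table; the point is the DatetimeIndex):
```python
import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.5, 1.7]},
                  index=pd.date_range('2020-09-01', periods=3))

# converting the index to plain strings is what made the dates plot correctly
df.index = df.index.astype(str)
```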
AgitatedDove14 unfortunately I still have issues with the plot. After removing the first row I get a weird empty remote plot where the axis is a counter instead of a date. It seems not to be clearml related, and I need to get more in touch with plotly to analyze it.
file_server does not:
```
(py38) wgo@NVidia-power:~/dev/Trains/trains$ curl
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>405 Method Not Allowed</title>
<h1>Method Not Allowed</h1>
<p>The method is not allowed for the requested URL.</p>
```
Sounds good :) I'm currently trying to run an orca instance ... but without success
or do you mean the machine where I ran the experiment locally?