trains ( 0.15.1-367 ) appears to be the version, same as you. Thank you, it appears Trains is up to date.
Apparently there should be 6 of them:
Hmm, after connecting with the VPN again and using Ctrl + F5, there is no complaint anymore. However, a colleague uploaded a Seaborn plot and it's still not showing up, which I thought was fixed in the new version?
The plots page of that experiment is pure white, not the usual "No chart data" you get when no plot was uploaded.
After a while I get the message:
New version available
Click the reload button below to reload the web page
I click the "RELOAD" button and the "newer version" message disappear. However, some plots still don't show up (fixed in 0.15.1). If I refresh the TRAINS webinterface, the "newer version" message appears again.
Is there any way to figure out in the web interface which version of Trains is actually running?
It's my colleague's experiment (with scikit-learn), so I'm not sure about the details.
TimelyPenguin76 The colleague is actually a her, but she replied that how it looks now is correct? We're actually both already past our working hours (weekend :D), so we'll take a look at it after the weekend. If there is still something wrong, I'll get back to you. Thanks for offering help though :)
The relevant commit that deleted the trains logger from Bolts:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/commit/91393eaa2751dc58c26cec6581aba19d63fa42f8
AgitatedDove14 Done!
The only change I made in the .yml file was:
` ports:
  - "8080:80" `
to
` ports:
  - "8082:80" `
I already had something running on 8080, but since it's the trains-apiserver and not the webserver, this shouldn't be an issue.
First I tried without --build, but got the same problem. --build just means that it will re-download all layers instead of using the ones already cached.
Exactly, so that remapping of port 8080 should not be the reason for this issue.
Ah my bad, it seems I had to run docker-compose -f /opt/trains/docker-compose.yml pull once. I quickly tried Trains like half a year ago, so maybe it was still using the old images? However, I thought --build would take care of that.
Now it's working 🙂
Thank you 😉
With PyTorch Lightning, I only use this line at the beginning of a Jupyter Notebook: Task.init(project_name=project_name, task_name=task_name)
The code that logs the confusion matrix is in some .py file, though, which does not have any Trains code.
Is it possible to log it in a TB-compatible way that will be automatically picked up by Trains? I prefer to keep the .py file Trains-free.
AgitatedDove14 TB has the confusion matrix like this:
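Roughly what I have in mind, as a sketch (the tag name and the dummy matrix are placeholders; in the real setup the writer would presumably be the one behind the Lightning TensorBoardLogger):

```python
# Sketch: log the confusion matrix as an image through plain TensorBoard,
# so the .py file needs no Trains imports at all.
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter


def log_confusion_matrix(writer, cm, step):
    fig, ax = plt.subplots()
    im = ax.imshow(cm, cmap="Blues")
    fig.colorbar(im)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    # "confusion_matrix" is just an example tag; add_figure closes the figure for us
    writer.add_figure("confusion_matrix", fig, global_step=step)


# Stand-in writer; with Lightning it would come from the trainer's logger
writer = SummaryWriter()
log_confusion_matrix(writer, np.random.randint(0, 50, size=(10, 10)), step=0)
writer.close()
```

If Trains hooks TensorBoard automatically, that should be enough to get the image into the experiment without any Trains code in the .py file.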
That's useful to know! But actually in this case I just want to test if the code works (run 2 epochs and see if it runs). I don't want this to be logged, so I don't call Task.init() in those cases.
I don't want the code to crash on Trains in those cases.
I see that Task.current_task() returns None if no task is running, so I can use that with an if statement 🙂
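So the .py file can stay Trains-free with something like this (just a sketch; the report_text call is only an example of "log only if a task exists"):

```python
from trains import Task

task = Task.current_task()
if task is not None:
    # Only reached when Task.init() was called somewhere, e.g. in the notebook
    task.get_logger().report_text("confusion matrix computed")
```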
Aah, I couldn't find it under PLOTS, but indeed it's there under DEBUG SAMPLES.
AgitatedDove14 There is only an events.out.tfevents.1604567610.system.30991.0 file.
If I open this with a text editor, most of it is unreadable, but I do find the letters "PNG" close to the name of the confusion matrix. So it looks like the image is encoded inside the TB log file?
So if I want it under plots, I would need to call e.g. report_confusion_matrix, right?
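Something like this, if I understand the API correctly (project/task names and the dummy matrix are placeholders):

```python
import numpy as np
from trains import Logger, Task

task = Task.init(project_name="examples", task_name="confusion matrix test")
cm = np.random.randint(0, 100, size=(10, 10))  # dummy matrix instead of the real one

# Reported this way it should end up under PLOTS instead of DEBUG SAMPLES
Logger.current_logger().report_confusion_matrix(
    title="Confusion matrix",
    series="validation",
    iteration=0,
    matrix=cm,
)
```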
Ah I see, it's based on a naming scheme, thanks. Sorry I forgot to link the tutorial I was looking at: https://allegro.ai/docs/examples/frameworks/pytorch/pytorch_tensorboard/
Port 8008 cannot be changed, apparently:
https://allegroai-trains.slack.com/archives/CTK20V944/p1592478619463200?thread_ts=1592476990.463100&cid=CTK20V944
I have a numpy array, but I indeed didn't see a TB way of doing it. I guess it's not really an issue to add it. The code should also be usable without Trains. How should I test if there is a current task? (I need the VPN on to log to Trains, which can be annoying for small tests)
As there are quite a few hparams, which also change depending on the experiment, I was hoping there was some automatic way of doing it? For example, that it would try to find all dict entries that match "yet_another_property_name": "some value", and ignore those that don't.
The value has to be converted to a string btw?
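By hand it would be something like this (a sketch only; the hparams dict, the filter and the connect() call are what I imagine doing, not an existing automatic feature):

```python
from trains import Task

# Made-up hyperparameter dict; in practice this comes from the experiment config
hparams = {"lr": 1e-3, "batch_size": 32, "optimizer": "adam", "model": object()}

task = Task.init(project_name="examples", task_name="hparams test")

# Keep only the entries with simple values, convert them to strings,
# and ignore everything else (here: the "model" entry)
simple = {k: str(v) for k, v in hparams.items() if isinstance(v, (str, int, float, bool))}
task.connect(simple)
```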
SuccessfulKoala55 Thank you. I kept staring at trains-apiserver, but by coincidence I found this message:
` trains-elastic | {"type": "server", "timestamp": "2020-11-10T06:11:08,956Z", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "trains", "node.name": "trains", "message": "flood stage disk watermark [95%] exceeded on [QyZ2i1mxTG6yR7uhVWjV9Q][trains][/usr/share/elasticsearch/data/nodes/0] free: 43.3gb[4.7%], all indices on this node will be ...
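In case someone else hits this: the flood-stage watermark means the disk Elasticsearch uses is more than 95% full, and ES then marks the indices read-only. After freeing up disk space, the block may still need to be cleared by hand; something along these lines should do it (assuming the trains-elastic container is reachable on localhost:9200, which depends on the docker-compose setup):

```python
import requests

# Clear the read-only block Elasticsearch sets once the flood-stage disk
# watermark (95%) has been exceeded. Only do this after freeing disk space.
resp = requests.put(
    "http://localhost:9200/_all/_settings",
    json={"index.blocks.read_only_allow_delete": None},
)
print(resp.status_code, resp.text)
```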
Even when I do a "clean install" (renamed the /opt/trains folder) and followed the instructions to set up Trains, the error appears.
It seems to be related to trains-apiserver, based on the log inside the Docker compose:
` trains-apiserver | [2020-11-10 04:40:14,133] [8] [ERROR] [trains.service_repo] Returned 500 for queues.get_next_task in 20ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-11', '_type': '_doc', '_id': 'rkh0sHUBwyiZSyeZUAov', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [queu...
Hi AgitatedDove14
Not using trains-agent yet. Just using PyTorch Lightning in a Jupyter Notebook with Trains as the logger.
So I'm talking about runtime and GPU usage in experiments.
AgitatedDove14 Thank you, this code example is very helpful!
I see that Trains was removed 2 days ago: https://github.com/PyTorchLightning/pytorch-lightning/commit/41f5df18a4b96ce753263fadd9c27f1d30e5d7a2
and has instead been moved to Bolts: https://github.com/PyTorchLightning/pytorch-lightning-bolts
However, I cannot find a reason why only Trains was moved?