Ah I see, it's based on a naming scheme, thanks. Sorry, I forgot to link the tutorial I was looking at: https://allegro.ai/docs/examples/frameworks/pytorch/pytorch_tensorboard/
Hi AgitatedDove14
Not using trains-agent yet. Just using PyTorch Lightning in a Jupyter Notebook with Trains as the logger.
So I'm talking about runtime and GPU usage in experiments.
Even when I do a "clean install" (renamed the /opt/trains folder) and follow the instructions to set up TRAINS, the error appears.
Aah, I couldn't find it under PLOTS, but indeed it's there under DEBUG SAMPLES.
I have a numpy array, but I indeed didn't see a TB way of doing it. I guess it's not really an issue to add. The code should also be usable without Trains, though. How should I test whether there is a current task? (I need a VPN to log to TRAINS, which can be annoying for small tests.)
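To check for a current task, a minimal guarded sketch might look like this (assuming the `trains` Python package; `get_active_task` is my own helper name, not a Trains API):

```python
def get_active_task():
    """Return the currently initialised Trains task, or None if the
    trains package is not installed or no task has been created."""
    try:
        from trains import Task  # optional dependency
    except ImportError:
        return None
    return Task.current_task()


# Code that should also work without Trains can branch on the result:
task = get_active_task()
if task is not None:
    task.get_logger().report_text("running with Trains logging enabled")
```

This way small tests run fine without the VPN: the import guard handles a missing package, and `Task.current_task()` returns None when no task has been initialised.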
First I tried without `--build`, but same problem. `--build` just means that it rebuilds the images instead of using the ones already cached.
Exactly, so that remapping of port `8080` should not be the reason for this issue.
AppetizingMouse58 If I run `sudo chmod -R 771 /opt/trains/` (taking all permissions away from others except execute), the file permission error comes back, even though everything is under the root user. Is it possible it's not just about the root user, but also the root group?
Ok, it's that the group also has to be root. I ran the following: `sudo chmod -R 775 /opt/trains/` and `sudo chown -R root:root /opt/trains`, and it works. It seems that it has to be `775` with both user and group as root. E.g. `771` does not work, because then the `docker` command has to be used with `sudo` (if I want to use my default sudo-user account).
Would have been nice if they had reached out to you guys/gals before removing Trains 😅
The only change I made in the .yml file was:
```
ports:
  - "8080:80"
```
to:
```
ports:
  - "8082:80"
```
I already had something running on 8080, but since it's the trains-apiserver and not the webserver, this shouldn't be an issue.
Ok, thanks for the info 🙂
After a while I get the message:
New version available
Click the reload button below to reload the web page
I click the "RELOAD" button and the "newer version" message disappears. However, some plots still don't show up (fixed in 0.15.1). If I refresh the TRAINS web interface, the "newer version" message appears again.
Is there any way I can figure out in the web interface which version of Trains is actually running?
TimelyPenguin76 The colleague is actually a her, and she replied that the way it looks now is correct. We're actually both already past our working hours (weekend :D), so we'll take a look at it after the weekend. If there is still something wrong, I'll get back to you. Thanks for offering to help though :)
AgitatedDove14 Done!
AgitatedDove14 There is only a `events.out.tfevents.1604567610.system.30991.0` file. If I open it with a text editor, most of it is unreadable, but I do find the letters "PNG" close to the name of the confusion matrix. So it looks like the image is encoded inside the TB log file?
`trains (0.15.1-367)` appears to be the version, same as you. Thank you, so Trains appears to be up to date.
Apparently there should be 6 of them. `/opt/trains/`:
```
$ ls -al
total 120
drwxrwsrwx  7 root miniconda 4096 Nov  2 18:15 .
drwxr-xr-x 15 root root      4096 Oct  5 15:12 ..
drwxrwxrwx 38 root miniconda 4096 Nov  2 18:15 agent
drwxrwxrwx  2 root miniconda 4096 Jun 19 14:43 config
drwxrwxrwx  8 root miniconda 4096 Nov  2 18:11 data
-rwxrwxrwx  1 root miniconda 4383 Jun 19 14:46 docker-compose_0.15.0.yml
-rwxrwxrwx  1 root miniconda 4375 Jun 26 15:06 docker-compose_0.15.1.yml
-rwxrwxrwx  1 root miniconda 4324 Nov  2 18:...
```
Same problem with 775
As there are quite a few hparams, which also change depending on the experiment, I was hoping there was some automatic way of doing it? For example, that it would try to find all dict entries that match `"yet_another_property_name": "some value"` and ignore those that don't. The value has to be converted to a string, by the way?
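Just to illustrate the kind of matching I mean, a plain-Python sketch (nothing Trains-specific; the function and dict names are made up):

```python
def entries_matching(hparam_sets, prop, value):
    """From a dict of named hparam dicts, keep only those whose entry
    for `prop` equals `value`; sets lacking the key are ignored."""
    return {name: hp for name, hp in hparam_sets.items()
            if isinstance(hp, dict) and hp.get(prop) == value}


# Hypothetical experiment configurations:
experiments = {
    "run_a": {"yet_another_property_name": "some value", "lr": 0.01},
    "run_b": {"yet_another_property_name": "other", "lr": 0.1},
    "run_c": {"lr": 0.001},  # no such key: ignored
}

matches = entries_matching(experiments, "yet_another_property_name", "some value")
```

Here `matches` contains only `run_a`; entries with a different value or without the key at all are dropped silently.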
What's the `abc` issue? Something the Lightning team is responsible for?
The relevant commit that deleted the trains logger from Bolts:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/commit/91393eaa2751dc58c26cec6581aba19d63fa42f8
Ah, my bad, it seems I had to run `docker-compose -f /opt/trains/docker-compose.yml pull` once. I quickly tried Trains about half a year ago, so maybe it was still using the old images? However, I thought `--build` would take care of that.
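For reference, the sequence that ended up working for me (paths as in the default setup); `--build` only rebuilds locally defined images from cached layers, it does not fetch newer images from the registry:

```shell
# Fetch the latest prebuilt server images from the registry
docker-compose -f /opt/trains/docker-compose.yml pull
# Restart the services using the freshly pulled images
docker-compose -f /opt/trains/docker-compose.yml up -d
```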
Now it's working 🙂
```
trains-elastic exited with code 1
trains-elastic | OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
trains-elastic | {"type": "server", "timestamp": "2020-11-02T08:04:57,699Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "trains", "node.name": "trains", "message": "uncaught exception in thread [main]",
trains-elastic | "stacktrace": ["org.elast...
```
So if I want it under PLOTS, I would need to call e.g. `report_confusion_matrix`, right?
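If I understand the Logger API correctly, something like this (matrix values and the title/series names are invented; guarded with a try/except so the snippet also runs without Trains installed):

```python
# Rows = true class, columns = predicted class; numbers are made up.
matrix = [[50, 2],
          [3, 45]]

try:
    from trains import Task
    task = Task.current_task()
except ImportError:
    task = None

if task is not None:
    # Reporting through the Logger should land under PLOTS
    # rather than DEBUG SAMPLES.
    task.get_logger().report_confusion_matrix(
        title="confusion matrix",
        series="validation",
        iteration=0,
        matrix=matrix,
    )
```

A numpy array should work for `matrix` as well; plain nested lists are used here just to keep the sketch dependency-free.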
Port `8008` cannot be changed, apparently:
https://allegroai-trains.slack.com/archives/CTK20V944/p1592478619463200?thread_ts=1592476990.463100&cid=CTK20V944
It's my colleague's experiment (with scikit-learn), so I'm not sure about the details.