Ok, it was indeed something with permission. When I chown everything to root (1000) and chmod 777 it worked. 777 is of course not desirable, so I'm going to narrow it down now.
Thank you for the reply! The migration indeed created this elastic_7 folder.
SuccessfulKoala55 Thank you. I was fixated on trains-apiserver , but by coincidence I found this message:
` trains-elastic | {"type": "server", "timestamp": "2020-11-10T06:11:08,956Z", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "trains", "node.name": "trains", "message": "flood stage disk watermark [95%] exceeded on [QyZ2i1mxTG6yR7uhVWjV9Q][trains][/usr/share/elasticsearch/data/nodes/0] free: 43.3gb[4.7%], all indices on this node will be ...
The relevant commit that deleted the trains logger from Bolts:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/commit/91393eaa2751dc58c26cec6581aba19d63fa42f8
It seems to be related to trains-apiserver , based on the log inside the Docker compose:
` trains-apiserver | [2020-11-10 04:40:14,133] [8] [ERROR] [trains.service_repo] Returned 500 for queues.get_next_task in 20ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-11', '_type': '_doc', '_id': 'rkh0sHUBwyiZSyeZUAov', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [queu...
That's useful to know! But actually in this case I just want to test whether the code works (run 2 epochs and see). I don't want this to be logged, so I don't call Task.init() in those cases.
I don't want the code to crash on Trains in those cases.
I see that Task.current_task() returns None if no task is running, so I can use that with an if statement 🙂
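For completeness, this is roughly the guard I mean (just a sketch; maybe_report_confusion_matrix is a made-up helper name, and I'm assuming the Trains Logger.report_confusion_matrix call is available here):
` from trains import Task

def maybe_report_confusion_matrix(matrix, iteration):
    # Only report when a task was initialized elsewhere (e.g. Task.init()
    # in the notebook); for quick 2-epoch test runs this is simply a no-op.
    task = Task.current_task()
    if task is None:
        return
    task.get_logger().report_confusion_matrix(
        title="confusion matrix",
        series="validation",
        matrix=matrix,
        iteration=iteration,
    ) `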
Ok, thanks for the info 🙂
` trains-elastic exited with code 1
trains-elastic | OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
trains-elastic | {"type": "server", "timestamp": "2020-11-02T08:04:57,699Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "trains", "node.name": "trains", "message": "uncaught exception in thread [main]",
trains-elastic | "stacktrace": ["org.elast...
Ah I see, it's based on a naming scheme, thanks. Sorry I forgot to link the tutorial I was looking at: https://allegro.ai/docs/examples/frameworks/pytorch/pytorch_tensorboard/
AppetizingMouse58 If I run sudo chmod 771 -R /opt/trains/ (taking all permissions away from "other" except execute), the file permission error comes back, even though everything is owned by the root user.
Exactly, so that remapping of port 8080 should not be the reason for this issue
Aah, I couldn't find it under PLOTS, but indeed it's there under DEBUG SAMPLES.
As there are quite a few hparams, which also change depending on the experiment, I was hoping there was some automatic way of doing it?
For example, that it would try to find all dict entries matching "yet_another_property_name": "some value" and ignore those that don't.
The value has to be converted to a string btw?
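Something along these lines is what I had in mind (a rough sketch only; hparams here is a stand-in for my dict, and I'm assuming task.connect() is a reasonable call for this rather than a dedicated properties API):
` from trains import Task

task = Task.init(project_name=project_name, task_name=task_name)

# hparams differs per experiment; connect() just records whatever keys are
# present, so nothing has to be enumerated by hand. (Not sure whether the
# values still need to be converted to str first.)
hparams = {
    "lr": 1e-3,
    "batch_size": 32,
    "yet_another_property_name": "some value",
}
task.connect(hparams) `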
AgitatedDove14 TB has the confusion matrix like this:
I see that Trains has been removed 2 days ago: https://github.com/PyTorchLightning/pytorch-lightning/commit/41f5df18a4b96ce753263fadd9c27f1d30e5d7a2
and instead has been moved to Bolts: https://github.com/PyTorchLightning/pytorch-lightning-bolts
However, I cannot find a reason why only Trains has been moved?
Is there any way I can figure out in the web interface which version of Trains is actually running?
TimelyPenguin76 The colleague is actually a her, but she replied that how it looks now is correct? We're actually both already past our working hours (weekend :D), so we'll take a look at it after the weekend. If there is still something wrong, I'll get back to you. Thanks for offering to help though :)
Even when I do a "clean install" (renamed the /opt/trains folder) and followed the instructions to set up TRAINS, the error appears.
FrothyDog40 Thank you for your reply. I agree that MLflow's serving solution is not going to be of much help for real deployment. However, to me the advantage of quickly setting up an API access point with just 1 line of code helps with some internal trying-out. To a colleague: "Hey, this new model seems to do well, want to give it a try?".
I've set up my own Docker container with Sanic (like Flask) and indeed it's not too difficult. However, you'll still hit issues like " https://stackoverflo...
Thank you for your impression! I get a bit more of an Airflow feel for running many tasks to train models with different parameters, which is a good thing.
I'm still skimming through the documents, but the TRAINS documentation on how models are stored is a bit vague to me. The https://allegro.ai/docs/examples/examples_models/ page only briefly mentions that you can set an output location, which is a bit shallow compared with the https://mlflow.org/docs/latest/model-registry.html . Any good resource...
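The only concrete thing I've found so far is the output_uri argument of Task.init, which apparently controls where model files get uploaded; a rough sketch (the bucket path is made up):
` from trains import Task

# output_uri sets where model snapshots/checkpoints are uploaded
# (a shared folder or e.g. an S3/GS bucket); this path is just an example.
task = Task.init(
    project_name=project_name,
    task_name=task_name,
    output_uri="s3://my-bucket/trains-models",
) `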
With PyTorch Lightning, I only use this line at the beginning of a Jupyter Notebook:
Task.init(project_name=project_name, task_name=task_name)
The code to log the confusion matrix is in some .py file though that does not have any Trains code.
Is it possible to log it in a TB-compatible way that will be automatically picked up by Trains? I prefer to keep the .py Trains-free.
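Something like this is what I'm after on the .py side (a sketch using torch.utils.tensorboard; log_confusion_matrix is a made-up helper name), so the file only talks to TensorBoard and Trains can pick the figure up from the notebook side:
` import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter

def log_confusion_matrix(writer: SummaryWriter, matrix, step: int):
    # Plain TensorBoard logging: render the numpy matrix as a figure and
    # hand it to add_figure(); no Trains imports needed in this file.
    fig, ax = plt.subplots()
    im = ax.imshow(matrix, cmap="Blues")
    fig.colorbar(im, ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    writer.add_figure("confusion_matrix", fig, global_step=step) `
Figures logged this way seem to end up under DEBUG SAMPLES rather than PLOTS, as noted above.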
Port 8008 cannot be changed apparently:
https://allegroai-trains.slack.com/archives/CTK20V944/p1592478619463200?thread_ts=1592476990.463100&cid=CTK20V944
What's the abc issue? Something the Lightning team is responsible for?
Ah my bad, it seems I had to run docker-compose -f /opt/trains/docker-compose.yml pull once. I quickly tried Trains like half a year ago, so maybe it was still using the old images? However, I thought --build would take care of that.
Now it's working 🙂
The only change I made in the .yml file was:
` ports:
    - "8080:80" `
to
` ports:
    - "8082:80" `
I already had something running on 8080, but since it's the trains-apiserver and not the webserver, this shouldn't be an issue.
Thank you 😉
AgitatedDove14 Thank you, this code example is very helpful!
First I tried without build, but same problem. --build just means that it will re-download all layers instead of using the ones already cached.
Ok, it's that the user group also has to be root. I ran the following:
sudo chmod 775 -R /opt/trains/
sudo chown -R root:root /opt/trains
and it works.
It seems that it has to be 775 with both user and group as root. E.g. 771 does not work, because then the docker command has to be used with sudo (if I want to use my default sudo-user account).
I have a numpy array, but I indeed didn't see a TB way of doing it. I guess that's not really an issue to add. The code should also be usable without Trains. How should I test whether there is a current task? (I need to have a VPN on to log to TRAINS, which can be annoying for small tests)