With what memory setting do you run ES? How much memory and cpu is currently occupied by ES container?
The data that you sent looks fine. It seems that you actually has these iterations in Elasticsearch. To check whether it is the case please run the following command in the shell on your host. You should get the first 10 task events with the smallest iterations:curl -XGET -H "Content-Type: application/json" localhost:9200/events-training_stats_scalar*/_search?pretty -d' { "query": { "term": {"task": "d45ecb5ad7084175bd83dd39777b10c5"} }, "sort": {"iter": "asc"} }'
MassiveHippopotamus56 The data that you posted from the browser developers tool seems coming from the "Headers" tab. Can you please post the data from the "Payload" and "Response" tabs. This is in case you run in Chrome. In other browsers the tabs may have different names
Hi CooperativeFox72 , how much free space do you have on your disk now? If you run du on your /opt/trains/data/elastic_7 folder in let's say 5 mins intervals do you see the folder size is growing?
Great! What error do you still see in UI when comparing more than 20 experiments? At the time of error do you see any error response from the apiserver (in the browser network tab)? When the call to compare of 20+ task metrics succeed how much time does it usually takes in your environment?
Can you share all the error info that you get in the network tab?
Hi DilapidatedDucks58 , I am trying to reproduce the "Connection is full warning". Do you override any apiserver environment variables is docker compose? If yes then can you share your version of docker-compose? Do you provide a configuration file for gunicorn? Can you please share it?
Hi @<1558986867771183104:profile|ShakyKangaroo32> , can you please share the logs from the async_delete docker container?
The 1.10 version handles files deletion differently so there is chance that it fixes the issue. If you use the default apiserver port then I would try upgrading. If you override the apiserver port then please wait for the hotfix version 1.10.1 that should be released soon
I would backup the dbs prior to the upgrade so that you can rollback in case any issue arise in the upgrade process
Hi @<1523701868901961728:profile|ReassuredTiger98> , what version of the apiserver are you using?
Hi @<1523707653782507520:profile|MelancholyElk85> , what version of the apiserver are you using?
We found the issue. It will be fixed in the upcoming patch for the open-v1.14 release
As long as you delete only from the deleted tasks folders it should be OK
SubstantialBaldeagle49 This is fine. When you start docker-compose it takes different time for the services to start. Apiserver waits for the Elasticsearch to start and proceeds once it is ready. Can you reproduce the buckets issue and share the apiserver log that contains it?
SubstantialBaldeagle49 The log looks OK. Where do you see the error?
The index "events-plot-d1bd92a3b039400cbafc60a7a5b1e52b" is red meaning that it is corrupted and elastic cannot work with it. The most straightforward solution would be to delete this index but it will result in all the plots generated so far will be lost.
I am not sure about the reasons. What you can do is to backup your /opt/trains/data folder periodically (preferably stopping the docker compose before it). Another possibility is to configure your elasticsearch to run as a cluster with 2 or more nodes on the same or different machine. This will allow elastic to replicate your indices to other nodes.
What can be seen in the logs is that for some reason Elasticsearch had internal failure when trying to perform the plots query. I will send you the instruction on how to check for the health of ES nodes. It may provide us with some clues
Please run these commands and see if you have any "red" statuses in the output:
curl " http://localhost:9200/_cluster/health?pretty "
curl " http://localhost:9200/_cluster/health?level=indices&pretty "
SubstantialBaldeagle49 Well, I see. Elaticsearch does not support putting that large number into max_buckets. From the error message that I see in the apiserver log I am not sure that the original problem is connected to the buckets number. Can you please revert the max_bucket change, reproduce the original problem and share the elasticsearch log?
SubstantialBaldeagle49 This should collect the logs: 'sudo docker logs trains-apiserver >& apiserver.logs'
Yes, it is safe to put number_of_replicas to 0 and refresh_interval to -1 for the target index before the reindex and then put them back after the reindex is finished
Hi JitteryCoyote63 , are you still missing a month of data in the event logs? If you do cat indices do you see the same amount of docs in the original and the new ones?
We can compare with the table that you sent yesterday. Unless a lot of new events were written since then
This one is indeed dynamic but can be set as follows: "plot_len":{"type":"long"}
The tasks themselves will stay until you succeed to delete them from the client. Here we tried to see why deleting their data from ES timed out. From what I see no data was actually deleted (most likely because of the previous delete efforts that actually deleted the data though caused time out in the apiserver). What seems problematic is the amount of time that each operation took (19 and 16 seconds). It may be due to insufficient memory/cpu allocation to ES container or due to the 50Gb inde...
Hi @<1523701868901961728:profile|ReassuredTiger98> , how exactly do you override the values in storage_credentials file? Do you prepare a new docker image with the changed file or map this file from outside with the volume mapping in the docker compose or through the env variables? What is also important is that you do this override for the async_delete service. It is the service that actually uses the storage credentials. Not the apiserver itself
Hi RotundSquirrel78 , can you please check that your docker compose file has the correct volume mapping for elasticsearch service? From the output of the upgrade script I assume it should be from /home/orpat/trains/data/elastic_7 into /usr/share/elasticsearch/data
Yes exactly, can you please verify that you use /home/orpat/trains/data/elastic_7 in the docker compose of 1.5?