curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d'{"persistent" : {"search.max_open_scroll_context": 1000}}'
Great! What error do you still see in the UI when comparing more than 20 experiments? At the time of the error, do you see any error response from the apiserver (in the browser network tab)? When the call to compare metrics for 20+ tasks succeeds, how much time does it usually take in your environment?
Hi H4dr1en, there is a chance that the problem is in the parallel reindexing of data. You can try replacing parallel=max(docker_resources.cpus // 2, 1)
at line 190 with
parallel=1
I think you will need to remove the /opt/trains/data/elastic_7 folder before restarting the script
Great :) Then let's try to get the logs. Maybe we can get them without changing the upgrade script. Please run 'sudo docker ps -a'. If you see an exited container named 'elastic-upgrade-7', please save its logs to a file with the command below and send the file to me:
docker logs <container_id_for_elastic-upgrade-7> >& elastic_logs.txt
We just uploaded the new update script into
https://github.com/allegroai/trains-server/releases/download/0.16.0/trains-server-0.16.0-migration.zip
It has several improvements and there is a chance that it will overcome the issue that you are facing. Also, please check that you have enough disk space for copying the ES data.
Thanks for the update. What can be seen from the log is that, for some reason, Elasticsearch 7 becomes unavailable after a couple of indices have been copied. I think we can find the reason in the Elasticsearch 7 logs. I can send you instructions on how to proceed (it requires a minimal change to the upgrade script so that the upgrade containers are not removed after the script runs, followed by an inspection of the ES7 logs)
Hi @<1523701457835003904:profile|AbruptHedgehog21> can you please share the logs from the async_delete service? It is responsible for the actual deletion of the data
We can compare with the table that you sent yesterday. Unless a lot of new events were written since then
There should be a log file in the directory where you run the script. It contains more info. Can you please send me the log?
Do you see any error in the browser network tab?
Yes, it is safe to put number_of_replicas to 0 and refresh_interval to -1 for the target index before the reindex and then put them back after the reindex is finished
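In case it helps, here is a minimal sketch of the two settings bodies described above, plus a small helper that sends them with the standard index `_settings` endpoint. The ES address and the index name in the usage note are assumptions, and the "after" values shown are just the Elasticsearch defaults (1 replica, 1s refresh); put back whatever values your index had before.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable here, as in the curl commands in this thread.
ES_URL = "http://localhost:9200"


def reindex_settings(replicas, refresh_interval):
    """Build the request body for PUT /<index>/_settings."""
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }


def put_index_settings(index, body, es_url=ES_URL):
    """Send the settings body to Elasticsearch and return the parsed reply."""
    req = urllib.request.Request(
        f"{es_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Apply to the target index before the reindex starts.
BEFORE_REINDEX = reindex_settings(0, "-1")
# Apply once the reindex is finished. Assumption: ES defaults; use your own previous values.
AFTER_REINDEX = reindex_settings(1, "1s")
```

Usage would be something like `put_index_settings("my-target-index", BEFORE_REINDEX)` before the reindex and `put_index_settings("my-target-index", AFTER_REINDEX)` afterwards, where `my-target-index` is a placeholder name.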
Hi QuaintJellyfish58, thanks for the feedback. I am trying to compare what you send and receive in the Team's Work view with what you get in the My Work view. Can you please also send the same requests and responses for the My Work view, structured the same way as the data you just sent for the Team's Work view?
Hi QuaintJellyfish58, I am investigating the issue. Can you please also send the request and response from projects.get_all when you are in the Team's Work view (the case where there is no undefined project)?
Hi QuaintJellyfish58, it seems that we identified the problem. The undefined project that you see is not a real project. It is a placeholder where the statistics of the ex-1 project should be shown. We found a bug in the apiserver that, under these particular conditions, fails to return the ex-1 project data, so the placeholder remains empty (undefined). If I understand correctly, it should only cause inconvenience and not affect your workflow. Is that correct? We are fixing the issue in the n...
Hi SubstantialElk6, another thing that can be checked is the health of the particular ES indices. Can you please run the below command in the clearml-elastic container and post the results here?
curl -XGET
Hi VexedPeacock35, I suspect that Elasticsearch is overloaded and periodically times out when recording events. How much memory and CPU is it using? Can you increase the memory that is allocated to it and see whether this helps?
Hi @<1523701868901961728:profile|ReassuredTiger98> , what version of the apiserver are you using?
@<1523701868901961728:profile|ReassuredTiger98> Strange:( in 1.10 we already had the code for clearing ES scrolls created during the task deletion. I would recommend upgrading to the latest release v1.12.1 anyway. In addition you can instruct ES to allow more open scrolls like below. By default it is limited to 500.
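To see how close you are to that limit, one option is to read the scroll counters from the nodes stats API. A rough sketch, assuming ES is reachable at localhost:9200 and that the search stats expose a `scroll_current` counter per node (please verify against your ES version); the helper names are mine, not part of ClearML:

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable here, as in the curl commands in this thread.
ES_URL = "http://localhost:9200"
DEFAULT_SCROLL_LIMIT = 500  # the default limit mentioned above


def fetch_search_stats(es_url=ES_URL):
    """Fetch per-node search statistics from the nodes stats API."""
    with urllib.request.urlopen(f"{es_url}/_nodes/stats/indices/search") as resp:
        return json.load(resp)


def open_scrolls_per_node(nodes_stats):
    """Map node name (or node id if the name is missing) to its open scroll count."""
    return {
        node.get("name", node_id): node["indices"]["search"].get("scroll_current", 0)
        for node_id, node in nodes_stats["nodes"].items()
    }


def nodes_at_limit(nodes_stats, limit=DEFAULT_SCROLL_LIMIT):
    """Return the names of nodes whose open scroll count has reached the limit."""
    counts = open_scrolls_per_node(nodes_stats)
    return sorted(name for name, count in counts.items() if count >= limit)
```

Running `print(open_scrolls_per_node(fetch_search_stats()))` while a task deletion is in progress should show whether the scrolls are piling up.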
If you run the following command 'sudo chown -R 1000:1000 /opt/trains' does it change anything?
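If you want to check the ownership before (or after) running chown, something like the sketch below lists the entries under a folder that are not owned by uid 1000. It is only an illustration; the function name is mine and it needs read access to the folder tree.

```python
import os

ES_UID = 1000  # the uid used in the chown command above


def not_owned_by(root, uid=ES_UID):
    """Return paths under root (root included) whose owner uid differs from uid."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then every file directly inside it.
        candidates = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in candidates:
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return mismatched
```

For example, `print(not_owned_by("/opt/trains"))` should print an empty list once the chown has been applied everywhere.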
Can you share all the error info that you get in the network tab?
Hi DilapidatedDucks58, I am trying to reproduce the "Connection is full" warning. Do you override any apiserver environment variables in docker-compose? If yes, can you share your version of the docker-compose file? Do you provide a configuration file for gunicorn? If so, can you please share it?
Hi JitteryCoyote63, are you still missing a month of data in the event logs? If you run _cat/indices, do you see the same number of docs in the original indices and the new ones?
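To make the comparison easier, here is a small sketch that takes the JSON output of `_cat/indices?format=json` from the old and the new Elasticsearch and reports the index pairs whose `docs.count` differ. The index names in the usage note are placeholders, and the helper names are mine:

```python
def docs_by_index(cat_rows):
    """Map index name to document count from _cat/indices?format=json rows."""
    # docs.count arrives as a string and can be missing/null for unavailable indices.
    return {row["index"]: int(row.get("docs.count") or 0) for row in cat_rows}


def doc_count_mismatches(old_rows, new_rows, pairs):
    """Return {(old_index, new_index): (old_count, new_count)} for differing pairs.

    pairs is a list of (old_index_name, new_index_name) tuples to compare.
    A count of None means the index was not found in the listing.
    """
    old_counts = docs_by_index(old_rows)
    new_counts = docs_by_index(new_rows)
    return {
        (old_name, new_name): (old_counts.get(old_name), new_counts.get(new_name))
        for old_name, new_name in pairs
        if old_counts.get(old_name) != new_counts.get(new_name)
    }
```

You would call it as `doc_count_mismatches(old_rows, new_rows, [("old-events-index", "new-events-index")])`; an empty result means the counts match for every pair you listed.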
We found the issue. It will be fixed in the upcoming patch for the open-v1.14 release
Enjoy the new version :) It would still be interesting to see what caused ES7 to stop responding.
SubstantialElk6 Both indices that are red are not critical for ClearML to function and can be deleted like this:
curl -XDELETE '...'
curl -XDELETE '...'
To analyze the possible causes, can you please collect the full ES logs into a file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1
Hi QuaintJellyfish58, in the latest data that you sent I see only the responses (some of them are marked as payloads but they are actually responses). It would be very interesting to see the requests (payloads) that resulted in the following empty responses:
# response
{"meta":{"id":"aaaffe49ace64f1a8b0211925afcfd32","trx":"aaaffe49ace64f1a8b0211925afcfd32","endpoint":{"name":"projects.get_all_ex","requested_version":"2.20","actual_version":"1.0"},"result_code":200,"result_subcode":0,...
Thanks, I think that I see the problem,
What about the UID for epdadmin user? 'id -u epdadmin'
The volumes section of elasticsearch service looks OK to me:
/opt/trains/data/elastic_7:/usr/share/elasticsearch/data