Hi RattyFish27 , it seems that there is some issue with the Elasticsearch cluster. Can you please run the following commands on the server and paste their output here?
curl -XGET
curl -XGET
Actually the task logs will be lost. The tasks themselves and their reported metrics and plots will stay. The command is the following:
curl -XDELETE localhost:9200/events-log-d1bd92a3b039400cbafc60a7a5b1e52b
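If you want to confirm the index name before deleting, you can list the log indices first with the standard _cat API (a sketch, assuming ES is reachable on localhost:9200):
curl -XGET 'localhost:9200/_cat/indices/events-log-*?v'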
Hi @<1523701260895653888:profile|QuaintJellyfish58> , we are in the final stages of preparing the hotfix version open-v1.14.1. It should be released this week
Strange :( What version of Elasticsearch do you currently use (before the upgrade)? Can you please share your docker compose file?
What about the UID for the epdadmin user? 'id -u epdadmin'
Just a moment, it seems that this API is supported only on ES6 or newer. From the other discussion in this channel: for ES5 you have to download the basic license and install it as described in this article: https://medium.com/@ospaarmann/tidbits-solving-the-elasticsearch-x-pack-license-issue-in-docker-d15bb22d82fd
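For reference, a sketch of installing a downloaded license on ES5 via the X-Pack license API; the file name license.json is just an example for the basic license file you download from Elastic:
curl -XPUT 'http://localhost:9200/_xpack/license?acknowledge=true' -H 'Content-Type: application/json' -d @license.json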
IdealPanda97 What can be seen now is that some of the indices (at least queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08) are in a corrupted state. This can be the result of abnormal termination of ES or some other situation. The queue metrics index is not particularly important but there may be other indices that are also corrupted. To map the cluster and indices state you can issue the following commands (with the running ES5 docker container). Look for the "red" statuses in the out...
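For example, a sketch of the usual cluster health checks (assuming ES is reachable on localhost:9200 from inside the container):
curl "http://localhost:9200/_cluster/health?pretty"
curl "http://localhost:9200/_cluster/health?level=indices&pretty"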
Thanks! This log mentions that the source Elasticsearch 5 failed during the reindex process. Can you also share the logs from the 'elastic-upgrade' service?
Sorry, I did not write it properly. You need to run the following curl command from the command line:
curl -XPOST 'http://localhost:9200/_xpack/license/start_basic'
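If the activation succeeds, the start_basic API should return an acknowledgement along these lines (illustrative output):
{"acknowledged":true,"basic_was_started":true}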
Hi IdealPanda97 , can you share the logs for the 'elastic-upgrade-7' docker container? According to the upgrade log there was some problem with Elasticsearch during indices copy.
Hi ResponsiveCamel97 , the shards and indices stats look fine. Can you please try the async delete of the task data? You can run the following line in the shell inside the apiserver container. Just replace <task_id> with your actual task id:
curl -XPOST -H "Content-Type: application/json" "…" -d'{"query": {"term": {"task": "<task_id>"}}}'
You should get in response something like this:
{"task":"p6350SG7STmQALxH-E3CLg:1426125"}
Then you can periodically ping ES on the status of the r...
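A sketch of that periodic status check, assuming the standard Elasticsearch _tasks API and the task id returned above:
curl -XGET "localhost:9200/_tasks/p6350SG7STmQALxH-E3CLg:1426125?pretty"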
Did you try restarting the docker compose since the problem started happening?
Enjoy the new version!
Hi @<1585078752969232384:profile|FantasticDuck7> , there is an apiserver configuration file: apiserver->config->default->services->storage_credentials.conf
It contains the parameters for accessing files on external storage services like S3, Google Storage or Azure. Please provide the same minio server access parameters as you do for the SDK configuration.
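As an illustration only, assuming the file follows the same HOCON layout as the SDK's aws.s3 credentials section (all values below are placeholders for your MinIO setup):
aws {
    s3 {
        credentials: [
            {
                host: "my-minio-host:9000"  # placeholder MinIO endpoint
                key: "minio-access-key"     # placeholder access key
                secret: "minio-secret-key"  # placeholder secret key
                multipart: false
                secure: false               # set to true if MinIO serves HTTPS
            }
        ]
    }
}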
The actual deletion is performed by the async_delete service. You can inspect its logs with the "sudo docker logs async_delete" command. Before configuring...
SubstantialBaldeagle49 Well, I see. Elasticsearch does not support putting such a large number into max_buckets. From the error message that I see in the apiserver log I am not sure that the original problem is connected to the buckets number. Can you please revert the max_buckets change, reproduce the original problem and share the elasticsearch log?
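If max_buckets was changed through the cluster settings API, a sketch of reverting it by nulling the override (this assumes it was applied as a persistent setting; null restores the default):
curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{"persistent": {"search.max_buckets": null}}'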
If it returns an OK result then rerun the upgrade process again.
The volumes section of the elasticsearch service looks OK to me:
/opt/trains/data/elastic_7:/usr/share/elasticsearch/data
SubstantialBaldeagle49 The log looks OK. Where do you see the error?
IdealPanda97 Ok, I see. Can you please run the following command, then restart the docker-compose and see if it makes any difference?
sudo chown -R 1000:1000 /opt/trains
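To verify the ownership change, you can list the directory; the files should now be owned by UID/GID 1000 (the user the Elasticsearch container runs as):
ls -ln /opt/trains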
Enjoy the new version :) It would still be interesting to see what caused ES7 to stop responding.
There should be a log file in the directory where you run the script. It contains more info. Can you please send me the log?
Please run these commands and see if you have any "red" statuses in the output:
curl "http://localhost:9200/_cluster/health?pretty"
curl "http://localhost:9200/_cluster/health?level=indices&pretty"
Sure, you delete it with the following command:
curl -XDELETE "http://localhost:9200/events-plot-d1bd92a3b039400cbafc60a7a5b1e52b"
Once deleted, it will be automatically recreated by the apiserver and you should see the plots from the new tasks that you run afterwards.
If you are running the old version of Trains, the same setting can be added to the elasticsearch environment section in the docker compose.
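As a sketch, such an entry in the elasticsearch service of the compose file could look like this (the setting shown is only an example of the format; substitute the setting in question):
services:
  elasticsearch:
    environment:
      - bootstrap.memory_lock=true   # example entry; replace with the actual setting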
SubstantialElk6 Both indices that are red are not critical to ClearML's functioning and can be deleted like this:
curl -XDELETE '…'
curl -XDELETE '…'
For the analysis of the possible reasons that led to it, can you please collect the full ES logs to a file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1
The index "events-plot-d1bd92a3b039400cbafc60a7a5b1e52b" is red meaning that it is corrupted and elastic cannot work with it. The most straightforward solution would be to delete this index but it will result in all the plots generated so far will be lost.
Hi DilapidatedDucks58 , I am trying to reproduce the "Connection is full" warning. Do you override any apiserver environment variables in docker compose? If so, can you share your version of docker-compose? Do you provide a configuration file for gunicorn? Can you please share it?
The data that you sent looks fine. It seems that you actually have these iterations in Elasticsearch. To check whether this is the case, please run the following command in the shell on your host. You should get the first 10 task events with the smallest iterations:
curl -XGET -H "Content-Type: application/json" localhost:9200/events-training_stats_scalar*/_search?pretty -d'{ "query": { "term": {"task": "d45ecb5ad7084175bd83dd39777b10c5"} }, "sort": {"iter": "asc"} }'
Can you try deleting the application cookie? While on the trains page, open the browser devtools, navigate to Application->Cookies, and under it delete any trains cookies that are there. I believe you will need to log in again after that.