The volumes section of the elasticsearch service looks OK to me:
/opt/trains/data/elastic_7:/usr/share/elasticsearch/data
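For reference, the corresponding snippet in the docker-compose.yml would look roughly like this (paths taken from the mapping above; adjust to your setup):
  elasticsearch:
    volumes:
      - /opt/trains/data/elastic_7:/usr/share/elasticsearch/data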
Hi SoggyBeetle95 , which version of clearml did you upgrade from? About the tasks that disappeared: do you not see these tasks at all, or do you see them with no results?
Hi Elior, chances are that you do not have enough space for Elasticsearch on your storage. Please check the ES logs and increase the available disk space.
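For example, something like this should show whether the partition holding the ES data is full, and what ES itself complains about (assuming the default /opt/trains/data location and the clearml-elastic container name used elsewhere in this thread):
df -h /opt/trains/data
sudo docker logs clearml-elastic 2>&1 | tail -n 100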
Can you run 'ls -al' in the /opt/trains/data folder and also in the /opt/trains/data/elastic_7 folder and send the output?
Hi DefeatedCrab47 , the ES docker image requires that its data folder belongs to user and group 1000:1000. If you want to transfer your existing data from trains 15.1, please follow the guide https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/
The script that is run in this guide should create the elastic_7 folder with the correct permissions and transfer all your existing data.
Actually, the task logs will be lost. The tasks themselves and their reported metrics and plots will stay. The command is the following:
curl -XDELETE localhost:9200/events-log-d1bd92a3b039400cbafc60a7a5b1e52b
Hi RattyFish27 , it seems that there is some issue with the Elasticsearch cluster. Can you please run the following commands on the server and paste their output here?
curl -XGET
curl -XGET
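The exact URLs did not survive here, but the usual first diagnostics would be something like the following (assuming ES listens on the default localhost:9200, as in the other commands in this thread):
curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_cat/indices?v'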
Are you sure that it was performed fully according to the suggested sequence? The error that you posted says that v3.6 data is incompatible with v4.4 and suggests version 4.2 or earlier. Step 3 starts with mongo 4.0, which should be able to open v3.6 data, and then a series of gradual upgrades through versions 4.0 -> 4.2 -> 4.4 is performed.
Thanks, I think I see the problem.
There should be a log file in the directory where you run the script. It contains more info. Can you please send me the log?
What about the UID of the epdadmin user? 'id -u epdadmin'
Hi RotundSquirrel78 , can you please check that your docker compose file has the correct volume mapping for the elasticsearch service? From the output of the upgrade script I assume it should map /home/orpat/trains/data/elastic_7 to /usr/share/elasticsearch/data
If you run the following command 'sudo chown -R 1000:1000 /opt/trains' does it change anything?
I am not sure about the reasons. What you can do is back up your /opt/trains/data folder periodically (preferably stopping docker compose first). Another possibility is to configure your elasticsearch to run as a cluster with 2 or more nodes on the same or different machines. This will allow elastic to replicate your indices to the other nodes.
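A minimal backup sketch, assuming you run the server with docker compose and the data lives in /opt/trains/data (the compose file directory is a placeholder):
cd /path/to/docker-compose-dir
sudo docker-compose down
sudo tar czf trains-data-backup-$(date +%F).tar.gz /opt/trains/data
sudo docker-compose up -d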
What memory settings do you run ES with? How much memory and CPU are currently occupied by the ES container?
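For example, this shows the current usage of the ES container (clearml-elastic is the container name used elsewhere in this thread):
sudo docker stats clearml-elastic --no-stream
The ES heap size itself is usually set via the ES_JAVA_OPTS environment variable in the docker compose file, e.g. ES_JAVA_OPTS=-Xms2g -Xmx2g.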
Hi QuaintJellyfish58 , I am investigating the issue. Can you please also send the request and response from projects.get_all when you are in the Team's Work view (the case where there is no undefined project)?
@<1523701868901961728:profile|ReassuredTiger98> Strange :( In 1.10 we already had the code for clearing the ES scrolls created during task deletion. I would recommend upgrading to the latest release v1.12.1 anyway. In addition, you can instruct ES to allow more open scrolls, as shown below. By default it is limited to 500.
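Something like this should do it (the setting is search.max_open_scroll_context; 1000 here is just an example value):
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent": {"search.max_open_scroll_context": 1000}}'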
Hi @<1523707653782507520:profile|MelancholyElk85> , what version of the apiserver are you using?
Hi SarcasticSparrow10 , I am trying to understand whether we have some gaps in the instructions. In the upgrade process, did you perform steps 3-10 of the instruction below? Were there any errors when performing these steps?
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_mongo44_migration
Yes exactly, can you please verify that you use /home/orpat/trains/data/elastic_7 in the docker compose of 1.5?
Hi SteadyFox10 , how many unique metrics and variants do you have in this task? We may be hitting some limit here
Hi JitteryCoyote63 , you mentioned that downloading the task logs brings all the events. It would be interesting to compare the events that appear in the downloaded log but not in the task log screen with those that are returned in the screen. Can you please share the downloaded task logs file, and the request and response that you get from events.get_task_log for the same task?
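If it is easier, you can also call the endpoint directly; a rough sketch, assuming the default apiserver port 8008 and that your server does not require authentication (the task id is a placeholder):
curl -H 'Content-Type: application/json' -d '{"task": "<task_id>"}' http://localhost:8008/events.get_task_log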
Hi QuaintJellyfish58 , it seems that we identified the problem. The undefined project that you see is not a real project. It is a placeholder where the statistics of the ex-1 project should be shown. We found a bug in the apiserver that under these particular conditions fails to return the ex-1 project data, so the placeholder remains empty (undefined). If I understand correctly, it should only cause an inconvenience but not influence your workflow. Is that correct? We are fixing the issue in the next version.
It seems that the index events-log-d1bd92a3b039400cbafc60a7a5b1e52b got corrupted. If there are no backups, the only choice would be to delete this index from elasticsearch.
SubstantialBaldeagle49 Well, I see. Elasticsearch does not support putting such a large number into max_buckets. From the error message that I see in the apiserver log, I am not sure that the original problem is connected to the buckets number. Can you please revert the max_buckets change, reproduce the original problem and share the elasticsearch log?
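To revert, you can set the value back to null so that ES falls back to its default (this assumes the change was made as a persistent cluster setting):
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent": {"search.max_buckets": null}}'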
SubstantialElk6 Both indices that are red are not critical for ClearML functioning and can be deleted like this:
curl -XDELETE ''
curl -XDELETE ''
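If you need to double-check which indices are red first, you can list them with the standard ES API (assuming the default localhost:9200) and then delete each one by name; the index name below is a placeholder:
curl -XGET 'localhost:9200/_cat/indices?v&health=red'
curl -XDELETE 'localhost:9200/<index_name>'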
For the analysis of the possible reasons that led to it, can you please collect the full ES logs into a file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1
Hi IdealPanda97 , can you please check your available disk space and available RAM? According to the logs all the services (Elastic, Mongo, Redis) fail to start
If you open the browser developer tools and navigate to the task console logs for one of the tasks that no longer shows logs, do you see any errors (red lines) in the API calls? Can you share the payload and response from the events.get_task_log call?