Hi IdealPanda97, can you share the logs for the 'elastic-upgrade-7' docker container? According to the upgrade log, there was a problem with Elasticsearch while the indices were being copied.
Please run these commands and see if you have any "red" statuses in the output:
curl "http://localhost:9200/_cluster/health?pretty"
curl "http://localhost:9200/_cluster/health?level=indices&pretty"
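If the per-index output is long, a small filter can pull out just the red indices. This is only a sketch: it assumes ES is reachable on localhost:9200 as in the commands above, it reads the `_cat/indices` listing rather than the health JSON, and `red_indices` is a hypothetical helper name, not something shipped with the server.

```shell
# red_indices: read "health index" pairs on stdin and print the names
# of the indices whose health is "red".
# Hypothetical helper, not part of trains-server.
red_indices() {
    awk '$1 == "red" { print $2 }'
}

# Example (needs the running ES container):
#   curl -s "http://localhost:9200/_cat/indices?h=health,index" | red_indices
```

An empty output means no index is reported as red.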
Hi UnevenDolphin73, how many artifacts do you have on this task? We store task metadata in Mongo, and there is a limit of 16 MB per document. While the artifact itself is not stored under the task, some metadata (notably the uri and display_data/preview) is stored for each artifact.
IdealPanda97 What can be seen now is that some of the indices (at least queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08) are in a corrupted state. This can be the result of an abnormal termination of ES or some other situation. The queue metrics index is not particularly important, but there may be other indices that are also corrupted. To map the cluster and indices state you can issue the following commands (with the ES5 docker container running). Look for the "red" statuses in the out...
Hi QuaintJellyfish58, in the latest data that you sent I see only the responses (some of them are marked as payloads, but they are actually responses). It would be very helpful to see the requests (payloads) that resulted in the following empty responses:
` # response
{"meta":{"id":"aaaffe49ace64f1a8b0211925afcfd32","trx":"aaaffe49ace64f1a8b0211925afcfd32","endpoint":{"name":"projects.get_all_ex","requested_version":"2.20","actual_version":"1.0"},"result_code":200,"result_subcode":0,...
Yes, the command would be like this: curl -XDELETE "http://localhost:9200/queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08"
If you decide to delete the "red" indices then you can proceed with the command above issuing it for each problematic index. The queue metrics index is not very important but the second one "events-logs" contains all the log messages produced by your tasks in August. You will still have debug images and scalar metrics reported by these tasks but the log messages ...
SubstantialBaldeagle49 This should collect the logs: 'sudo docker logs trains-apiserver >& apiserver.logs'
Yes, it is safe to set number_of_replicas to 0 and refresh_interval to -1 on the target index before the reindex, and then set them back after the reindex is finished.
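For reference, here is roughly what those two settings changes could look like through the index `_settings` endpoint. Treat it as a sketch under assumptions: the helper names are hypothetical, `<target_index>` is a placeholder, and the restore step assumes you noted the replica count the index had before. Note that a null refresh_interval resets it to the ES default, so pass the old value explicitly if you had customized it.

```shell
# Hypothetical helpers that print JSON bodies for the index _settings API.

# Before the reindex: no replicas and no periodic refresh.
reindex_fast_settings() {
    printf '{"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}\n'
}

# After the reindex: $1 is the replica count to restore (assumed to be the
# value the index had before). A null refresh_interval resets it to the default.
reindex_restore_settings() {
    printf '{"index": {"number_of_replicas": %d, "refresh_interval": null}}\n' "$1"
}

# Example (<target_index> is a placeholder):
#   curl -XPUT "http://localhost:9200/<target_index>/_settings" \
#        -H 'Content-Type: application/json' -d "$(reindex_fast_settings)"
#   ... run the reindex ...
#   curl -XPUT "http://localhost:9200/<target_index>/_settings" \
#        -H 'Content-Type: application/json' -d "$(reindex_restore_settings 1)"
```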
Hi SoggyBeetle95, which version of clearml did you upgrade from? About the tasks that disappeared: do you not see these tasks at all, or do you see them but with no results?
Setting up an elastic cluster requires some devops. You can search for "setup elasticsearch 7 cluster" on the internet and find some tutorials. Stopping your docker-compose once in a while and backing up the /opt/trains/data folder is more straightforward, and it also backs up the data that we store in mongodb.
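A periodic backup along those lines could be scripted roughly like this. It is a minimal sketch, not an official procedure: `backup_dir` is a hypothetical helper, and the compose file path and the backup destination in the example are assumptions that you should adjust to your setup. Run it as a user that can read the data folder.

```shell
# backup_dir SRC DEST_DIR
# Writes DEST_DIR/<basename of SRC>-<YYYYmmdd-HHMMSS>.tgz and prints its path.
# Hypothetical helper, not part of trains-server.
backup_dir() {
    src=$1
    dest_dir=$2
    archive="$dest_dir/$(basename "$src")-$(date '+%Y%m%d-%H%M%S').tgz"
    # -C keeps only the last path component inside the archive
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    printf '%s\n' "$archive"
}

# Example (compose file path and backup location are assumptions):
#   docker-compose -f /opt/trains/docker-compose.yml down
#   backup_dir /opt/trains/data /path/to/backups
#   docker-compose -f /opt/trains/docker-compose.yml up -d
```

Stopping the containers first means Elastic and Mongo are not writing to the folder while it is being archived.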
Enjoy the new version!
Oh, I see :( It turned out that the --extra-source-env option was not officially released yet. But the script that supports it can be downloaded from here: https://github.com/allegroai/trains-server/files/5080286/upgrade.zip
The status of the index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b is red, meaning that the data for this index got corrupted. Since there are no replicas, the only feasible option would be to delete this index. All the training scalar events for the old tasks would then be lost, but newly created tasks should start working fine.
curl -XDELETE
The volumes section of elasticsearch service looks OK to me:
/opt/trains/data/elastic_7:/usr/share/elasticsearch/data
Hi IdealPanda97 , can you please check your available disk space and available RAM? According to the logs all the services (Elastic, Mongo, Redis) fail to start
Ok, I see. Then you can enter the apiserver container:
sudo docker exec -it clearml-apiserver /bin/bash
And run the following commands inside the container:
curl -XGET
curl -XGET
IdealPanda97 Ok, I see. Can you please run the following command, then restart the docker-compose and see if it makes any difference?
sudo chown -R 1000:1000 /opt/trains
Hi CooperativeFox72, how much free space do you have on your disk now? If you run du on your /opt/trains/data/elastic_7 folder at, say, 5-minute intervals, do you see the folder size growing?
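To make the sampling easier, something like the loop below can print the folder size at a fixed interval. It is only a sketch: `sample_dir_size` is a hypothetical helper name, sizes are in KB as reported by `du -sk`, and you may need enough permissions to read the folder.

```shell
# sample_dir_size DIR INTERVAL_SECONDS COUNT
# Prints "time<TAB>size in KB" COUNT times, sleeping INTERVAL_SECONDS
# between samples. Hypothetical helper.
sample_dir_size() {
    dir=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        size_kb=$(du -sk "$dir" | awk '{ print $1 }')
        printf '%s\t%s\n' "$(date '+%H:%M:%S')" "$size_kb"
        i=$((i + 1))
        # no need to wait after the last sample
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example: 6 samples, 5 minutes apart
#   sample_dir_size /opt/trains/data/elastic_7 300 6
```

If the second column keeps increasing between samples, the folder is still growing.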
MassiveHippopotamus56 The data that you posted from the browser developer tools seems to come from the "Headers" tab. Can you please post the data from the "Payload" and "Response" tabs? That is how they are named in Chrome; in other browsers the tabs may have different names.
Hi JitteryCoyote63, are you still missing a month of data in the event logs? If you list the indices with _cat/indices, do you see the same number of docs in the original indices and the new ones?
I am not sure about the reasons. What you can do is back up your /opt/trains/data folder periodically (preferably stopping the docker compose first). Another possibility is to configure your elasticsearch to run as a cluster with 2 or more nodes, on the same machine or on different ones. This will allow elastic to replicate your indices to other nodes.
Oh, I see. Then maybe we can see some more info in the browser dev tools
Thanks, I think that I see the problem,
This explains the issue I think. The recovery path would be as follows:
1. Shut down the running containers
2. Restore both mongo and elastic data from the backup
3. Run the old version docker containers and make sure that all the data is there
4. Shut down the containers
5. Run the upgrade script
6. Start the new version
Hi SarcasticSparrow10, I am trying to understand whether we have some gaps in the instructions. During the upgrade, did you perform steps 3-10 of the instructions below? Were there any errors when performing these steps?
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_mongo44_migration
Ok, I see. And if you run a new experiment in the new version do you see its logs?
According to the sizes the data is there and ES sees it.
I mean it is not possible to open v3.6 data in version 4.4. That's why the steps 3-10 are there
Yes exactly, can you please verify that you use /home/orpat/trains/data/elastic_7 in the docker compose of 1.5?
Hi CooperativeFox72, there was a typo in the index creation instructions ("comapny" instead of "company"). Please run the following sequence in the mongo shell and then start the apiserver:
use auth
db.user.createIndex({"name": 1, "company": 1})