Hi SubstantialElk6 , another thing that can be checked is the health of the particular ES indices. Can you please run the command below in the clearml-elastic container and post the results here?
curl -XGET
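For reference, an index health listing typically looks something like the following (just a sketch, assuming Elasticsearch is listening on the default localhost:9200 inside the container; adjust the host/port to your deployment). It prints one line per index with its health, status and document count:
curl -XGET "localhost:9200/_cat/indices?v"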
Hi @<1523701260895653888:profile|QuaintJellyfish58> , we are in the final stages of preparing the hotfix version open-v1.14.1. It should be released this week
If you run the following command 'sudo chown -R 1000:1000 /opt/trains' does it change anything?
Hi ResponsiveCamel97 , the shards and indices stats look fine. Can you please try the async delete of the task data? You can run the following line in the shell inside the apiserver container. Just replace <task_id> with your actual task id:
curl -XPOST -H "Content-Type: application/json" " " -d'{"query": {"term": {"task": "<task_id>"}}}'
You should get in response something like this:
{"task":"p6350SG7STmQALxH-E3CLg:1426125"}
Then you can periodically ping ES on the status of the r...
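If it helps, the status of such a background delete can usually be polled through the Elasticsearch tasks API, passing the task value returned above (a sketch, assuming ES is reachable on localhost:9200; the id below is just the example from the response). The reply contains "completed": true once the deletion has finished:
curl -XGET "localhost:9200/_tasks/p6350SG7STmQALxH-E3CLg:1426125?pretty"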
Hi QuaintJellyfish58 , I am investigating the issue. Can you please also send the request and response from projects.get_all when you are in the Team's Work view (the case where there is no undefined project)?
Hi JitteryCoyote63 , are you still missing a month of data in the event logs? If you run cat indices, do you see the same number of docs in the original indices and in the new ones?
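A possible way to compare the counts (assuming ES on localhost:9200 and the default events-* index names) is to list just the index name and document count columns:
curl -XGET "localhost:9200/_cat/indices/events-*?v&h=index,docs.count"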
Hi DefeatedCrab47 , the ES docker image requires that its data folder belong to the 1000:1000 user and group. If you want to transfer your existing data from trains 15.1 then please follow the guide https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/
The script that is run in this guide should create the elastic_7 folder with the correct permissions and transfer all of your existing data
If you open the browser developer tools and navigate to the console logs of one of the tasks for which you no longer get logs, do you see any errors (red lines) in the API calls? Can you share the payload and response from the events.get_task_log call?
Hi @<1523701260895653888:profile|QuaintJellyfish58> , we found and fixed the #228 issue. It will be released in the coming patch for open-v1.14
Hi VexedPeacock35 , I suspect that Elasticsearch works too hard and periodically misses timeouts on recording events. How much memory and CPU is it using? Can you increase the memory that is allocated to it and see whether this helps?
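A quick way to see the container's current memory and CPU consumption is docker stats (assuming the container is named clearml-elastic, as in the default docker compose):
sudo docker stats clearml-elastic --no-stream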
Hi QuaintJellyfish58 , it seems that we identified the problem. The undefined project that you see is not a real project. It is a placeholder where the statistics of the ex-1 project should be shown. We found a bug in the apiserver that under these particular conditions fails to return the ex-1 project data, so the placeholder remains empty (undefined). If I understand correctly, it should only cause an inconvenience and not affect your workflow. Is that correct? We are fixing the issue in the n...
Ok, I see. And if you run a new experiment in the new version do you see its logs?
Hi ImmenseMole52 , did you make any changes in the docker compose file? If yes, can you please send your version of the file?
curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d'{"persistent" : {"search.max_open_scroll_context": 1000}}'
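To verify that the setting was applied, you can read back the cluster settings (same assumption of ES on localhost:9200):
curl -XGET "localhost:9200/_cluster/settings?pretty"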
SubstantialElk6 Both indices that are red are not critical for the ClearML functioning and can be deleted like this:
curl -XDELETE ' '
curl -XDELETE ' '
To analyze the possible reasons that led to this, can you please collect the full ES logs into a file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1
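In case it is useful, red indices can be listed directly with the health filter on the cat indices API (assuming ES on localhost:9200):
curl -XGET "localhost:9200/_cat/indices?v&health=red"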
Hi @<1523707653782507520:profile|MelancholyElk85> , what version of the apiserver are you using?
Thanks, I think I see the problem.
Yes exactly, can you please verify that you use /home/orpat/trains/data/elastic_7 in the docker compose of 1.5?
Are you sure that it was performed fully, according to the suggested sequence? The error that you posted says that the v3.6 data is incompatible with v4.4 and suggests version 4.2 or earlier. Step 3 starts with Mongo 4.0, which should be able to open the v3.6 data, and then a number of gradual upgrades through versions 4.0 -> 4.2 -> 4.4 is performed
Hi @<1523701260895653888:profile|QuaintJellyfish58> . For issue #229: we found and fixed the problem. The fix will be available in the coming patch for the v1.14 release. For issue #228 I requested more info from you on GitHub
Hi @<1523701868901961728:profile|ReassuredTiger98> , how exactly do you override the values in the storage_credentials file? Do you prepare a new docker image with the changed file, map this file from outside with a volume mapping in the docker compose, or use the env variables? It is also important that you apply this override to the async_delete service: it is the service that actually uses the storage credentials, not the apiserver itself
SubstantialBaldeagle49 This is fine. When you start docker-compose, the services take different amounts of time to start. The apiserver waits for Elasticsearch to start and proceeds once it is ready. Can you reproduce the buckets issue and share the apiserver log that contains it?
@<1523701066867150848:profile|JitteryCoyote63> The requirements list the client library that the apiserver uses to access Elasticsearch. This library is capable of working with both Elasticsearch 7 and 8
The index "events-plot-d1bd92a3b039400cbafc60a7a5b1e52b" is red meaning that it is corrupted and elastic cannot work with it. The most straightforward solution would be to delete this index but it will result in all the plots generated so far will be lost.
This explains the issue, I think. The recovery path would be as follows (a rough command sketch follows the list):
1. Put down the running containers
2. Restore both the mongo and elastic data from the backup
3. Run the old version docker containers and make sure that all the data is there
4. Put down the containers
5. Run the upgrade script
6. Start the new version
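A rough sketch of what that sequence might look like for a docker-compose deployment (assuming the default /opt/clearml data layout; the backup path and the old compose file name are placeholders, adjust them to your setup):
# 1. stop the running containers
sudo docker-compose down
# 2. restore the mongo and elastic data folders from your backup, e.g.
sudo cp -r /path/to/backup/data/. /opt/clearml/data/
# 3. start the old version containers and verify the data in the UI
sudo docker-compose -f docker-compose.old.yml up -d
# 4. stop the containers again
sudo docker-compose -f docker-compose.old.yml down
# 5. run the upgrade script, then
# 6. start the new version
sudo docker-compose up -d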
Can you please run the following in the command line of the hosting server and share the results?
curl -XGET
Hi SoggyBeetle95 , what version of clearml did you upgrade from? About the tasks that disappeared: do you not see these tasks at all, or do you see them with no results?
At some point we switched from Mongo DB v3.6 to v4.4. Upgrading from old versions requires a migration of the mongo data. Did you run the upgrade script as described below? Were there any errors?
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_mongo44_migration/
Sure, you delete it with the following command:
curl -XDELETE "http://localhost:9200/events-plot-d1bd92a3b039400cbafc60a7a5b1e52b"
Once deleted, it will be automatically recreated by the apiserver, and you should see the plots from the new tasks that you run afterwards