Hi CooperativeFox72, there was a typo in the index creation instructions ("comapny" instead of "company"). Please run the following sequence in the mongo shell and then start the apiserver:
use auth
db.user.createIndex({"name": 1, "company": 1})
@<1523701868901961728:profile|ReassuredTiger98> Strange :( In 1.10 we already had the code for clearing ES scrolls created during task deletion. I would recommend upgrading to the latest release, v1.12.1, anyway. In addition, you can instruct ES to allow more open scrolls; by default the limit is 500.
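The message above does not show the actual setting. The following is a minimal sketch, assuming the limit in question is the Elasticsearch cluster setting `search.max_open_scroll_context` and that ES is reachable at `http://localhost:9200`. The host, the new value of 1000, and the helper names are illustrative assumptions, not part of the original instructions.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: ES is exposed on the local host


def build_scroll_limit_request(limit: int, es_url: str = ES_URL) -> urllib.request.Request:
    """Build a PUT request that raises the open scroll context limit cluster-wide."""
    body = {"persistent": {"search.max_open_scroll_context": limit}}
    return urllib.request.Request(
        url=f"{es_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(req: urllib.request.Request) -> str:
    """Send the request and return the raw response body (needs a running ES)."""
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network; only send() does.
request = build_scroll_limit_request(1000)  # 1000 is an arbitrary example value
print(request.get_method(), request.full_url)
```

Calling `send(request)` against a live cluster would apply the change. Raising the limit only hides leaked scrolls, so upgrading remains the primary recommendation.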
Are you running your dockers on Linux or Windows?
@<1668065560107159552:profile|VivaciousPenguin20> Did you re-import the example projects after upgrading to v1.14.1? The problem was in the import procedure itself. Tasks that were imported in previous versions will not have task results.
Hi SteadyFox10, how many unique metrics and variants do you have in this task? We may be hitting some limit here.
We just uploaded the new update script to
https://github.com/allegroai/trains-server/releases/download/0.16.0/trains-server-0.16.0-migration.zip
It has several improvements, and there is a chance that it will overcome the issue that you are facing. Also, please check that you have enough disk space for copying the ES data.
Hi DefeatedCrab47, the ES docker requires that its data folder belong to user and group 1000:1000. If you want to transfer your existing data from trains 15.1 then please follow the guide https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/
The script that is run in this guide should create the elastic_7 folder with the correct permissions and transfer all your existing data.
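To verify the ownership requirement after the migration, a small check along these lines may help. It is only a sketch: the helper name is made up, and the data path in the usage note is an assumption about where the folder lives on your host.

```python
import os
from typing import List


def find_wrong_owner(root: str, uid: int = 1000, gid: int = 1000) -> List[str]:
    """Return every directory and file under root (root included) not owned by uid:gid."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then each file directly inside it.
        for path in [dirpath] + [os.path.join(dirpath, name) for name in filenames]:
            st = os.lstat(path)
            if st.st_uid != uid or st.st_gid != gid:
                offenders.append(path)
    return offenders
```

For example, `find_wrong_owner("/opt/trains/data/elastic_7")` (path assumed) returns an empty list when everything belongs to 1000:1000; any returned paths would need their ownership fixed before the ES container can use the folder.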
Hi SoggyBeetle95, from what version of clearml did you upgrade? About the tasks that disappeared: do you not see these tasks at all, or do you see them with no results?
If it returns an OK result then rerun the upgrade process.
Strange :( What version of Elasticsearch were you using before the upgrade? Can you please share your docker compose file?
Hi @<1558986867771183104:profile|ShakyKangaroo32> , can you please share the logs from the async_delete docker container?
IdealPanda97 It seems that an expired ES5 license is the reason for both the failed upgrade and the inability to run Trains v0.15. The license is free, but the ways to renew it differ between ES5 and ES6/7. For ES5 the procedure is more complicated and is described in the Medium article that I sent earlier. In the attached thread another user applied it and it solved the issue. The article describes 2 possible solutions: turning off xpack when running Elastic and retrieving th...
Hi @<1523701457835003904:profile|AbruptHedgehog21>, can you please share the logs from the async_delete service? It is responsible for the actual deletion of the data.
Yes, it is safe to set number_of_replicas to 0 and refresh_interval to -1 for the target index before the reindex, and then restore them after the reindex is finished.
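As a rough sketch of that sequence, the two settings can be changed through the index `_settings` endpoint. The ES address, the index name `target-index`, and the restored values (1 replica and `1s`, the Elasticsearch defaults) are placeholders; restore whatever your index actually had before.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: ES is exposed on the local host


def build_index_settings_request(
    index: str, replicas: int, refresh_interval: str, es_url: str = ES_URL
) -> urllib.request.Request:
    """Build a PUT request that updates replicas and refresh interval for one index."""
    body = {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }
    return urllib.request.Request(
        url=f"{es_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Before the reindex: no replicas, refresh disabled.
before = build_index_settings_request("target-index", 0, "-1")
# After the reindex: put the previous values back (placeholders shown here).
after = build_index_settings_request("target-index", 1, "1s")

for req in (before, after):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

The requests are only built and printed here; passing each one to `urllib.request.urlopen` at the right moment (before and after the reindex) would apply them.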
It seems that Elasticsearch is failing on any search request. Can you please run the following commands and share the results?
curl -XGET
curl -XGET
@<1673863788857659392:profile|HomelyRabbit25> We are planning to release a new version, v1.15, in a few days that will support this job in the helm charts. Currently this option does not exist in the K8s deployment, and the apiserver does not delete task artifacts from external storage.
Hi @<1673863788857659392:profile|HomelyRabbit25>, yes, it should include support for the async_delete service. Please provide the storage_credentials configuration to this service instead of the apiserver. To check whether the deletion works or has any issues with the provided configuration, please inspect the logs from the async_delete pod.
@<1673863788857659392:profile|HomelyRabbit25> What happens when you delete the files from the UI? Can you please share the logs from the async_delete service? This is the service that is actually responsible for file deletion, and the S3 configuration that you prepared should be mapped into that service (not the apiserver).
Hi UnevenDolphin73, how many artifacts do you have on this task? We store task metadata in Mongo, and there is a limit of 16 MB per single document. While the artifact itself is not stored under the task, some metadata (notably the uri and display_data/preview) is stored for each artifact.
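To get a feel for how per-artifact metadata adds up against that limit, here is a rough estimate. It is only an approximation: it measures JSON size rather than the BSON size Mongo actually enforces, and the record layout (`key`, `uri`, `preview`) is a hypothetical stand-in for the real task document.

```python
import json
from typing import Dict, List

MONGO_MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's per-document size limit


def estimate_metadata_bytes(artifacts: List[Dict[str, str]]) -> int:
    """Approximate the stored size of artifact metadata via its UTF-8 JSON encoding."""
    return len(json.dumps(artifacts).encode("utf-8"))


# Hypothetical example: 1000 artifacts, each with a 4 KB preview string.
artifacts = [
    {"key": f"artifact_{i}", "uri": f"s3://bucket/path/{i}", "preview": "x" * 4096}
    for i in range(1000)
]
size = estimate_metadata_bytes(artifacts)
print(f"~{size / 1024 / 1024:.1f} MB of {MONGO_MAX_DOC_BYTES / 1024 / 1024:.0f} MB")
```

The point is that large previews multiplied by many artifacts can approach the limit even though the artifact files themselves live elsewhere.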
No, there was a problem with that particular version migration. Creating the temporary index allowed this and all subsequent migrations to run successfully. So for now your DB is properly aligned with the latest ClearML, and future upgrades should work fine.
We can compare with the table that you sent yesterday, unless a lot of new events have been written since then.
Hi SarcasticSparrow10, I am trying to understand whether we have some gaps in the instructions. In the upgrade process, did you perform steps 3-10 of the instructions below? Were there any errors when performing these steps?
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_mongo44_migration
According to the sizes, the data is there and ES sees it.
The status of the index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b is red, meaning that the data for this index got corrupted. Since there are no replicas, the only feasible option would be to delete this index. All the training scalar events for the old tasks would then be lost, but newly created tasks should start working fine.
curl -XDELETE
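To see which indices are in the red state before deleting anything, the JSON output of `GET /_cat/indices?format=json` can be filtered as in the sketch below. The sample rows are made up for illustration (only the red index name comes from this thread), and the helper name is hypothetical.

```python
import json
from typing import Dict, List


def red_indices(cat_rows: List[Dict[str, str]]) -> List[str]:
    """Return the names of indices whose health is reported as red."""
    return [row["index"] for row in cat_rows if row.get("health") == "red"]


# Illustrative sample shaped like the _cat/indices JSON response (not real output).
sample = json.loads(
    """
    [
      {"health": "red", "status": "open",
       "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},
      {"health": "green", "status": "open", "index": "example-healthy-index"}
    ]
    """
)
print(red_indices(sample))
```

Against a live server, the same function can be fed the parsed response body of the `_cat/indices` call.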
There is a "License expired" message for Elasticsearch 5. Try running the following command when your old trains docker is up:
http://localhost:9200/_xpack/license/start_basic
Can you please run the following in the command line of the hosting server and share the results?
curl -XGET
I would back up the DBs prior to the upgrade so that you can roll back in case any issue arises in the upgrade process.
Great! What error do you still see in the UI when comparing more than 20 experiments? At the time of the error, do you see any error response from the apiserver (in the browser network tab)? When the call to compare 20+ task metrics succeeds, how much time does it usually take in your environment?
Yes, the command would be like this: curl -XDELETE "http://localhost:9200/queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08"
If you decide to delete the "red" indices, you can proceed with the command above, issuing it for each problematic index. The queue metrics index is not very important, but the second one, "events-logs", contains all the log messages produced by your tasks in August. You will still have debug images and scalar metrics reported by these tasks but the log messages ...
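Issuing the delete for each problematic index can be scripted. The sketch below only builds and prints the DELETE requests so they can be reviewed first; the ES address matches the curl command above, the helper name is hypothetical, and the list should be filled with the full names of your own red indices.

```python
import urllib.parse
import urllib.request
from typing import Iterable, List

ES_URL = "http://localhost:9200"


def build_delete_requests(
    indices: Iterable[str], es_url: str = ES_URL
) -> List[urllib.request.Request]:
    """Build one DELETE request per index name; nothing is sent here."""
    return [
        urllib.request.Request(
            url=f"{es_url}/{urllib.parse.quote(name, safe='')}",
            method="DELETE",
        )
        for name in indices
    ]


# Only the queue metrics index name is known from this thread; add the others yourself.
to_delete = ["queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08"]
for req in build_delete_requests(to_delete):
    print(req.get_method(), req.full_url)
```

Once the printed list looks right, each request can be passed to `urllib.request.urlopen` to perform the deletion. Deleting an index is irreversible, so double-check the names first.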