Hi SarcasticSparrow10, I am trying to understand whether we have some gaps in the instructions. During the upgrade process, did you perform steps 3-10 of the instructions below? Were there any errors when performing these steps?
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_mongo44_migration
@<1523701066867150848:profile|JitteryCoyote63> The requirements list the client library that the apiserver uses to access Elasticsearch. This library is capable of working with both Elasticsearch 7 and 8.
Hi H4dr1en, there is a chance that the problem is in the parallel reindexing of the data. You can try to replace parallel=max(docker_resources.cpus // 2, 1)
at line 190 with
parallel=1
I think you will need to remove the /opt/trains/data/elastic_7 folder before restarting the script
We can compare with the table that you sent yesterday. Unless a lot of new events were written since then
We just uploaded the new update script to
https://github.com/allegroai/trains-server/releases/download/0.16.0/trains-server-0.16.0-migration.zip
It has several improvements and there is a chance that it will overcome the issue that you are facing. Also, please check that you have enough disk space for copying the ES data.
Are you running your dockers on Linux or Windows?
Can you share all the error info that you get in the network tab?
Hi JitteryCoyote63, are you still missing a month of data in the event logs? If you run cat indices, do you see the same number of docs in the original indices and the new ones?
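For reference (assuming the default localhost:9200 binding for Elasticsearch), you can list the per-index document counts with:
curl 'http://localhost:9200/_cat/indices?v&h=index,health,docs.count,store.size'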
The volumes section of the elasticsearch service looks OK to me:
/opt/trains/data/elastic_7:/usr/share/elasticsearch/data
Hi QuaintJellyfish58 , I am investigating the issue. Can you please also send the request and response from projects.get_all when you are in the Team's Work view (the case where there is no undefined project)?
Hi UnevenDolphin73, how many artifacts do you have on this task? We are storing task metadata in Mongo and there is a limit of 16MB per single document. While the artifact itself is not stored under the task, there is some metadata (notably the uri and display_data/preview) that is stored for each artifact.
Are you running them on the computer that hosts the server docker containers? What is the port binding for elasticsearch in your docker compose?
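As a quick check (this assumes the default 9200 port is published to the host), you can verify that Elasticsearch is reachable from that machine with:
curl 'http://localhost:9200/_cluster/health?pretty'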
👍 I would say either deploying an Elasticsearch cluster consisting of several nodes with replication, or doing daily backups:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/snapshot-restore.html
Apart from that, it is worth making sure that ES is running in a stable environment (no abrupt restarts) and with enough RAM.
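As an illustration only (the repository name and backup path below are placeholders, and the path must also be listed under path.repo in the Elasticsearch configuration), registering a filesystem snapshot repository and taking a snapshot could look like:
curl -XPUT 'http://localhost:9200/_snapshot/clearml_backup' -H 'Content-Type: application/json' -d '{"type": "fs", "settings": {"location": "/mnt/es_backups"}}'
curl -XPUT 'http://localhost:9200/_snapshot/clearml_backup/snapshot_1?wait_for_completion=true'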
This one is indeed dynamic, but it can be set as follows: "plot_len": {"type": "long"}
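For example (a sketch only, the index name is a placeholder), the mapping can be applied to an existing index with:
curl -XPUT 'http://localhost:9200/<index_name>/_mapping' -H 'Content-Type: application/json' -d '{"properties": {"plot_len": {"type": "long"}}}'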
Hi ExasperatedCrocodile76 , what version of the clearml server are you using? You can see it in the bottom right corner of the Settings screen
The status of the index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b is red, meaning that the data for this index got corrupted. Since there are no replicas, the only feasible option would be to delete this index. All the training scalar events for the old tasks would be lost then, but the newly created tasks should start working fine. Assuming the default localhost:9200 binding, the index can be deleted with:
curl -XDELETE 'http://localhost:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b'
I mean it is not possible to open the MongoDB v3.6 data directly in version 4.4. That's why steps 3-10 are there.
Hi WittyOwl57 , there is a chance that the reason is in this setting: Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log ...
First, it says something about an invalid log option, which may require further investigation. Second, the file that it tries to write to is logs/gc.log, which is not under the $clearml folder where you give write permissions to the ES user. I would try either disabling JVM logging altogether or specifying the full path to the file so that it would be under the folder that has 1000:1000 ownership.
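As an illustration only (the exact target path is an assumption), the JVM option could either point the GC log at a location that the ES user can write to:
-Xlog:gc*,gc+age=trace,safepoint:file=/usr/share/elasticsearch/logs/gc.log
or disable unified JVM logging entirely:
-Xlog:disable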
Hi @<1523701868901961728:profile|ReassuredTiger98> , what version of the apiserver are you using?
There should be a log file in the directory where you run the script. It contains more info. Can you please send me the log?
Can you run 'ls -al' in the /opt/trains/data folder and also in the /opt/trains/data/elastic_7 folder and send the output?
If you run the following command 'sudo chown -R 1000:1000 /opt/trains' does it change anything?
@<1673863788857659392:profile|HomelyRabbit25> What happens when you delete the files from the UI? Can you please share the logs from the async_delete service? This is the service that is actually responsible for the file deletion, and the s3 configuration that you prepared should be mapped into that service (not into the apiserver)
Hi ImmenseMole52, did you make any changes in the docker compose file? If yes, can you please send your version of the file?
Did you try restarting the docker compose since the problem started happening?
There is a "License expired" message for the Elasticsearch 5. Try running the following command when your old trains docker is up:
http://localhost:9200/_xpack/license/start_basic
Sorry, I did not write it properly. You need to run the following curl command from the command line:
curl -XPOST 'http://localhost:9200/_xpack/license/start_basic'
If it returns an OK result, rerun the upgrade process.
To run the old version of Trains, the same setting can be added to the elasticsearch environment section in the docker compose file.
Enjoy the new version!