Hi IdealPanda97, can you share the logs for the 'elastic-upgrade-7' docker container? According to the upgrade log, there was a problem with Elasticsearch while copying the indices.
Are you running your Docker containers on Linux or Windows?
Hi SteadyFox10, how many unique metrics and variants do you have in this task? We may be hitting some limit here.
SubstantialElk6 Neither of the two red indices is critical for ClearML to function, so both can be deleted like this: curl -XDELETE '
' curl -XDELETE '
'
To analyze what may have led to this, can you please collect the full ES logs into a file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1
Hi MortifiedDove27, you can run the following commands on the clearml server host to get the docker logs for the apiserver and elasticsearch:
sudo docker logs clearml-apiserver > apiserver.logs 2>&1
sudo docker logs clearml-elastic > elastic.logs 2>&1
Glad to hear that it helped:)
Do you mean the "search_phase_execution" error? Yes, stopping the containers, deleting the data folder and starting the containers again would bring you to a "clean install" state. But then you would lose all your data, not only the task scalar results.
SubstantialBaldeagle49 The log looks OK. Where do you see the error?
Hi @<1523701260895653888:profile|QuaintJellyfish58>. For issue #229: we found and fixed the problem. The fix will be available in the upcoming patch for the v1.14 release. For issue #228, I requested more info from you on GitHub.
Ok, I see. Then you can enter the apiserver container:
sudo docker exec -it clearml-apiserver /bin/bash
And run the following commands inside the container:
curl -XGET
curl -XGET
The status of the index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b is red, meaning that the data for this index got corrupted. Since there are no replicas, the only feasible option would be to delete this index. All the training scalar events for the old tasks would then be lost, but newly created tasks should start working fine.
curl -XDELETE
Hi @<1523701457835003904:profile|AbruptHedgehog21> can you please share the logs from the async_delete service? It is responsible for the actual deletion of the data
Hi @<1523701868901961728:profile|ReassuredTiger98>, how exactly do you override the values in the storage_credentials file? Do you prepare a new docker image with the changed file, map the file from outside with a volume mapping in the docker compose, or use env variables? It is also important that you do this override for the async_delete service: it is the service that actually uses the storage credentials, not the apiserver itself.
Hi @<1523701260895653888:profile|QuaintJellyfish58> , we are in the final stages of preparing the hotfix version open-v1.14.1. It should be released this week
Setting up an elastic cluster requires some devops work. You can search for "setup elasticsearch 7 cluster" on the internet; there are some tutorials out there. Stopping your docker-compose periodically and backing up the /opt/trains/data folder is more straightforward, and it would also back up the data that we store in mongodb.
I am not sure about the reasons. What you can do is back up your /opt/trains/data folder periodically (preferably stopping the docker compose first). Another possibility is to configure your elasticsearch to run as a cluster with 2 or more nodes on the same or different machines. This will allow elastic to replicate your indices to the other nodes.
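In case it helps, here is a minimal sketch of the periodic backup mentioned above. It only archives a folder into a timestamped tarball; the function name `backup_data_dir`, the backup destination and the compose file location in the comments are assumptions for illustration, not something shipped with the server.

```shell
# Sketch: archive a data folder into a timestamped .tar.gz file.
# backup_data_dir is a hypothetical helper, not part of trains-server.
backup_data_dir() {
    local data_dir="$1"      # e.g. /opt/trains/data
    local backup_dir="$2"    # assumed destination, must be outside data_dir
    local stamp archive
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="${backup_dir}/trains-data-${stamp}.tar.gz"
    mkdir -p "$backup_dir" || return 1
    # -C makes the paths inside the archive relative to the parent folder
    tar -czf "$archive" -C "$(dirname "$data_dir")" "$(basename "$data_dir")" || return 1
    # Print the archive path so the caller can log or copy it elsewhere
    echo "$archive"
}

# Possible use (run with sufficient permissions; the compose file path is an assumption):
#   docker-compose -f /opt/trains/docker-compose.yml stop
#   backup_data_dir /opt/trains/data /opt/trains-backups
#   docker-compose -f /opt/trains/docker-compose.yml up -d
```

Stopping the containers first keeps elasticsearch and mongodb from writing to the folder while it is being archived.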
Hi @<1585078752969232384:profile|FantasticDuck7> , there is an apiserver configuration file apiserver->config->default->services->storage_credentials.conf
It contains the parameters for accessing files on external storage such as S3, Google or Azure. Please provide the same minio server access parameters as you do in the SDK configuration.
The actual deletion is performed by the async_delete service. You can inspect its logs with "sudo docker logs async_delete" command. Before configuring...
@<1585078752969232384:profile|FantasticDuck7> What volume mappings do you have for the async_delete service in the docker-compose.yaml file?
Hi ExasperatedCrocodile76 , what version of the clearml server are you using? You can see it in the bottom right corner of the Settings screen
IdealPanda97 Is your user id 1000? If not, then this may be the reason and chown -R 1000:1000 may help. Elasticsearch in the docker container runs as user 1000. Another reason may be some other elasticsearch process or docker container running on your machine and holding the lock in the data folder. If there are any, please try stopping them. If neither of the above helps, there is the option of manually deleting the .lock files from the elastic data folder. Of course the data should be backed up before this....
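A rough sketch of the checks described above: it reports whether the elastic data folder is owned by the expected uid and lists any leftover .lock files. The helper name `check_elastic_dir` and the example folder in the comments are assumptions; nothing is changed or deleted by it.

```shell
# Sketch: inspect ownership and lock files of an elasticsearch data folder.
# check_elastic_dir is a hypothetical helper; it only reads, never modifies.
check_elastic_dir() {
    local dir="$1"
    local expected_uid="${2:-1000}"   # Elasticsearch in the container runs as uid 1000
    local uid
    # ls -ldn prints the numeric owner uid in the third column
    uid="$(ls -ldn "$dir" | awk '{print $3}')"
    if [ "$uid" = "$expected_uid" ]; then
        echo "owner ok: uid $uid"
    else
        echo "owner mismatch: uid $uid, expected $expected_uid"
    fi
    # List lock files that another process may have left behind
    find "$dir" -type f -name '*.lock'
}

# Possible use (the folder path is an assumption for your setup):
#   check_elastic_dir /opt/trains/data/elastic_7
# If the owner does not match:
#   sudo chown -R 1000:1000 /opt/trains/data/elastic_7
```

Note that it only looks at the top-level folder owner; the recursive chown is what fixes the files underneath.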
@<1523701066867150848:profile|JitteryCoyote63> The requirements list the client library that apiserver uses to access the Elasticsearch. This library is capable of working with both Elasticsearch 7 and 8
@<1668065560107159552:profile|VivaciousPenguin20> Did you re-import the example projects after upgrading to v1.14.1? The problem was in the import procedure itself. Tasks that were imported by previous versions will not have task results.
Enjoy the new version!
The data that you sent looks fine. It seems that you actually have these iterations in Elasticsearch. To check whether this is the case, please run the following command in the shell on your host. You should get the first 10 task events with the smallest iterations:
curl -XGET -H "Content-Type: application/json" 'localhost:9200/events-training_stats_scalar*/_search?pretty' -d '{ "query": { "term": {"task": "d45ecb5ad7084175bd83dd39777b10c5"} }, "sort": {"iter": "asc"} }'
MassiveHippopotamus56 The data that you posted from the browser developer tools seems to come from the "Headers" tab. Can you please post the data from the "Payload" and "Response" tabs? This is how they are named in Chrome. In other browsers the tabs may have different names.
Hi CooperativeFox72, how much free space do you have on your disk now? If you run du on your /opt/trains/data/elastic_7 folder at, say, 5-minute intervals, do you see the folder size growing?
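If it is easier than running du by hand, something like the sketch below would print the folder size a few times with a pause in between. The helper name `watch_dir_size` and its defaults (300 seconds, 3 samples) are assumptions chosen for illustration.

```shell
# Sketch: print a timestamp and the folder size in KB, several times in a row.
# watch_dir_size is a hypothetical helper.
watch_dir_size() {
    local dir="$1"
    local interval="${2:-300}"   # seconds between samples (5 minutes by default)
    local samples="${3:-3}"      # how many samples to take
    local i=0
    while [ "$i" -lt "$samples" ]; do
        # du -sk reports the total size in kilobytes; cut keeps only the number
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(du -sk "$dir" | cut -f1)"
        i=$((i + 1))
        # Sleep between samples, but not after the last one
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}

# Possible use (may need sudo to read the elastic folder):
#   watch_dir_size /opt/trains/data/elastic_7 300 3
```

If the second column keeps increasing between samples, the folder is still growing.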
The volumes section of elasticsearch service looks OK to me:
/opt/trains/data/elastic_7:/usr/share/elasticsearch/data
We just uploaded the new update script to
https://github.com/allegroai/trains-server/releases/download/0.16.0/trains-server-0.16.0-migration.zip
It has several improvements, and there is a chance that it will overcome the issue that you are facing. Also, please check that you have enough disk space for copying the ES data.
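One rough way to do the disk space check is sketched below: it compares the size of the existing ES data folder with the free space on the target filesystem. The helper name `has_room_for_copy` is hypothetical, and the assumption that the copy needs about as much space as the source folder is mine, not something the migration script states.

```shell
# Sketch: check that the destination filesystem has more free space than the source folder uses.
# has_room_for_copy is a hypothetical helper; the "copy needs roughly the source size" rule is an assumption.
has_room_for_copy() {
    local src="$1"    # folder with the existing ES data
    local dest="$2"   # folder on the filesystem where the copy will be written
    local need avail
    need="$(du -sk "$src" | cut -f1)"
    # df -Pk prints POSIX-format output; the 4th column of line 2 is the available KB
    avail="$(df -Pk "$dest" | awk 'NR==2 {print $4}')"
    echo "need ${need} KB, available ${avail} KB"
    [ "$avail" -gt "$need" ]
}

# Possible use (paths are assumptions for your setup):
#   has_room_for_copy /opt/trains/data/elastic /opt/trains/data && echo "enough space"
```

The function returns a non-zero status when the free space is not larger than the source size, so it can gate the migration in a script.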
Hi @<1668065560107159552:profile|VivaciousPenguin20>, what version of the apiserver are you running? Can you please try switching to the latest v1.14.1 version that was released last week? One of the issues it fixes is the inability to import events for the published example tasks.
Hi MassiveHippopotamus56
Can you please open the browser developer tools, navigate to the scalars tab for one of the experiments that shows the wrong iteration, and copy here the request payload and response for the events.scalar_metrics_iter_histogram call?