IdealPanda97 Ok, I see. Can you please run the following command, then restart the docker-compose and see if it makes any difference?
sudo chown -R 1000:1000 /opt/trains
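A minimal sketch of the full sequence, assuming the default compose file location under /opt/trains (adjust the paths to your setup):
# stop the stack before changing ownership (compose file path is an assumption)
sudo docker-compose -f /opt/trains/docker-compose.yml down
# give UID/GID 1000 (the user inside the containers) ownership of the data folders
sudo chown -R 1000:1000 /opt/trains
# bring the stack back up
sudo docker-compose -f /opt/trains/docker-compose.yml up -d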
Hi DilapidatedDucks58 , I am trying to reproduce the "Connection is full" warning. Do you override any apiserver environment variables in docker compose? If yes then can you share your version of docker-compose? Do you provide a configuration file for gunicorn? Can you please share it?
What about the UID for epdadmin user? 'id -u epdadmin'
Just a moment, it seems that this api is supported only on ES6 or newer. From the other discussion in this channel: for ES5 you have to download the basic license and install it as described in this article: https://medium.com/@ospaarmann/tidbits-solving-the-elasticsearch-x-pack-license-issue-in-docker-d15bb22d82fd
IdealPanda97 What can be seen now is that some of the indices (at least queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08) are in a corrupted state. This can be the result of abnormal termination of ES or some other situation. The queue metrics index is not particularly important but there may be other indices that are also corrupted. To map the cluster and indices state you can issue the following commands (with the running ES5 docker container). Look for the "red" statuses in the out...
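A sketch of the typical health-check commands, assuming the ES container exposes its HTTP API on localhost:9200 (adjust host/port to your setup):
# overall cluster status: green / yellow / red
curl -XGET 'localhost:9200/_cluster/health?pretty'
# per-index listing with a health column; look for "red" entries
curl -XGET 'localhost:9200/_cat/indices?v'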
Hi @<1585078752969232384:profile|FantasticDuck7> , there is an apiserver configuration file apiserver->config->default->services->storage_credentials.conf
It contains the parameters for accessing files on the external storages like s3, google or azure. Please provide the same minio server access parameters as you do for the SDK configuration.
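As an illustration only, assuming the file mirrors the SDK's aws.s3 layout (the exact keys here are an assumption, please verify them against the file shipped with your apiserver version), a minio entry would look roughly like:
aws {
    s3 {
        credentials: [
            {
                # placeholder minio endpoint and credentials
                host: "my-minio-host:9000"
                key: "minio-access-key"
                secret: "minio-secret-key"
                secure: false
                multipart: false
            }
        ]
    }
}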
The actual deletion is performed by the async_delete service. You can inspect its logs with the "sudo docker logs async_delete" command. Before configuring...
Can you try deleting the application cookie? While being on the trains page in the browser devtools you navigate to Application->Cookies and under it delete any trains cookies that are there. I believe that you will need to login after that
Hi CooperativeFox72 , how much free space do you have on your disk now? If you run du on your /opt/trains/data/elastic_7 folder in let's say 5 min intervals, do you see the folder size growing?
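For example, assuming the default data path, you can watch the folder size at 5-minute intervals and check the remaining disk space with:
# print the folder size every 5 minutes (300 seconds)
sudo watch -n 300 du -sh /opt/trains/data/elastic_7
# free space left on the disk
df -h /opt/trains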
Thanks! In this log it mentions that the source elastic 5 has failed during the reindex process. Can you also share the logs from the 'elastic-upgrade' service?
Here is the thread with solving the same issue: https://allegroai-trains.slack.com/archives/CTK20V944/p1596724607016500
Another option that should work for the upgrade script is to pass an environment variable that disables xpack (the feature that requires licensing) for the ES5 docker container. It can be done as follows:
python elastic_upgrade.py --extra-source-env xpack.security.enabled=false
Hi IdealPanda97 , can you share the logs for the 'elastic-upgrade-7' docker container? According to the upgrade log there was some problem with Elasticsearch during indices copy.
Hi RattyFish27 , it seems that there is some issue with the Elasticsearch cluster. Can you please run the following commands on the server and paste their output here? curl -XGET curl -XGET
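For reference, a sketch of the usual health-check commands, assuming ES listens on localhost:9200 (an assumption, adjust to your setup):
# overall cluster status
curl -XGET 'localhost:9200/_cluster/health?pretty'
# per-shard state, shows unassigned shards
curl -XGET 'localhost:9200/_cat/shards?v'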
Setting up an elastic cluster requires some devops. You can search for "setup elasticsearch 7 cluster" on the internet and there are some tutorials there. Stopping your docker-compose once in a certain period of time and backing up the /opt/trains/data folder is more straightforward, and it would also back up the data that we store in mongodb.
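A minimal backup sketch, assuming the default /opt/trains layout (the archive path and name are placeholders):
# stop the stack so the data files are consistent
sudo docker-compose -f /opt/trains/docker-compose.yml down
# archive the whole data folder (Elasticsearch, MongoDB, Redis, fileserver)
sudo tar -czf /backups/trains-data-$(date +%F).tar.gz /opt/trains/data
# bring the stack back up
sudo docker-compose -f /opt/trains/docker-compose.yml up -d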
Do you mean the "search_phase_execution" error? Yes, stopping the containers, deleting the data folder and running the containers again would bring you to a "clean install" state. But then you would lose all your data, not only the task scalar results.
With what memory setting do you run ES? How much memory and CPU are currently occupied by the ES container?
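You can check it with something like (the compose file path is an assumption):
# current memory/CPU usage per running container
sudo docker stats --no-stream
# the heap size configured for ES in the compose file (if set there)
grep -A2 ES_JAVA_OPTS /opt/trains/docker-compose.yml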
SubstantialBaldeagle49 The log looks OK. Where do you see the error?
The tasks themselves will stay until you manage to delete them from the client. Here we tried to see why deleting their data from ES timed out. From what I see no data was actually deleted (most likely because previous delete attempts had already removed the data, even though they caused timeouts in the apiserver). What seems problematic is the amount of time that each operation took (19 and 16 seconds). It may be due to insufficient memory/cpu allocation to the ES container or due to the 50Gb inde...
Hi QuaintJellyfish58 in the latest data that you sent I see only the responses (some of them are marked as payloads but they are actually responses). What would be very interesting is to see the requests (payloads) that resulted in the following empty responses:
# response
{"meta":{"id":"aaaffe49ace64f1a8b0211925afcfd32","trx":"aaaffe49ace64f1a8b0211925afcfd32","endpoint":{"name":"projects.get_all_ex","requested_version":"2.20","actual_version":"1.0"},"result_code":200,"result_subcode":0,...
I am not sure about the reasons. What you can do is to back up your /opt/trains/data folder periodically (preferably stopping the docker compose before it). Another possibility is to configure your Elasticsearch to run as a cluster with 2 or more nodes on the same or different machines. This will allow elastic to replicate your indices to other nodes.
Hi MassiveHippopotamus56
Can you please open the browser developer tools, navigate to the scalars tab for one of the experiments that shows the wrong iteration and copy here the request payload and response for the events.scalar_metrics_iter_histogram call?
Hi Elior, chances are that you do not have enough space for Elasticsearch on your storage. Please check the ES logs and increase the available disk space.
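A quick way to check both, assuming the default data path and container name (adjust to your deployment):
# free space on the volume holding the ES data
df -h /opt/trains/data
# recent Elasticsearch log lines (container name is an assumption, check with 'docker ps')
sudo docker logs --tail 100 trains-elastic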
Hi @<1668065560107159552:profile|VivaciousPenguin20> , what version of the apiserver are you running? Can you please try switching to the latest v1.14.1 version that was released last week. One of the issues fixed was the inability to import events for the published example tasks
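A rough upgrade sketch, assuming a docker-compose deployment under /opt/clearml (the path is an assumption; back up /opt/clearml/data first):
sudo docker-compose -f /opt/clearml/docker-compose.yml down
# update the image tags in docker-compose.yml to the new version, or fetch the latest compose file
sudo docker-compose -f /opt/clearml/docker-compose.yml pull
sudo docker-compose -f /opt/clearml/docker-compose.yml up -d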
Do you see any error in the browser network tab?
SubstantialBaldeagle49 This should collect the logs: 'sudo docker logs trains-apiserver >& apiserver.logs'
@<1585078752969232384:profile|FantasticDuck7> The best would be to copy this file to the host, edit it and map this file into the container instead of the original one. The single file mapping in the docker-compose file should look like this:
volumes:
- type: bind
source: <the path to the config file on the host>
target: /opt/clearml/apiserver/config/default/services/storage_credentials.conf
You should do it for the async_delete service. Not for the apise...
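For example, to get a copy of the file to edit on the host (the container name is an assumption, check it with 'docker ps'):
sudo docker cp clearml-apiserver:/opt/clearml/apiserver/config/default/services/storage_credentials.conf ./storage_credentials.conf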
@<1585078752969232384:profile|FantasticDuck7> What volume mappings do you have for the async_delete service in the docker-compose.yaml file?
The data that you sent looks fine. It seems that you actually have these iterations in Elasticsearch. To check whether it is the case please run the following command in the shell on your host. You should get the first 10 task events with the smallest iterations:
curl -XGET -H "Content-Type: application/json" localhost:9200/events-training_stats_scalar*/_search?pretty -d' { "query": { "term": {"task": "d45ecb5ad7084175bd83dd39777b10c5"} }, "sort": {"iter": "asc"} }'