Thanks, I think that I see the problem,
I mean it is not possible to open v3.6 data in version 4.4; that's why steps 3-10 are there
Hi @<1558986867771183104:profile|ShakyKangaroo32> , can you please share the logs from the async_delete docker container?
Hi CooperativeFox72, how much free space do you have on your disk now? If you run du on your /opt/trains/data/elastic_7 folder at, let's say, 5-minute intervals, do you see the folder size growing?
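If it helps, a minimal way to watch this, assuming a Linux host and the default mount path:
df -h                                                  # free space left on the disk
sudo watch -n 300 du -sh /opt/trains/data/elastic_7    # folder size every 5 minutes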
Here is a thread where the same issue was solved: https://allegroai-trains.slack.com/archives/CTK20V944/p1596724607016500
Another option that should work for the upgrade script is to pass an environment variable that disables xpack (the feature that requires licensing) for the ES5 docker container. It can be done as follows:
python elastic_upgrade.py --extra-source-env xpack.security.enabled=false
At some point we switched from Mongo DB v3.6 to v4.4. Upgrading from old versions requires a migration of the mongo data. Did you run the upgrade script as described below? Were there any errors?
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_mongo44_migration/
This explains the issue, I think. The recovery path would be as follows (see the command sketch after the steps):
1. Put down the running containers
2. Restore both mongo and elastic data from the backup
3. Run the old version docker containers and make sure that all the data is there
4. Put down the containers
5. Run the upgrade script
6. Start the new version
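A rough sketch of the flow in shell commands (the paths, backup locations and compose file names below are assumptions, adjust them to your setup):
cd /opt/clearml
docker-compose down                                  # 1. stop the running containers
sudo rsync -a /backup/mongo/   ./data/mongo_4/       # 2. restore the mongo backup
sudo rsync -a /backup/elastic/ ./data/elastic_7/     #    ...and the elastic backup
docker-compose -f docker-compose.old.yml up -d       # 3. old version up, verify the data is there
docker-compose -f docker-compose.old.yml down        # 4. stop the containers again
# 5. run the upgrade script from the migration guide linked above
docker-compose up -d                                 # 6. start the new version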
Hi Elior, chances are that you do not have enough space for Elasticsearch on your storage. Please check the ES logs and increase the available disk space.
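A minimal sketch for checking both, assuming the default container name and data folder:
sudo docker logs --tail 100 clearml-elastic   # look for disk watermark / no-space-left errors
df -h /opt/clearml/data                       # free space on the partition holding the ES data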
Hi QuaintJellyfish58 , I am investigating the issue. Can you please also send the request and response from projects.get_all when you are in the Team's Work view (the case where there is no undefined project)?
SubstantialBaldeagle49, this should collect the logs: 'sudo docker logs trains-apiserver >& apiserver.logs'
Hi SubstantialElk6, another thing that can be checked is the health of the particular ES indices. Can you please run the below command in the clearml-elastic container and post the results here?
curl -XGET
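In case it helps, a likely form of such a check (the exact endpoint is my assumption, with ES listening on localhost:9200 inside the container):
curl -XGET "http://localhost:9200/_cat/indices?v"   # per-index health, status and size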
It seems that elasticsearch is failing on any search request. Can you please run the following commands and share the results?
curl -XGET
curl -XGET
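Likely candidates for these checks, assuming ES is exposed on localhost:9200 (the endpoints below are my assumption of what was meant):
curl -XGET "http://localhost:9200/_cluster/health?pretty"   # overall cluster status
curl -XGET "http://localhost:9200/_search?size=1&pretty"    # a minimal search request to reproduce the failure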
Hi SoggyBeetle95, from what version of clearml did you upgrade? About the tasks that disappeared: do you not see these tasks at all, or do you see them with no results?
@<1585078752969232384:profile|FantasticDuck7> The best would be to copy this file to the host, edit it and map this file into the container instead of the original one. The single file mapping in the docker-compose file should look like this:
volumes:
  - type: bind
    source: <the path to the config file on the host>
    target: /opt/clearml/apiserver/config/default/services/storage_credentials.conf
You should do it for the async_delete service, not for the apiserver.
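A minimal sketch of how the async_delete service section might look with this mapping (the host path is an assumption, the rest of the service definition is omitted):
  async_delete:
    ...
    volumes:
      - type: bind
        source: /opt/clearml/config/storage_credentials.conf
        target: /opt/clearml/apiserver/config/default/services/storage_credentials.conf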
Ok, so there is no mapping for the whole config folder or for the specific config file that you changed. That's why async_delete does not get your updated configuration. You can do one of the following: either add a mapping here for the specific file, like you did earlier, or map the whole config folder like the apiserver service does:
- /opt/clearml/config:/opt/clearml/config
The second way is probably more flexible
There is a "License expired" message for the Elasticsearch 5. Try running the following command when your old trains docker is up:
http://localhost:9200/_xpack/license/start_basic
Oh, I see :( It turned out that the --extra-source-env option has not been officially released yet. But a script that supports it can be downloaded from here: https://github.com/allegroai/trains-server/files/5080286/upgrade.zip
Are you running your dockers on Linux or Windows?
Ok, it seems that the elasticsearch ports are open for internal communication but not for the host. Can you please add the following section to the elasticsearch service in the docker compose and restart the dockers?
ports:
  - "9200:9200"
After that the commands should work from the host.
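A quick way to verify from the host, assuming the default port binding:
curl -XGET http://localhost:9200   # should return the cluster name and version info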
It seems that the index events-log-d1bd92a3b039400cbafc60a7a5b1e52b got corrupted. If there are no backups, the only choice would be to delete this index from elasticsearch.
The index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b status is red, meaning that the data for this index got corrupted. Since there are no replicas, the only feasible option would be to delete this index. All the training scalar events for the old tasks would be lost, but the newly created tasks should start working fine.
curl -XDELETE
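A likely full form of the delete command, assuming Elasticsearch is exposed on localhost:9200:
curl -XDELETE http://localhost:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b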
Are you running them on the computer that hosts the server docker containers? What is the port binding for elasticsearch in your docker compose?
Can you please run the following in the command line of the hosting server and share the results?
curl -XGET
If you open the browser developer tools and navigate to the console log of one of the tasks for which you no longer get the logs, do you see any errors (red lines) in the API calls? Can you share the payload and response of the events.get_task_log call?
The tasks themselves will stay until you manage to delete them from the client. Here we tried to see why deleting their data from ES timed out. From what I see, no data was actually deleted (most likely because the previous delete attempts actually deleted the data, though they caused a timeout in the apiserver). What seems problematic is the amount of time each operation took (19 and 16 seconds). It may be due to insufficient memory/cpu allocation to the ES container or due to the 50Gb inde...
What memory settings do you run ES with? How much memory and CPU are currently occupied by the ES container?
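A quick way to check, assuming the default container name and compose file location (ES_JAVA_OPTS is where the heap size is usually set):
sudo docker stats --no-stream clearml-elastic        # current memory/CPU usage of the ES container
grep ES_JAVA_OPTS /opt/clearml/docker-compose.yml    # the configured heap size, e.g. -Xms2g -Xmx2g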
Hi QuaintJellyfish58, it seems that we have identified the problem. The undefined project that you see is not a real project. It is a placeholder where the statistics of the ex-1 project should be shown. We found a bug in the apiserver that, under these particular conditions, fails to return the ex-1 project data, so the placeholder remains empty (undefined). If I understand correctly, it should only cause an inconvenience but not influence your workflow. Is that correct? We are fixing the issue in the n...