Hi David,
What version of ClearML server & SDK are you using?
Hi John,
The SDK version is: 1.4.1 (found it using pip list)
Please tell me how to find the ClearML server version.
In the web UI, click the settings icon at the top right -> Settings. On that screen the version should be shown at the bottom right.
WebApp: 1.5.0-192 • Server: 1.5.0-192 • API: 2.18
From what I understand, by default ES has a low disk watermark set at 95% of the disk capacity. Once it is reached, the shards are transitioned to read-only mode. Since you have a large disk of 1.8TB, the remaining 85GB falls below that 5%.
Basically you need to set the following env vars in the elasticsearch service in the docker compose:
- cluster.routing.allocation.disk.watermark.low=10gb
- cluster.routing.allocation.disk.watermark.high=10gb
- cluster.routing.allocation.disk.watermark.flood_stage=10gb
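For reference, a sketch of how this could look in the elasticsearch service of the docker-compose file; the service name and layout are assumed to follow the default ClearML compose file, using the same key: value form as the existing block quoted further down in this thread:

services:
  elasticsearch:
    environment:
      cluster.routing.allocation.disk.watermark.low: 10gb
      cluster.routing.allocation.disk.watermark.high: 10gb
      cluster.routing.allocation.disk.watermark.flood_stage: 10gb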
I’m trying to find the compose file,
Any suggestions?
The docker-compose.yml file you used to set up the server
I’m not sure where it is,
I tried to run the following:
docker container inspect clearml-elastic | grep compose
"com.docker.compose.config-hash": "5563c3436de5f0c5e3349da7f0511ea7a8020ce4ce5dc22d161f0561e973fecd",
"com.docker.compose.container-number": "1",
"com.docker.compose.oneoff": "False",
"com.docker.compose.project": "clearml",
"com.docker.compose.service": "elasticsearch",
"com.docker.compose.version": "1.24.1",
Should the compose file be inside the container?
Not sure - I’m not the one who installed it.
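A side note: the compose file lives on the host rather than inside the container; on a default ClearML server install it is often at /opt/clearml/docker-compose.yml. If it is not there, a blunt but simple way to locate it from the host shell could be:

# search the host filesystem for compose files (may take a while)
sudo find / -name docker-compose.yml 2>/dev/null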
But the values are smaller than 10gb (they are 500mb):
cluster.routing.allocation.disk.watermark.low: 500mb
cluster.routing.allocation.disk.watermark.high: 500mb
cluster.routing.allocation.disk.watermark.flood_stage: 500mb
What happens if you use the settings I pasted?
I tried it (including docker down and up)
It is currently giving the same errors:
2022-08-15 16:15:19,676 - clearml.Task - ERROR - Action failed <500/100: tasks.reset/v1.0 (General data error (TransportError(503, 'search_phase_execution_exception')))> (force=False, clear_all=False, task=f535ef2c45cf4baaaf13c8f2fe2ac19a)
2022-08-15 16:16:19,845 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '7ddf4ca9846040dabc5814b83d0935a6', 'status': 503, 'error': {'type':..., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][_doc][7ddf4ca9846040dabc5814b83d0935a6], source[_na_]}] and a refresh])>)
Hi RattyFish27, it seems that there is some issue with the Elasticsearch cluster. Can you please run the following commands on the server and paste their output here?
curl -XGET
curl -XGET
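The command URLs appear to have been stripped from the transcript; judging from the outputs pasted further below, they were presumably the Elasticsearch cat APIs, along the lines of:

# list indices (with a header row) and list shard allocation
curl -XGET "localhost:9200/_cat/indices?v"
curl -XGET "localhost:9200/_cat/shards"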
Sure I will do it tomorrow
Thanks for the help in the meantime
Hi Evgeny,
I tried to run the curl command, and it gave me the following error:
curl: (7) Failed to connect to localhost port 9200: Connection refused
Are you running them on the computer that hosts the server docker containers? What is the port binding for elasticsearch in your docker compose?
Yes, I run the command from the server that hosts the containers.
In the environment section I have:
CLEARML_ELASTIC_SERVICE_PORT: 9200
In the apiserver section I have the following:
ports:
  - "8008:8008"
In the fileserver section I have:
ports:
  - "8081:8081"
Ok, it seems that the elasticsearch ports are open for internal communication but not for the host. Can you please add the following section to the elasticsearch service in the docker compose and restart the dockers?
ports:
  - "9200:9200"
After that the commands should work from the host.
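A quick way to verify that Elasticsearch is now reachable from the host, assuming the default 9200 binding, would be something like:

# should return a JSON document with the cluster status (green/yellow/red)
curl -XGET "localhost:9200/_cluster/health?pretty"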
Done,
I got the following outputs:
health status index                                                         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .geoip_databases                                              Mshu2LugQ2aQYmFTB8Lckg 1   0   44         74           76.8mb     76.8mb
green  open   events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 7qIRfRjNRke6GuHZzRKsuw 1   0   71382012   565576       8.3gb      8.3gb
red    open   events-log-d1bd92a3b039400cbafc60a7a5b1e52b                   QamfEch8RTeSDibf25iIOw 1   0
green  open   events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b  ULSBPv_jRUqJfLhBWUonIA 1   0   5998       0            1.4mb      1.4mb
index                                                          shard prirep state      docs     store   ip           node
.ds-ilm-history-5-2022.07.13-000002                            0     p      STARTED                     192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.08.08-000005  0     p      STARTED                     192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.07.11-000003  0     p      STARTED                     192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.07.25-000004  0     p      STARTED                     192.168.64.5 clearml
.geoip_databases                                               0     p      STARTED    41       109.7mb 192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.06.27-000002  0     p      STARTED                     192.168.64.5 clearml
events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b   0     p      STARTED    5998     1.4mb   192.168.64.5 clearml
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b  0     p      STARTED    71382012 8.3gb   192.168.64.5 clearml
events-log-d1bd92a3b039400cbafc60a7a5b1e52b                    0     p      UNASSIGNED
.ds-ilm-history-5-2022.06.13-000001                            0     p      STARTED                     192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.06.13-000001  0     p      STARTED                     192.168.64.5 clearml
.ds-ilm-history-5-2022.08.12-000003                            0     p      STARTED                     192.168.64.5 clearml
It seems that the index events-log-d1bd92a3b039400cbafc60a7a5b1e52b got corrupted. If there are no backups, the only choice would be to delete this index from Elasticsearch.
Which means the experiments will be deleted, right?
And how should I delete the index?
Actually, only the task logs will be lost. The tasks themselves and their reported metrics and plots would stay. The command is the following:
curl -XDELETE localhost:9200/events-log-d1bd92a3b039400cbafc60a7a5b1e52b
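As a sanity check after the delete, re-listing the indices should show the red index gone and the cluster back to green (the index should be recreated automatically once new log events are reported):

curl -XGET "localhost:9200/_cat/indices?v"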
Thanks - it works :)
It has happened for the second time in the last couple of months.
Do you have any suggestions regarding why it's happening and how to make sure it won't happen again?
👍 I would say either deploying an Elasticsearch cluster consisting of several nodes with replication, or doing daily backups:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/snapshot-restore.html
Apart from that, it is worth making sure that ES is running in a stable environment (no abrupt restarts) and with enough RAM.
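Following the snapshot-restore guide linked above, a minimal sketch of a filesystem snapshot setup; the repository name and path here are placeholders, and the location must also be whitelisted via path.repo in the Elasticsearch configuration:

# register a filesystem snapshot repository
curl -XPUT "localhost:9200/_snapshot/clearml_backup" -H 'Content-Type: application/json' -d '{"type": "fs", "settings": {"location": "/mnt/es_backups"}}'

# take a snapshot, e.g. from a daily cron job
curl -XPUT "localhost:9200/_snapshot/clearml_backup/snapshot_1?wait_for_completion=true"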