
Hi all,
I’m running experiments using ClearML.
The training is very slow, and I’m getting the following errors and warnings:
clearml.Task - ERROR - Action failed <500/100: tasks.reset/v1.0 (General data error (TransportError(503, 'search_phase_execution_exception')))> (force=False, clear_all=False, task=ff5a7549a47a4e9392ef6d6c8
2022-08-15 14:17:09,713 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '448b9e2c82a3435fa22cd75226a579b5', 'status': 503, 'error': {'type':..., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][_doc][448b9e2c82a3435fa22cd75226a579b5], source[n/a, actual length: [5.9kb], max length: 2kb]}] and a refresh])>)
I checked the disk space and I have free space of 83 GB out of 1.8 TB.

Any suggestions?
Thanks

  
  
Posted 2 years ago

Answers 27


Hi David,

What version of ClearML server & SDK are you using?

  
  
Posted 2 years ago

Hi John,
The SDK version is: 1.4.1 (found it using pip list)
Please tell me how to find the ClearML server version.

  
  
Posted 2 years ago

In the web UI, you can click the settings icon at the top right -> Settings. On that screen the version should be shown at the bottom right.

  
  
Posted 2 years ago

WebApp: 1.5.0-192 • Server: 1.5.0-192 • API: 2.18

  
  
Posted 2 years ago

From what I understand, by default ES has a flood-stage disk watermark set at 95% of the disk capacity. Once it is reached, the shards are transitioned to read-only mode. Since you have a large 1.8 TB disk, the remaining ~83 GB is below that 5% margin.

Basically you need to set the following environment variables in the elasticsearch service in the docker compose:
- cluster.routing.allocation.disk.watermark.low=10gb
- cluster.routing.allocation.disk.watermark.high=10gb
- cluster.routing.allocation.disk.watermark.flood_stage=10gb
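For reference, here is roughly how that could look inside the elasticsearch service of the docker-compose file, assuming the environment block uses the key: value mapping form (as in the snippets quoted later in this thread); everything except the three watermark lines stands in for the existing definition:

elasticsearch:
  environment:
    # ... keep the existing entries as they are ...
    cluster.routing.allocation.disk.watermark.low: 10gb
    cluster.routing.allocation.disk.watermark.high: 10gb
    cluster.routing.allocation.disk.watermark.flood_stage: 10gb

After editing, restart the dockers (docker-compose down and up) so the new settings take effect.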

  
  
Posted 2 years ago

I’m trying to find the compose file.
Any suggestions?

  
  
Posted 2 years ago

The docker-compose.yml file you used to set up the server.

  
  
Posted 2 years ago

I’m not sure where it is.
I tried to run the following:

docker container inspect clearml-elastic | grep compose
    "com.docker.compose.config-hash": "5563c3436de5f0c5e3349da7f0511ea7a8020ce4ce5dc22d161f0561e973fecd",
    "com.docker.compose.container-number": "1",
    "com.docker.compose.oneoff": "False",
    "com.docker.compose.project": "clearml",
    "com.docker.compose.service": "elasticsearch",
    "com.docker.compose.version": "1.24.1",

Should the compose file be inside the container?
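(Note: the compose file normally lives on the host, not inside the container. If the server was set up following the standard ClearML server install instructions it is usually at /opt/clearml/docker-compose.yml, though that path is only an assumption about this particular setup; something like the following can confirm it:

ls /opt/clearml/docker-compose.yml
sudo find / -name docker-compose.yml 2>/dev/null
)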

  
  
Posted 2 years ago

How did you install your ClearML server?

  
  
Posted 2 years ago

Not sure - I’m not the one who installed it.

  
  
Posted 2 years ago

I found it

  
  
Posted 2 years ago

But the existing values are smaller than 10gb (they are 500mb):
cluster.routing.allocation.disk.watermark.low: 500mb
cluster.routing.allocation.disk.watermark.high: 500mb
cluster.routing.allocation.disk.watermark.flood_stage: 500mb

  
  
Posted 2 years ago

What happens if you use the settings I pasted?

  
  
Posted 2 years ago

I tried it (including taking the dockers down and up).
It is currently giving the same errors:

2022-08-15 16:15:19,676 - clearml.Task - ERROR - Action failed <500/100: tasks.reset/v1.0 (General data error (TransportError(503, 'search_phase_execution_exception')))> (force=False, clear_all=False, task=f535ef2c45cf4baaaf13c8f2fe2ac19a)
2022-08-15 16:16:19,845 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039 400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '7ddf4ca9846040dabc5814b83d0935a6', 'status': 503, 'error': {'type':..., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], reques t: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][_doc][7ddf4ca9846040dabc5814b83d0935a6], source[_na_]}] and a refresh])>)

  
  
Posted 2 years ago

Hi RattyFish27, it seems that there is some issue with the Elasticsearch cluster. Can you please run the following commands on the server and paste their output here?
curl -XGET localhost:9200/_cat/indices?v
curl -XGET localhost:9200/_cat/shards?v

  
  
Posted 2 years ago

Sure, I will do it tomorrow.
Thanks for the help in the meantime.

  
  
Posted 2 years ago

Hi Evgeny,
I tried to run the curl command, and it gave me the following error:
curl: (7) Failed to connect to localhost port 9200: Connection refused

  
  
Posted 2 years ago

Are you running them on the computer that hosts the server docker containers? What is the port binding for elasticsearch in your docker compose?

  
  
Posted 2 years ago

Yes, I run the command from the server that hosts the container.
In the environment section I have:
CLEARML_ELASTIC_SERVICE_PORT: 9200
In the apiserver section I have:
ports:
  - "8008:8008"
In the fileserver section I have:
ports:
  - "8081:8081"

  
  
Posted 2 years ago

Ok, it seems that the elasticsearch ports are open for internal communication but not for the host. Can you please add the following section to the elasticsearch service in the docker compose and restart the dockers?
ports:
  - "9200:9200"
After that the commands should work from the host.

  
  
Posted 2 years ago

Done,
I got the following outputs:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .geoip_databases Mshu2LugQ2aQYmFTB8Lckg 1 0 44 74 76.8mb 76.8mb
green open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 7qIRfRjNRke6GuHZzRKsuw 1 0 71382012 565576 8.3gb 8.3gb
red open events-log-d1bd92a3b039400cbafc60a7a5b1e52b QamfEch8RTeSDibf25iIOw 1 0
green open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b ULSBPv_jRUqJfLhBWUonIA 1 0 5998 0 1.4mb 1.4mb

index shard prirep state docs store ip node
.ds-ilm-history-5-2022.07.13-000002 0 p STARTED 192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.08.08-000005 0 p STARTED 192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.07.11-000003 0 p STARTED 192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.07.25-000004 0 p STARTED 192.168.64.5 clearml
.geoip_databases 0 p STARTED 41 109.7mb 192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.06.27-000002 0 p STARTED 192.168.64.5 clearml
events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 5998 1.4mb 192.168.64.5 clearml
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 71382012 8.3gb 192.168.64.5 clearml
events-log-d1bd92a3b039400cbafc60a7a5b1e52b 0 p UNASSIGNED
.ds-ilm-history-5-2022.06.13-000001 0 p STARTED 192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.06.13-000001 0 p STARTED 192.168.64.5 clearml
.ds-ilm-history-5-2022.08.12-000003 0 p STARTED 192.168.64.5 clearml

  
  
Posted 2 years ago

It seems that the index events-log-d1bd92a3b039400cbafc60a7a5b1e52b got corrupted. In case there are no backups, the only choice would be to delete this index from elasticsearch.

  
  
Posted 2 years ago

Which means the experiments will be deleted, right?
And how should I delete the index?

  
  
Posted 2 years ago

Actually, only the task logs will be lost. The tasks themselves and their reported metrics and plots will stay. The command is the following:
curl -XDELETE localhost:9200/events-log-d1bd92a3b039400cbafc60a7a5b1e52b
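A quick way to verify the deletion (a standard Elasticsearch endpoint, nothing ClearML-specific assumed):

curl -XGET 'localhost:9200/_cat/indices/events-log-*?v'

The corrupted index should no longer appear in the output.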

  
  
Posted 2 years ago

Thanks - it works :)
This has happened for the second time in the last couple of months.
Do you have any suggestions regarding why it’s happening and how to make sure it won’t happen again?

  
  
Posted 2 years ago

👍 I would say either deploying an elasticsearch cluster consisting of several nodes with replication, or doing daily backups:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/snapshot-restore.html
Apart from that, it is worth making sure that ES is running in a stable environment (no abrupt restarts) and with enough RAM.
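For the backup route, a rough sketch using the standard Elasticsearch snapshot API; the repository name, snapshot name, and the /mnt/es_backup location are made-up examples, and the location must be listed under path.repo in the elasticsearch configuration and mounted into the container:

# register a filesystem snapshot repository (one-time setup)
curl -XPUT 'localhost:9200/_snapshot/clearml_backup' -H 'Content-Type: application/json' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es_backup" }
}'

# take a snapshot (could be run daily, e.g. from cron, with a dated name)
curl -XPUT 'localhost:9200/_snapshot/clearml_backup/snapshot_1?wait_for_completion=true'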

  
  
Posted 2 years ago

Ok Thanks

  
  
Posted 2 years ago