Hi David,
What version of ClearML server & SDK are you using?
Hi John,
The SDK version is: 1.4.1 (found it using pip list)
Please tell me how to find the ClearML server version.
In the web UI, click the settings icon at the top right -> Settings. On that screen the version should be shown at the bottom right.
WebApp: 1.5.0-192 • Server: 1.5.0-192 • API: 2.18
From what I understand, by default ES has a low disk watermark set at 95% of the disk capacity. Once it is reached, the shards are transitioned to read-only mode. Since you have a large disk of 1.8TB, the remaining 85GB falls below that 5%.
Basically you need to set the following env vars in the elasticsearch service in the docker compose:
- cluster.routing.allocation.disk.watermark.low=10gb
- cluster.routing.allocation.disk.watermark.high=10gb
- cluster.routing.allocation.disk.watermark.flood_stage=10gb
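For reference, a sketch of how this could look in the elasticsearch service of the docker-compose file; the service name and layout are assumed to follow the default ClearML compose file, using the same key: value form as the existing block quoted further down in this thread:

services:
  elasticsearch:
    environment:
      cluster.routing.allocation.disk.watermark.low: 10gb
      cluster.routing.allocation.disk.watermark.high: 10gb
      cluster.routing.allocation.disk.watermark.flood_stage: 10gb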
I’m trying to find the compose file,
Any suggestions?
The docker-compose.yml file you used to set up the server
I’m not sure where it is,
I tried to run the following:
docker container inspect clearml-elastic | grep compose
"com.docker.compose.config-hash": "5563c3436de5f0c5e3349da7f0511ea7a8020ce4ce5dc22d161f0561e973fecd",
"com.docker.compose.container-number": "1",
"com.docker.compose.oneoff": "False",
"com.docker.compose.project": "clearml",
"com.docker.compose.service": "elasticsearch",
"com.docker.compose.version": "1.24.1",
Should the compose file be inside the container?
Not sure - I’m not the one who installed it.
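A side note: the compose file lives on the host rather than inside the container; on a default ClearML server install it is often at /opt/clearml/docker-compose.yml. If it is not there, a blunt but simple way to locate it from the host shell could be:

# search the host filesystem for compose files (may take a while)
sudo find / -name docker-compose.yml 2>/dev/null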
But the values are smaller than 10gb (they are 500mb):
cluster.routing.allocation.disk.watermark.low: 500mb
cluster.routing.allocation.disk.watermark.high: 500mb
cluster.routing.allocation.disk.watermark.flood_stage: 500mb
What happens if you use the settings I pasted?
I tried it (including docker down and up)
It is currently giving the same errors:
2022-08-15 16:15:19,676 - clearml.Task - ERROR - Action failed <500/100: tasks.reset/v1.0 (General data error (TransportError(503, 'search_phase_execution_exception')))> (force=False, clear_all=False, task=f535ef2c45cf4baaaf13c8f2fe2ac19a)
2022-08-15 16:16:19,845 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '7ddf4ca9846040dabc5814b83d0935a6', 'status': 503, 'error': {'type':..., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][_doc][7ddf4ca9846040dabc5814b83d0935a6], source[_na_]}] and a refresh])>)
Hi RattyFish27, it seems that there is some issue with the Elasticsearch cluster. Can you please run the following commands on the server and paste their output here?
curl -XGET
curl -XGET
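The command URLs appear to have been stripped from the transcript; judging from the outputs pasted further below, they were presumably the Elasticsearch cat APIs, along the lines of:

# list indices (with a header row) and list shard allocation
curl -XGET "localhost:9200/_cat/indices?v"
curl -XGET "localhost:9200/_cat/shards"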
Sure I will do it tomorrow
Thanks for the help in the meantime
Hi Evgeny,
I tried to run the curl command, and it gave me the following error:
curl: (7) Failed to connect to localhost port 9200: Connection refused
Are you running them on the computer that hosts the server docker containers? What is the port binding for elasticsearch in your docker compose?
Yes, I run the command from the server that hosts the containers.
In the environment section I have:
CLEARML_ELASTIC_SERVICE_PORT: 9200
In the apiserver section I have the following:
ports:
  - "8008:8008"
In the fileserver section I have:
ports:
  - "8081:8081"
Ok, it seems that the elasticsearch ports are open for internal communication but not for the host. Can you please add the following section to the elasticsearch service in the docker compose and restart the dockers?
ports:
  - "9200:9200"
After that the commands should work from the host.
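A quick way to verify that Elasticsearch is now reachable from the host, assuming the default 9200 binding, would be something like:

# should return a JSON document with the cluster status (green/yellow/red)
curl -XGET "localhost:9200/_cluster/health?pretty"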
Done,
I got the following outputs:
health status index                                                         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .geoip_databases                                              Mshu2LugQ2aQYmFTB8Lckg 1   0   44         74           76.8mb     76.8mb
green  open   events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 7qIRfRjNRke6GuHZzRKsuw 1   0   71382012   565576       8.3gb      8.3gb
red    open   events-log-d1bd92a3b039400cbafc60a7a5b1e52b                   QamfEch8RTeSDibf25iIOw 1   0
green  open   events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b  ULSBPv_jRUqJfLhBWUonIA 1   0   5998       0            1.4mb      1.4mb
index                                                          shard prirep state      docs     store   ip           node
.ds-ilm-history-5-2022.07.13-000002                            0     p      STARTED                     192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.08.08-000005  0     p      STARTED                     192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.07.11-000003  0     p      STARTED                     192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.07.25-000004  0     p      STARTED                     192.168.64.5 clearml
.geoip_databases                                               0     p      STARTED    41       109.7mb 192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.06.27-000002  0     p      STARTED                     192.168.64.5 clearml
events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b   0     p      STARTED    5998     1.4mb   192.168.64.5 clearml
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b  0     p      STARTED    71382012 8.3gb   192.168.64.5 clearml
events-log-d1bd92a3b039400cbafc60a7a5b1e52b                    0     p      UNASSIGNED
.ds-ilm-history-5-2022.06.13-000001                            0     p      STARTED                     192.168.64.5 clearml
.ds-.logs-deprecation.elasticsearch-default-2022.06.13-000001  0     p      STARTED                     192.168.64.5 clearml
.ds-ilm-history-5-2022.08.12-000003                            0     p      STARTED                     192.168.64.5 clearml
It seems that the index events-log-d1bd92a3b039400cbafc60a7a5b1e52b got corrupted. If there are no backups, the only choice would be to delete this index from Elasticsearch.
Which means the experiments will be deleted, right?
And how should I delete the index?
Actually, only the task logs will be lost. The tasks themselves and their reported metrics and plots would stay. The command is the following:
curl -XDELETE localhost:9200/events-log-d1bd92a3b039400cbafc60a7a5b1e52b
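As a sanity check after the delete, re-listing the indices should show the red index gone and the cluster back to green (the index should be recreated automatically once new log events are reported):

curl -XGET "localhost:9200/_cat/indices?v"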
Thanks - it works :)
It has happened for the second time in the last couple of months.
Do you have any suggestions regarding why it's happening and how to make sure it won't happen again?
👍 I would say either deploying an Elasticsearch cluster consisting of several nodes with replication, or doing daily backups:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/snapshot-restore.html
Apart from that, it is worth making sure that ES is running in a stable environment (no abrupt restarts) and with enough RAM.
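Following the snapshot-restore guide linked above, a minimal sketch of a filesystem snapshot setup; the repository name and path here are placeholders, and the location must also be whitelisted via path.repo in the Elasticsearch configuration:

# register a filesystem snapshot repository
curl -XPUT "localhost:9200/_snapshot/clearml_backup" -H 'Content-Type: application/json' -d '{"type": "fs", "settings": {"location": "/mnt/es_backups"}}'

# take a snapshot, e.g. from a daily cron job
curl -XPUT "localhost:9200/_snapshot/clearml_backup/snapshot_1?wait_for_completion=true"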