Anyone Faced An Issue With Elasticsearch Before

Answered

Has anyone faced this issue with Elasticsearch before?

h8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:41:21,688 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>)

8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:48:20,538 - urllib3.connectionpool - WARNING - Retrying (Retry(total=234, connect=234, read=240, redirect=240, status=240)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x794c188c8460>, 'Connection to "" timed out. (connect timeout=10.0)')': /v2.23/events.add_batch

I'm self-hosting ClearML Server on EC2 (t3.large).

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

Answers 51

Port 9200 is probably not mapped from the ES container in the Docker Compose file.
The easiest approach is to run "sudo docker exec -it clearml-elastic /bin/bash" and then run the curl command from inside the ES container.
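
A minimal sketch of that suggestion as a one-step helper, assuming the container is named clearml-elastic (as in the command above) and that curl is available inside the ES image. The function name es_in_container and the DOCKER_CMD override are hypothetical, added only for illustration.

```shell
# Run an Elasticsearch REST call from inside the ES container, so the result
# does not depend on port 9200 being mapped to the host.
# DOCKER_CMD is a hypothetical override; it defaults to "sudo docker".
es_in_container() {
  # $1 is the REST path, e.g. "_cluster/allocation/explain?pretty"
  ${DOCKER_CMD:-sudo docker} exec clearml-elastic curl -sS "localhost:9200/$1"
}

# Example (not run automatically):
#   es_in_container "_cluster/allocation/explain?pretty"
```

If this works while the same curl fails on the host, the port mapping is the likely culprit rather than Elasticsearch itself.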

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

On the EC2 instance, after I SSH-ed into it.

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

On what host did you run the curl command?

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

I do not see any issues in the log. Do you still get errors in the task due to the failure in events.add_batch?

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
curl: (7) Failed to connect to localhost port 9200 after 0 ms: Couldn't connect to server

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

@ResponsiveKoala38 I'm looking at the logs now (used "docker logs clearml-elastic").

The status seems to have transitioned, but the error itself isn't clear.

{"@timestamp":"2025-05-20T08:36:18.412Z", "log.level": "INFO", "message":"setting file [/usr/share/elasticsearch/config/operator/settings.json] not found, initializing [file_settings] as empty", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.reservedstate.service.FileSettingsService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.068Z", "log.level": "INFO", "message":"Node [{clearml}{wEMvgjW3SUSt8Y8ls7aEyw}] is selected as the current health node.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][management][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.360Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[events-plot-][0]]]).","previous.health":"RED","reason":"shards started [[events-plot-][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.250Z", "log.level": "INFO", "message":"[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05] creating index, cause [auto(bulk api)], templates [queue_metrics], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.489Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]]).","previous.health":"YELLOW","reason":"shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.852Z", "log.level": "INFO", "message":"[events-log-d1bd92a3b039400cbafc60a7a5b1e52b] creating index, cause [auto(bulk api)], templates [events_log], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.964Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).","previous.health":"YELLOW","reason":"shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
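
The excerpt above contains only INFO entries, so one way to look for the actual problem is to filter the container log by level. A small sketch, assuming the JSON log layout shown above ("log.level" followed by the level); es_log_problems is a hypothetical helper name.

```shell
# Keep only WARN and ERROR entries from Elasticsearch JSON logs read on stdin.
es_log_problems() {
  grep -E '"log\.level": *"(WARN|ERROR)"'
}

# Example (not run automatically; docker logs writes to both stdout and stderr):
#   sudo docker logs clearml-elastic 2>&1 | es_log_problems
```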

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

I tried deleting all the underlying resources (EC2 and EBS) and recreating them.

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

It's behaving very strangely.

I'm trying to provision the instance, but something is off.
It's as if some functionalities are missing.

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

OK, I'm recreating the EC2 instance to generate an SSH key pair, then I'll check the Elasticsearch logs.

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

While on the host, you can run some ES commands to check shard health and allocation. For example:

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"

It may give more clues about the problem.
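
A rough sketch of a few related checks that can be run together, assuming Elasticsearch answers at localhost:9200 from wherever the commands run. The function name es_shard_report and the ES_URL override are hypothetical; the endpoints are standard Elasticsearch REST APIs.

```shell
# Print cluster health, per-shard state, and the allocation explanation.
# ES_URL is a hypothetical override; it defaults to the address used above.
es_shard_report() {
  base="${ES_URL:-localhost:9200}"
  # Overall status (green / yellow / red) and the number of unassigned shards
  curl -sS "$base/_cluster/health?pretty"
  # One row per shard, including the reason a shard is unassigned (if any)
  curl -sS "$base/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
  # Why a shard is unassigned; may return an error when every shard is assigned
  curl -sS "$base/_cluster/allocation/explain?pretty"
}

# Example (not run automatically):
#   es_shard_report
```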

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

Then possibly there is another reason. You will need to look for it in the ES logs.

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

I don't store anything on the ClearML server; everything is stored in S3 and referenced by ClearML.

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

It depends on your usage. ES has default disk watermarks that kick in when used space on the storage exceeds 85% and 90% (both can be overridden). At some point it may switch the index to a "read-only" state.
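
A small sketch for comparing a volume's usage against those thresholds. The 85% and 90% values come from the answer above; the 95% flood stage (the point at which ES marks indices read-only) is the Elasticsearch default as far as I know and is not stated in this thread. The function names and the example data path are assumptions for illustration.

```shell
# Percentage of used space on the filesystem holding the given path.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Map a used-space percentage to the watermark it has reached.
# 85 and 90 come from the answer above; 95 is assumed to be the default
# flood stage, at which ES marks indices read-only.
classify_watermark() {
  if   [ "$1" -ge 95 ]; then echo "flood_stage"
  elif [ "$1" -ge 90 ]; then echo "high"
  elif [ "$1" -ge 85 ]; then echo "low"
  else echo "ok"
  fi
}

# Example (the data path is an assumption; check the volume in docker-compose):
#   classify_watermark "$(disk_used_pct /opt/clearml/data/elastic_7)"
```

Anything other than "ok" suggests freeing space or enlarging the volume before looking for other causes.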

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

OK. Currently the EBS volume is 15 GB; is there a recommended size?

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

One of the most likely causes of this issue is insufficient free disk space for Elasticsearch. This may happen if less than 10% of free space is left on the ES storage location, but there may be other causes as well.

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

Yes

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

I need to SSH into the instance, right?
I'll check it out.

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

Hi @PerplexedShells66, please inspect your Elasticsearch logs. Any errors or warnings there?

  				
Posted 
	3 months ago

					ResponsiveKoala38
				
					0

@CostlyOstrich36
I've updated the instance type to t3a.large.
The issue persisted.

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

That's the entire error, repeating.
It happens at the end of the script.

I'm using the recommended instance (t3.large).

  				
Posted 
	3 months ago

					PerplexedShells66
				
					0
					 × 1

Looks like Elasticsearch is failing to access a shard. Do you have visibility into machine utilization? How much RAM is Elasticsearch consuming?

Also, is this the entire error repeating or is there more context?
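
One possible way to gather those numbers on the host, sketched under the assumptions that the ES container is named clearml-elastic (as elsewhere in this thread) and that port 9200 is reachable from where the commands run. es_memory_report and DOCKER_CMD are hypothetical names.

```shell
# Collect host, container, and JVM memory figures in one pass.
# DOCKER_CMD is a hypothetical override; it defaults to "sudo docker".
es_memory_report() {
  # Host-level memory and swap usage
  free -h
  # One-shot CPU / memory snapshot for the ES container
  ${DOCKER_CMD:-sudo docker} stats --no-stream clearml-elastic
  # JVM heap and OS RAM usage as reported by Elasticsearch itself
  curl -sS "localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent"
}

# Example (not run automatically):
#   es_memory_report
```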

  				
Posted 
	3 months ago

					CostlyOstrich36
				
					0

13K Views
