Anyone Faced An Issue With Elasticsearch Before

Answered

Has anyone faced this issue with Elasticsearch before?

h8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:41:21,688 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>)

8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:48:20,538 - urllib3.connectionpool - WARNING - Retrying (Retry(total=234, connect=234, read=240, redirect=240, status=240)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x794c188c8460>, 'Connection to "" timed out. (connect timeout=10.0)')': /v2.23/events.add_batch

I'm self-hosting ClearML Server on EC2 (t3.large).

  				
Posted 3 months ago by PerplexedShells66


Answers 51

Looks like elastic is failing to access a shard. Do you have visibility into machine utilization? How much RAM is elastic consuming?

Also, is this the entire error repeating or is there more context?

  				
Posted 3 months ago by CostlyOstrich36

It's the entire error repeating.
And, this happens at the end of the script.

I'm using the recommended instance (t3.large).

  				
Posted 3 months ago by PerplexedShells66

@CostlyOstrich36
I've updated the instance type to t3a.large.
The issue persisted.

  				
Posted 3 months ago by PerplexedShells66

Hi @PerplexedShells66, please inspect your Elasticsearch logs. Any errors or warnings there?

  				
Posted 3 months ago by ResponsiveKoala38

I need to ssh the instance, right?
I'll check it out.

  				
Posted 3 months ago by PerplexedShells66

Yes

  				
Posted 3 months ago by ResponsiveKoala38

One of the most likely reasons for this issue is insufficient free disk space for Elasticsearch. This may happen if less than 10% of free space is left on the ES storage location. But there may also be other reasons.

  				
Posted 3 months ago by ResponsiveKoala38

OK. The EBS volume is currently 15 GB; is there a recommended size?

  				
Posted 3 months ago by PerplexedShells66

It depends on your usage. ES has default disk watermarks that are activated when the used space goes above 85% (low) and 90% (high) of the storage; both can be overridden. Past the flood-stage watermark (95% by default) it may switch the index to a "readonly" state.

  				
Posted 3 months ago by ResponsiveKoala38
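The watermark thresholds described in the post above can be compared against the actual volume with a few lines of Python. This is only a minimal sketch, assuming the Elasticsearch defaults of 85% (low), 90% (high) and 95% (flood stage); the names `watermark_state` and `check_path` are hypothetical, and `shutil.disk_usage` only approximates the way Elasticsearch itself measures disk usage.

```python
import shutil

# Elasticsearch default disk watermarks, as fractions of used space.
# They can be overridden in the cluster settings, so treat them as assumptions.
LOW, HIGH, FLOOD = 0.85, 0.90, 0.95


def watermark_state(used_bytes: int, total_bytes: int) -> str:
    """Return which default watermark (if any) the given disk usage has reached."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= FLOOD:
        return "flood_stage"  # indices on this node get a read-only block
    if ratio >= HIGH:
        return "high"  # ES tries to relocate shards away from this node
    if ratio >= LOW:
        return "low"  # ES avoids allocating further shards to this node
    return "ok"


def check_path(path: str) -> str:
    """Check the filesystem that holds `path`, e.g. the ES data directory."""
    usage = shutil.disk_usage(path)
    return watermark_state(usage.used, usage.total)
```

For example, on a 15 GB volume with 14 GB used, `watermark_state(14, 15)` returns `"high"`. Calling `check_path(...)` on the directory that backs the Elasticsearch data volume gives the same reading for the live disk.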

I don't store anything on clearml server; everything is being stored in S3 and referenced by ClearML.

  				
Posted 3 months ago by PerplexedShells66

Then it is possibly another reason. You need to look for it in the ES logs.

  				
Posted 3 months ago by ResponsiveKoala38

While on the host you can run some ES commands to check the shards health and allocations. For example this:

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"

It may give more clues to the problem

  				
Posted 3 months ago by ResponsiveKoala38

OK, I'm recreating the EC2 instance to generate an SSH key pair, then I'll check the Elasticsearch logs.

  				
Posted 3 months ago by PerplexedShells66

it's behaving very strangely.

I'm trying to provision the instance, but something is off.
It's as if some functionalities are missing.

  				
Posted 3 months ago by PerplexedShells66

I tried deleting all the underlying resources: ec2 & ebs, and recreating it again.

  				
Posted 3 months ago by PerplexedShells66

@ResponsiveKoala38 I'm looking at the logs now (used "docker logs clearml-elastic").

The status seems to have transitioned, but the error isn't clear.

{"@timestamp":"2025-05-20T08:36:18.412Z", "log.level": "INFO", "message":"setting file [/usr/share/elasticsearch/config/operator/settings.json] not found, initializing [file_settings] as empty", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.reservedstate.service.FileSettingsService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.068Z", "log.level": "INFO", "message":"Node [{clearml}{wEMvgjW3SUSt8Y8ls7aEyw}] is selected as the current health node.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][management][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.360Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[events-plot-][0]]]).","previous.health":"RED","reason":"shards started [[events-plot-][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.250Z", "log.level": "INFO", "message":"[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05] creating index, cause [auto(bulk api)], templates [queue_metrics], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.489Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]]).","previous.health":"YELLOW","reason":"shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.852Z", "log.level": "INFO", "message":"[events-log-d1bd92a3b039400cbafc60a7a5b1e52b] creating index, cause [auto(bulk api)], templates [events_log], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.964Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).","previous.health":"YELLOW","reason":"shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}

  				
Posted 3 months ago by PerplexedShells66
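Log records like the ones above are hard to scan by eye, so it can help to pull out only the cluster health transitions. The following is a rough sketch, assuming each record is a single-line JSON object with the `current.health`, `previous.health` and `reason` keys seen in the excerpt; the function name `health_transitions` is hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def health_transitions(lines: Iterable[str]) -> List[Tuple[str, str, str, str]]:
    """Return (timestamp, previous, current, reason) for each health change."""
    found = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or partial lines
        if "current.health" in record:
            found.append((
                record.get("@timestamp", ""),
                record.get("previous.health", ""),
                record["current.health"],
                record.get("reason", ""),
            ))
    return found
```

Feeding it the output of `docker logs clearml-elastic` (for instance via `sys.stdin`) lists every RED, YELLOW and GREEN change with its reason, which shows whether the `events-log-...` shard ever dropped out of GREEN around the time of the failed `events.add_batch` calls.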

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
curl: (7) Failed to connect to localhost port 9200 after 0 ms: Couldn't connect to server

  				
Posted 3 months ago by PerplexedShells66

I do not see any issues in the log. Do you still get errors in the task due to the failure in events.add_batch?

  				
Posted 3 months ago by ResponsiveKoala38

On what host did you run the curl command?

  				
Posted 3 months ago by ResponsiveKoala38

On the EC2 instance, after I ssh-ed into it.

  				
Posted 3 months ago by PerplexedShells66

Probably port 9200 is not mapped from the ES container in the docker compose file.
The easiest would be to run "sudo docker exec -it clearml-elastic /bin/bash" and then run the curl command from inside the ES container.

  				
Posted 3 months ago by ResponsiveKoala38
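The same idea can be scripted so that each check does not need an interactive shell. This is a minimal sketch, assuming the container is named `clearml-elastic` as in the thread, that `curl` is available inside it, and that the calling user may run `docker` (often only via `sudo`); the helper names `es_curl_command` and `es_curl` are hypothetical.

```python
import subprocess
from typing import List


def es_curl_command(path: str, container: str = "clearml-elastic") -> List[str]:
    """Build a `docker exec` command that curls Elasticsearch from inside its container."""
    if not path.startswith("/"):
        path = "/" + path
    return ["docker", "exec", container, "curl", "-s", f"localhost:9200{path}"]


def es_curl(path: str, container: str = "clearml-elastic") -> str:
    """Run the command and return the response body as text."""
    result = subprocess.run(
        es_curl_command(path, container),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if docker or curl fails
    )
    return result.stdout
```

With this in place, `es_curl("_cluster/allocation/explain?pretty")` runs the earlier request from the host without relying on port 9200 being published.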

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See  for more information."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See  for more information."
  },
  "status" : 400
}

This means that the Elasticsearch server hasn't started, right?

  				
Posted 3 months ago by PerplexedShells66

No, it says that there are no unassigned shards, i.e. it does not detect any problematic shards. Given that output and the absence of errors in the logs, I would expect that you will not get the error anymore.

  				
Posted 3 months ago by ResponsiveKoala38
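The 400 response above is easy to misread as a server failure, so here is a small sketch of how it could be interpreted, and how a request for one specific shard could be built instead. It assumes the allocation explain API accepts a JSON body with `index`, `shard` and `primary`, and that a successful response carries `index`, `shard` and `current_state`; the function names are hypothetical.

```python
import json


def explain_request_body(index: str, shard: int = 0, primary: bool = True) -> str:
    """JSON body that asks allocation explain about one specific shard."""
    return json.dumps({"index": index, "shard": shard, "primary": primary})


def interpret_explain(response_text: str) -> str:
    """Give a one-line reading of an allocation explain response."""
    data = json.loads(response_text)
    error = data.get("error")
    if error:
        reason = error.get("reason", "")
        if "no unassigned shards" in reason:
            # Not a failure: the cluster simply has nothing unassigned to explain.
            return "no unassigned shards: every shard is currently allocated"
        return f"request failed: {error.get('type', 'unknown error')}"
    # A successful response describes one shard and its current state.
    return f"shard {data.get('index')}[{data.get('shard')}] is {data.get('current_state')}"
```

The string from `explain_request_body("events-log-d1bd92a3b039400cbafc60a7a5b1e52b")` can be passed to curl with `-H 'Content-Type: application/json' -d '<body>'` to ask about the exact shard named in the original error rather than a randomly chosen unassigned one.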

Did you try to run your job again?

  				
Posted 3 months ago by ResponsiveKoala38

I have been rerunning it since yesterday. The error persists.

I can try one more time though.

  				
Posted 3 months ago by PerplexedShells66

I would try it again.

  				
Posted 3 months ago by ResponsiveKoala38

Alright, it's running...

  				
Posted 3 months ago by PerplexedShells66

@ResponsiveKoala38 It's not resolved.

  				
Posted 3 months ago by PerplexedShells66

In ES container please run "curl -XGET localhost:9200/_cat/indices"

  				
Posted 3 months ago by ResponsiveKoala38

What is the status that you get for the "events-log-d1bd92a3b039400cbafc60a7a5b1e52b" index?

  				
Posted 3 months ago by ResponsiveKoala38
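To answer the question about one index without reading the whole listing, the `_cat/indices` text output can be filtered. This is a sketch only, assuming the default column order in which each line starts with health, status and index name, and that the index is open; `index_state` is a hypothetical helper name.

```python
from typing import Dict, Optional

# Leading columns of the default `_cat/indices` output.
COLUMNS = ("health", "status", "index")


def index_state(cat_output: str, index_name: str) -> Optional[Dict[str, str]]:
    """Return the health and status of `index_name`, or None if it is not listed."""
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < len(COLUMNS):
            continue  # blank or malformed line
        row = dict(zip(COLUMNS, fields))
        if row["index"] == index_name:
            return row
    return None
```

Given the raw output as a string, `index_state(output, "events-log-d1bd92a3b039400cbafc60a7a5b1e52b")` returns a dictionary with the `health` and `status` values for that index, or `None` if the index does not exist.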
