Anyone Faced An Issue With Elasticsearch Before

Answered

Has anyone faced this issue with Elasticsearch before?

h8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:41:21,688 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>)

8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:48:20,538 - urllib3.connectionpool - WARNING - Retrying (Retry(total=234, connect=234, read=240, redirect=240, status=240)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x794c188c8460>, 'Connection to "" timed out. (connect timeout=10.0)')': /v2.23/events.add_batch

I'm self-hosting ClearML Server on EC2 (t3.large).

  				
Posted 5 months ago by PerplexedShells66


Answers 51

OK. The EBS volume is currently 15 GB. Is there a recommended size?

  				
Posted 5 months ago by PerplexedShells66

No, it says that it does not detect any problematic shards. Given that output and the absence of errors in the logs, I would expect that you will not get the error anymore.

  				
Posted 5 months ago by ResponsiveKoala38

@CostlyOstrich36
I've updated the instance type to t3a.large.
The issue persisted.

  				
Posted 5 months ago by PerplexedShells66

Then it is possibly another cause. You need to look for it in the ES logs.

  				
Posted 5 months ago by ResponsiveKoala38

  				
Posted 5 months ago by PerplexedShells66

I need to SSH into the instance, right?
I'll check it out.

  				
Posted 5 months ago by PerplexedShells66

On what host did you run the curl command?

  				
Posted 5 months ago by ResponsiveKoala38

And are you still getting exactly this error?

<500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>

  				
Posted 5 months ago by ResponsiveKoala38

Yes, the entire error keeps repeating.
It happens at the end of the script.

I'm using the recommended instance (t3.large).

  				
Posted 5 months ago by PerplexedShells66

To close this thread: the file server port wasn't configured.
I added

        - IpProtocol: tcp
          FromPort: 8081
          ToPort: 8081
          CidrIp: 0.0.0.0/0

to the CloudFormation template, and that resolved it.

Thanks a bunch, guys
@ResponsiveKoala38 @CostlyOstrich36

  				
Posted 5 months ago by PerplexedShells66
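
The urllib3 warning in the question is a connect timeout, and the fix above was opening port 8081 in the security group. A quick way to confirm this kind of problem from the machine that runs the training script is to test each server port directly. The following is a minimal sketch, assuming the default ClearML Server ports (8080 for the web UI, 8008 for the API server, 8081 for the file server); the helper names are hypothetical and the host is whatever address your server uses.

```python
import socket

# Assumed default ClearML Server ports; adjust if your deployment differs.
CLEARML_PORTS = {"web": 8080, "api": 8008, "files": 8081}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_clearml_ports(host: str, timeout: float = 3.0) -> dict:
    """Map each service name to whether its port accepts TCP connections."""
    return {
        name: port_is_open(host, port, timeout)
        for name, port in CLEARML_PORTS.items()
    }
```

Calling `check_clearml_ports("<your-server-address>")` from the training host returns a dict such as `{"web": ..., "api": ..., "files": ...}`. A `False` entry for a port that is listening on the server usually points at a firewall or security group rule rather than at the service itself.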

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
curl: (7) Failed to connect to localhost port 9200 after 0 ms: Couldn't connect to server

  				
Posted 5 months ago by PerplexedShells66

I have been rerunning it since yesterday. The error persists.

I can try one more time though.

  				
Posted 5 months ago by PerplexedShells66

I tried deleting all the underlying resources (EC2 and EBS) and recreating them.

  				
Posted 5 months ago by PerplexedShells66

@ResponsiveKoala38 I'm looking at the logs now (used "docker logs clearml-elastic").

The status seems to have transitioned, but the error itself isn't clear.

{"@timestamp":"2025-05-20T08:36:18.412Z", "log.level": "INFO", "message":"setting file [/usr/share/elasticsearch/config/operator/settings.json] not found, initializing [file_settings] as empty", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.reservedstate.service.FileSettingsService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.068Z", "log.level": "INFO", "message":"Node [{clearml}{wEMvgjW3SUSt8Y8ls7aEyw}] is selected as the current health node.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][management][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.360Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[events-plot-][0]]]).","previous.health":"RED","reason":"shards started [[events-plot-][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.250Z", "log.level": "INFO", "message":"[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05] creating index, cause [auto(bulk api)], templates [queue_metrics], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.489Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]]).","previous.health":"YELLOW","reason":"shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.852Z", "log.level": "INFO", "message":"[events-log-d1bd92a3b039400cbafc60a7a5b1e52b] creating index, cause [auto(bulk api)], templates [events_log], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.964Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).","previous.health":"YELLOW","reason":"shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}

  				
Posted 5 months ago by PerplexedShells66
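
The Elasticsearch log lines above are one JSON object per line, and the health changes are easy to miss among the other entries. As a rough sketch, the transitions can be pulled out with a few lines of Python. The function name is hypothetical; the field names (`@timestamp`, `previous.health`, `current.health`, `reason`) are the ones that appear in the log excerpt above.

```python
import json


def health_transitions(log_text: str) -> list:
    """Extract (timestamp, previous, current, reason) from ES JSON log lines."""
    transitions = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if "current.health" in record and "previous.health" in record:
            transitions.append(
                (
                    record.get("@timestamp"),
                    record["previous.health"],
                    record["current.health"],
                    record.get("reason", ""),
                )
            )
    return transitions
```

Feeding it the text captured from `docker logs clearml-elastic` gives one tuple per health change. In the excerpt above every transition ends in GREEN, which matches the later observation that no problematic shards were detected.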

I'm beginning to think that the cause is something other than ClearML. I'll execute the training script remotely (SageMaker) instead of in SageMaker local mode.

  				
Posted 5 months ago by PerplexedShells66

In the ES container, please run "curl -XGET localhost:9200/_cat/indices"

  				
Posted 5 months ago by ResponsiveKoala38
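
The curl attempt from the EC2 host failed to connect to localhost:9200, which suggests that port 9200 is not published on the host and is only reachable inside the container. One way to run the suggested queries is through `docker exec`. The sketch below assumes the container is named `clearml-elastic` (the name used with `docker logs` in this thread) and that `curl` is available inside the Elasticsearch image; the helper names are hypothetical.

```python
import subprocess


def es_curl_command(path: str, container: str = "clearml-elastic") -> list:
    """Build a `docker exec` command that curls Elasticsearch inside its container."""
    url = f"localhost:9200/{path.lstrip('/')}"
    return ["docker", "exec", container, "curl", "-s", "-XGET", url]


def run_es_query(path: str, container: str = "clearml-elastic") -> str:
    """Run the query and return the response body as text.

    Raises subprocess.CalledProcessError if docker or curl exits non-zero.
    """
    result = subprocess.run(
        es_curl_command(path, container),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_es_query("_cat/indices")` and `run_es_query("_cluster/allocation/explain?pretty")` correspond to the two curl commands mentioned in this thread.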

While on the host, you can run some ES commands to check shard health and allocation. For example:

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"

It may give more clues about the problem.

  				
Posted 5 months ago by ResponsiveKoala38

Also, it would be great if you could add a recommendation for EBS size in this guide ( None ).
The Elasticsearch issue happened with 8 GB and was resolved with 15 GB.

  				
Posted 5 months ago by PerplexedShells66
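
The thread does not confirm why the 8 GB volume caused trouble, but one plausible explanation is Elasticsearch's disk-based shard allocation: by default it stops allocating shards to a node as disk usage crosses its watermarks (85% low, 90% high, 95% flood stage). The following is a minimal sketch, under that assumption, for checking how close a volume is to those defaults. The function names are hypothetical, and the path should point at the filesystem that holds the Elasticsearch data.

```python
import shutil

# Elasticsearch default disk watermarks, as fractions of disk used.
# Ordered from most to least severe so the first match wins.
WATERMARKS = [("flood_stage", 0.95), ("high", 0.90), ("low", 0.85)]


def watermark_level(used_fraction: float) -> str:
    """Return the highest default watermark reached, or 'ok' if none."""
    for name, threshold in WATERMARKS:
        if used_fraction >= threshold:
            return name
    return "ok"


def disk_report(path: str = "/") -> tuple:
    """Return (used_fraction, watermark_level) for the filesystem at path."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    return used_fraction, watermark_level(used_fraction)
```

Running `disk_report()` on the server gives the used fraction and the matching level. Anything other than "ok" on a small volume would be consistent with the allocation errors going away after the volume was enlarged, though it does not prove that was the cause here.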

EC2, after I SSH-ed into the instance.

  				
Posted 5 months ago by PerplexedShells66

That's a big context!
In general, I'm using standard functions; the script runs in a SageMaker pipeline.
The model, however, is a composite that consists of multiple primitive models.


from clearml import OutputModel, Task

task = Task.init(
    project_name="icp",
    task_name=f"model_training_{client_name}",
    task_type=Task.TaskTypes.training,
    auto_connect_frameworks={'matplotlib': True, 'tensorflow': False, 
                             'tensorboard': False,
                            'pytorch': False, 'xgboost': False, 'scikit': False, 'fastai': False,
                            'lightgbm': False, 'hydra': True, 'detect_repository': True, 'tfdefines': False,
                            'joblib': False, 'megengine': False, 'catboost': False, 'gradio': False
    },
    output_uri=False
)

task.set_script(repository=repo_url, branch=branch_name, working_dir="./", commit=commit_id)
task.set_parameter("commit_id", commit_id)

task.connect_configuration()

output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)

....

task = Task.current_task()
if task is None:
    print("Warning: No ClearML task found. Metrics will not be logged to ClearML.")
    logger = None
else:
    logger = task.get_logger()

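# Only report when a logger exists; arguments are omitted as in the original post.
if logger is not None:
    logger.report_matplotlib_figure()
    logger.report_scalar()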

  				
Posted 5 months ago by PerplexedShells66

Can you provide a standalone code snippet that reproduces this behaviour?

  				
Posted 5 months ago by CostlyOstrich36
