One of the most likely reasons for this issue is insufficient free disk space for Elasticsearch. This can happen when less than 10% of free space is left on the ES storage location, but there may also be other reasons.
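You can check this on the ES host, for example with something like:
df -h
curl -XGET "localhost:9200/_cat/allocation?v"
(assuming ES is reachable on the default 9200 port; the second command shows the per-node disk usage as Elasticsearch sees it)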
Also, (without ClearML) the model artifacts are uploaded and downloadable.
OK. Currently the EBS volume is 15 GB; is there a recommended size?
ClearML Task: created new task id=f08b012bce42420dba7cd166668f5e4b
2025-05-20 09:54:59,251 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: /projects/184c6e8651d94b9088ae60ae3a9c8ace/experiments/f08b012bce42420dba7cd166668f5e4b/output/log
2025-05-20 12:55:02
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Starting the training.
....
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2025-05-20 13:02:08
2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb6a0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:04:25
2025-05-20 10:04:25,347 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32bf0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32c20>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33040>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:06:48
2025-05-20 10:06:48,615 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32da0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33f40>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after
To close this thread: the file server port (8081) wasn't configured in the security group. I added

- IpProtocol: tcp
  FromPort: 8081
  ToPort: 8081
  CidrIp: 0.0.0.0/0

to the CloudFormation template, and it was resolved.
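(For anyone hitting the same thing, you can quickly verify the file server port is reachable with something like "nc -zv <server-address> 8081", replacing <server-address> with your ClearML server host.)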
Thanks a bunch, guys
@ResponsiveKoala38 @CostlyOstrich36
This seems to be something different, not connected to ES. Where do you get these logs from?
I do not see any issues in the log. Do you still get errors in the task due to the failure in events.add_batch?
No, it says that it does not detect any problematic shards. Given that output and the absence of errors in the logs, I would expect that you will not get the error anymore.
On what host did you run the curl command?
@CostlyOstrich36
I've updated the instance type to t3a.large.
The issue persisted.
It's behaving very strangely.
I'm trying to provision the instance, but something is off.
It's as if some functionality is missing.
green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b Yh4BPGmgRZKU7STdCghmtw 1 0 96 0 175.1kb 175.1kb 175.1kb
Probably port 9200 is not mapped from the ES container in the docker compose
The easiest would be to run "sudo docker exec -it clearml-elastic /bin/bash" and then run the curl command from inside the ES container.
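e.g. something along these lines should work as a one-shot check:
sudo docker exec -it clearml-elastic curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
(assuming the container is named clearml-elastic, as in the default docker compose)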
Looks like elastic is failing to access a shard. Do you have visibility into machine utilization? How much RAM is elastic consuming?
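For example, assuming the default docker compose setup, something like this should show it:
sudo docker stats --no-stream clearml-elastic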
Also, is this the entire error repeating or is there more context?
That's a big context!
In general, I'm using standard functions; the script runs in a SageMaker pipeline.
The model, however, is a composite and consists of multiple primitive ones.
from clearml import OutputModel, Task

task = Task.init(
    project_name="icp",
    task_name=f"model_training_{client_name}",
    task_type=Task.TaskTypes.training,
    auto_connect_frameworks={
        'matplotlib': True, 'tensorflow': False, 'tensorboard': False,
        'pytorch': False, 'xgboost': False, 'scikit': False, 'fastai': False,
        'lightgbm': False, 'hydra': True, 'detect_repository': True, 'tfdefines': False,
        'joblib': False, 'megengine': False, 'catboost': False, 'gradio': False
    },
    output_uri=False  # no default upload destination for model snapshots
)
# manually attach the repository information (no local git repo in the SageMaker job)
task.set_script(repository=repo_url, branch=branch_name, working_dir="./", commit=commit_id)
task.set_parameter("commit_id", commit_id)
task.connect_configuration()  # configuration argument omitted in this snippet
# register weights that were already uploaded to S3
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
....
task = Task.current_task()
if task is None:
    print("Warning: No ClearML task found. Metrics will not be logged to ClearML.")
    logger = None
else:
    logger = task.get_logger()
    logger.report_matplotlib_figure()  # arguments omitted in this snippet
    logger.report_scalar()  # arguments omitted in this snippet
And are you still getting exactly this error?
<500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>
While on the host, you can run some ES commands to check the shard health and allocation. For example:
curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
It may give more clues to the problem
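You can also list the shards and indices directly, e.g.:
curl -XGET "localhost:9200/_cat/shards?v"
curl -XGET "localhost:9200/_cat/indices?v"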
Then possibly there is another reason. We need to search for it in the ES logs.
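Assuming the standard docker compose deployment, something like this should show the recent ES logs:
sudo docker logs --tail 200 clearml-elastic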