Anyone Faced An Issue With Elasticsearch Before

Answered

Has anyone faced this Elasticsearch issue before?

h8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:41:21,688 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>)

8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:48:20,538 - urllib3.connectionpool - WARNING - Retrying (Retry(total=234, connect=234, read=240, redirect=240, status=240)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x794c188c8460>, 'Connection to "" timed out. (connect timeout=10.0)')': /v2.23/events.add_batch

I'm self-hosting ClearML Server on EC2 (t3.large).

  				
Posted 5 months ago by PerplexedShells66


Answers 51

Also, it would be great if you could add a recommended EBS volume size to this guide.
The Elasticsearch issue occurred with 8 GB and was resolved with 15 GB.
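One plausible explanation for why a larger volume helped, which the thread itself does not confirm, is Elasticsearch's disk watermarks: by default it restricts shard allocation once a node's disk is about 85% to 90% full and blocks writes at about 95%. The sketch below compares local disk usage against those default thresholds. The function names and the choice of path are illustrative assumptions, not part of ClearML or Elasticsearch.

```python
import shutil

# Elasticsearch's default disk watermarks, as fractions of disk space used.
LOW_WATERMARK = 0.85
HIGH_WATERMARK = 0.90
FLOOD_STAGE_WATERMARK = 0.95


def watermark_status(used_fraction: float) -> str:
    """Classify a disk-usage fraction against the default watermarks."""
    if used_fraction >= FLOOD_STAGE_WATERMARK:
        return "flood_stage"
    if used_fraction >= HIGH_WATERMARK:
        return "high"
    if used_fraction >= LOW_WATERMARK:
        return "low"
    return "ok"


def disk_used_fraction(path: str = "/") -> float:
    """Return the used fraction of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


# Assumption: the Elasticsearch data directory lives on the root volume.
fraction = disk_used_fraction("/")
print(f"{fraction:.0%} used -> {watermark_status(fraction)}")
```

Running this on the server host gives a quick indication of whether the volume is close to a threshold; pass the actual data directory if it is mounted elsewhere.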

  				
Posted 5 months ago by PerplexedShells66

To close this thread: the file server port wasn't configured.
I added

        - IpProtocol: tcp
          FromPort: 8081
          ToPort: 8081
          CidrIp: 0.0.0.0/0

to the CloudFormation template, and that resolved it.
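A quick way to catch this kind of misconfiguration from the client side is to test whether each ClearML Server port accepts TCP connections. The sketch below assumes the default ports (8080 for the web UI, 8008 for the API server, 8081 for the file server); the helper names are illustrative and the host you pass in is your own server address.

```python
import socket

# Default ClearML Server ports: web UI, API server, file server.
CLEARML_PORTS = {"web": 8080, "api": 8008, "files": 8081}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_clearml_ports(host: str, timeout: float = 3.0) -> dict:
    """Map each service name to whether its port accepted a connection."""
    return {
        name: port_is_open(host, port, timeout)
        for name, port in CLEARML_PORTS.items()
    }
```

Calling check_clearml_ports with the server's public hostname from the machine that runs the training job would show "files" as False when the security group lacks the 8081 rule.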

Thanks a bunch, guys
@ResponsiveKoala38 @CostlyOstrich36

  				
Posted 5 months ago by PerplexedShells66

That's a big context!
In general, I'm using standard functions; the script is running in a SageMaker pipeline.
The model, however, is a composite that consists of multiple primitive models.


from clearml import OutputModel, Task

# client_name, repo_url, branch_name, commit_id and s3_model_uri are defined elsewhere in the script.
task = Task.init(
    project_name="icp",
    task_name=f"model_training_{client_name}",
    task_type=Task.TaskTypes.training,
    # Only matplotlib and hydra auto-logging (plus repository detection) are enabled.
    auto_connect_frameworks={
        'matplotlib': True, 'tensorflow': False, 'tensorboard': False,
        'pytorch': False, 'xgboost': False, 'scikit': False, 'fastai': False,
        'lightgbm': False, 'hydra': True, 'detect_repository': True, 'tfdefines': False,
        'joblib': False, 'megengine': False, 'catboost': False, 'gradio': False,
    },
    output_uri=False,
)

task.set_script(repository=repo_url, branch=branch_name, working_dir="./", commit=commit_id)
task.set_parameter("commit_id", commit_id)

task.connect_configuration()  # arguments omitted in the original post

# Register the model that already sits in S3 instead of uploading it again.
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)

....

task = Task.current_task()
if task is None:
    print("Warning: No ClearML task found. Metrics will not be logged to ClearML.")
    logger = None
else:
    logger = task.get_logger()

# Report only when a logger exists (arguments omitted in the original post).
if logger is not None:
    logger.report_matplotlib_figure()
    logger.report_scalar()

  				
Posted 5 months ago by PerplexedShells66

Can you provide a standalone code snippet that reproduces this behaviour?

  				
Posted 5 months ago by CostlyOstrich36

No, it's something else.

I commented out those two lines and I was still facing the issue.

  				
Posted 5 months ago by PerplexedShells66

So you narrowed it down to these lines?

output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)

This is what causes the timeout errors? Did you define

  				
Posted 5 months ago by CostlyOstrich36

Also, without ClearML, the model artifacts are uploaded and downloadable.

  				
Posted 5 months ago by PerplexedShells66

@ResponsiveKoala38 @CostlyOstrich36
It's ClearML: I commented out the ClearML lines, and it ran successfully!

  				
Posted 5 months ago by PerplexedShells66

I'm beginning to think that something besides ClearML is involved. I'll run the training script remotely on SageMaker instead of in SageMaker local mode.

  				
Posted 5 months ago by PerplexedShells66

I tested that theory before; I commented out these two lines

output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)

The issue, however, persisted.

  				
Posted 5 months ago by PerplexedShells66

  				

I assume that ec2-13-217-109-164.compute-1.amazonaws.com is the EC2 instance where the API is running?
Are you using the files server or S3 for storage? Can you verify on the storage itself that the artifacts are actually uploaded and are downloadable?

  				
Posted 5 months ago by CostlyOstrich36

The "...." lines are model-specific logs.

  				
Posted 5 months ago by PerplexedShells66

ClearML Task: created new task id=f08b012bce42420dba7cd166668f5e4b
2025-05-20 09:54:59,251 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: /projects/184c6e8651d94b9088ae60ae3a9c8ace/experiments/f08b012bce42420dba7cd166668f5e4b/output/log
2025-05-20 12:55:02
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Starting the training.

....

ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2025-05-20 13:02:08
2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb6a0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:04:25
2025-05-20 10:04:25,347 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32bf0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32c20>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33040>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:06:48
2025-05-20 10:06:48,615 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32da0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33f40>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after

  				
Posted 5 months ago by PerplexedShells66

Can you provide the full log?

  				
Posted 5 months ago by CostlyOstrich36

console (client).

  				
Posted 5 months ago by PerplexedShells66

This seems to be something different, not connected to ES. Where do you get these logs from?

  				
Posted 5 months ago by ResponsiveKoala38

So it's the same ClearML monitor message, but a different issue now.

By the way, the task does log the configuration, artifacts, etc.
I get this error at the end.

  				
Posted 5 months ago by PerplexedShells66

it's

ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
fzd6tw0x46-algo-1-lswt4  | 2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4  | 2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4  | 2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urlli

  				
Posted 5 months ago by PerplexedShells66

And are you still getting exactly this error?

<500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>

  				
Posted 5 months ago by ResponsiveKoala38

green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b            Yh4BPGmgRZKU7STdCghmtw 1 0   96 0 175.1kb 175.1kb 175.1kb

  				
Posted 5 months ago by PerplexedShells66

What is the status that you get for the "events-log-d1bd92a3b039400cbafc60a7a5b1e52b" index?

  				
Posted 5 months ago by ResponsiveKoala38

In the ES container, please run "curl -XGET localhost:9200/_cat/indices"
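To make that output easier to scan when there are many indices, it can be parsed with a few lines of Python. This is a minimal sketch that assumes the default column order without a header row (health, status, index name first); the function names are illustrative, and the second sample line is a hypothetical index added only to show a non-green result.

```python
def parse_cat_indices(text: str) -> list:
    """Parse `_cat/indices` output (no header row) into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        health, status, index = fields[:3]
        rows.append({"health": health, "status": status, "index": index})
    return rows


def unhealthy_indices(text: str) -> list:
    """Return the names of indices whose health is not green."""
    return [row["index"] for row in parse_cat_indices(text) if row["health"] != "green"]


sample = (
    "green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b "
    "Yh4BPGmgRZKU7STdCghmtw 1 0 96 0 175.1kb 175.1kb 175.1kb\n"
    # Hypothetical line, for illustration only.
    "yellow open example-index AAAAAAAAAAAAAAAAAAAAAA 1 1 0 0 1kb 1kb 1kb\n"
)
print(unhealthy_indices(sample))
```

An empty result means every listed index reports green health, which matches the status reported for the events-log index in this thread.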

  				
Posted 5 months ago by ResponsiveKoala38

@ResponsiveKoala38 It's not resolved.

  				
Posted 5 months ago by PerplexedShells66

Alright, it's running...

  				
Posted 5 months ago by PerplexedShells66

I would try it again.

  				
Posted 5 months ago by ResponsiveKoala38

I have been rerunning it since yesterday. The error persists.

I can try one more time though.

  				
Posted 5 months ago by PerplexedShells66

Did you try to run your job again?

  				
Posted 5 months ago by ResponsiveKoala38

No, it says that it does not detect any problematic shards. Given that output and the absence of errors in the logs, I would expect that you will not get the error anymore.
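That reading can be expressed as a small check. The 400 response is produced by Elasticsearch itself, so receiving it already shows that the server is up; the message only says there is no unassigned shard to explain. The sketch below is a hedged illustration: the function name and the returned strings are my own, and the input is the parsed JSON body of the allocation explain response.

```python
def explain_response_summary(response: dict) -> str:
    """Summarize a parsed cluster allocation explain response body."""
    error = response.get("error")
    if error is None:
        # No error key: Elasticsearch returned an explanation for some shard.
        return "explanation returned for a shard"
    if not isinstance(error, dict):
        return "other error: " + str(error)
    reason = error.get("reason", "")
    if response.get("status") == 400 and "no unassigned shards" in reason:
        return "cluster reachable; no unassigned shards"
    return "other error: " + error.get("type", "unknown")


# Shortened copy of the response quoted in this thread.
quoted = {
    "error": {
        "type": "illegal_argument_exception",
        "reason": "No shard was specified in the request which means the "
                  "response should explain a randomly-chosen unassigned shard, "
                  "but there are no unassigned shards in this cluster.",
    },
    "status": 400,
}
print(explain_response_summary(quoted))
```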

  				
Posted 5 months ago by ResponsiveKoala38

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See for more information."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See for more information."
  },
  "status" : 400
}

This means that the Elasticsearch server hasn't started, right?

  				
Posted 5 months ago by PerplexedShells66
