One of the most likely reasons for this issue is insufficient free disk space for Elasticsearch. This can happen when less than 10% of free space is left on the ES storage location, but there may also be other reasons.
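You can check this on the ES host, for example with something like:
df -h
curl -XGET "localhost:9200/_cat/allocation?v"
(assuming ES is reachable on the default 9200 port; the second command shows the per-node disk usage as Elasticsearch sees it)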
Also, (without ClearML) the model artifacts are uploaded and downloadable.
OK. Currently the EBS volume is 15 GB; is there a recommended size?
ClearML Task: created new task id=f08b012bce42420dba7cd166668f5e4b
2025-05-20 09:54:59,251 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: /projects/184c6e8651d94b9088ae60ae3a9c8ace/experiments/f08b012bce42420dba7cd166668f5e4b/output/log
2025-05-20 12:55:02
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Starting the training.
....
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2025-05-20 13:02:08
2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb6a0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:04:25
2025-05-20 10:04:25,347 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32bf0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32c20>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33040>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:06:48
2025-05-20 10:06:48,615 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32da0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33f40>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after
To close this thread: the file server port (8081) wasn't configured in the security group. I added

- IpProtocol: tcp
  FromPort: 8081
  ToPort: 8081
  CidrIp: 0.0.0.0/0

to the CloudFormation template, and it was resolved.
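(For anyone hitting the same thing, you can quickly verify the file server port is reachable with something like "nc -zv <server-address> 8081", replacing <server-address> with your ClearML server host.)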
Thanks a bunch, guys
@ResponsiveKoala38 @CostlyOstrich36
This seems to be something different, not connected to ES. Where do you get these logs from?
I do not see any issues in the log. Do you still get errors in the task due to the failure in events.add_batch?
No, it says that it does not detect any problematic shards. Given that output and the absence of errors in the logs, I would expect that you will not get the error anymore.
On what host did you run the curl command?
@CostlyOstrich36
I've updated the instance type to t3a.large.
The issue persisted.
It's behaving very strangely.
I'm trying to provision the instance, but something is off.
It's as if some functionality is missing.
green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b Yh4BPGmgRZKU7STdCghmtw 1 0 96 0 175.1kb 175.1kb 175.1kb
Probably port 9200 is not mapped from the ES container in the docker compose
The easiest would be to run "sudo docker exec -it clearml-elastic /bin/bash" and then run the curl command from inside the ES container.
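e.g. something along these lines should work as a one-shot check:
sudo docker exec -it clearml-elastic curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
(assuming the container is named clearml-elastic, as in the default docker compose)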
Looks like elastic is failing to access a shard. Do you have visibility into machine utilization? How much RAM is elastic consuming?
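For example, assuming the default docker compose setup, something like this should show it:
sudo docker stats --no-stream clearml-elastic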
Also, is this the entire error repeating or is there more context?
That's a big context!
In general, I'm using standard functions; the script runs in a SageMaker pipeline.
The model, however, is a composite and consists of multiple primitive ones.
from clearml import OutputModel, Task

task = Task.init(
    project_name="icp",
    task_name=f"model_training_{client_name}",
    task_type=Task.TaskTypes.training,
    auto_connect_frameworks={
        'matplotlib': True, 'tensorflow': False, 'tensorboard': False,
        'pytorch': False, 'xgboost': False, 'scikit': False, 'fastai': False,
        'lightgbm': False, 'hydra': True, 'detect_repository': True, 'tfdefines': False,
        'joblib': False, 'megengine': False, 'catboost': False, 'gradio': False
    },
    output_uri=False  # no default upload destination for model snapshots
)
# manually attach the repository information (no local git repo in the SageMaker job)
task.set_script(repository=repo_url, branch=branch_name, working_dir="./", commit=commit_id)
task.set_parameter("commit_id", commit_id)
task.connect_configuration()  # configuration argument omitted in this snippet
# register weights that were already uploaded to S3
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
....
task = Task.current_task()
if task is None:
    print("Warning: No ClearML task found. Metrics will not be logged to ClearML.")
    logger = None
else:
    logger = task.get_logger()
    logger.report_matplotlib_figure()  # arguments omitted in this snippet
    logger.report_scalar()  # arguments omitted in this snippet
And are you still getting exactly this error?
<500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>
While on the host, you can run some ES commands to check the shard health and allocation. For example:
curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
It may give more clues to the problem
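You can also list the shards and indices directly, e.g.:
curl -XGET "localhost:9200/_cat/shards?v"
curl -XGET "localhost:9200/_cat/indices?v"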
Then possibly there is another reason. We need to search for it in the ES logs.
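Assuming the standard docker compose deployment, something like this should show the recent ES logs:
sudo docker logs --tail 200 clearml-elastic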