No, it says that it does not detect any problematic shards. Given that output and the absence of the errors in the logs, I would expect that you will not get the error anymore.
@<1523701070390366208:profile|CostlyOstrich36>
I've updated the instance type to t3a.large.
The issue persisted.
I tested that theory before; I commented out these two lines
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
The issue, however, persisted.
On what host did you run the curl command?
While on the host, you can run some ES commands to check shard health and allocation. For example:
curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
It may give more clues about the problem.
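The same check can be scripted if that is more convenient. This is only a minimal sketch, assuming Elasticsearch is reachable on localhost:9200 as in the curl command; the function names `explain_url` and `fetch_allocation_explain` are made up for illustration. Elasticsearch may answer with an HTTP error status (for example, when it has no unassigned shard to explain), so the sketch returns the error body instead of raising.

```python
import json
import urllib.error
import urllib.request


def explain_url(host="localhost", port=9200):
    """Build the URL used by the curl command (host and port are assumptions)."""
    return f"http://{host}:{port}/_cluster/allocation/explain?pretty"


def fetch_allocation_explain(host="localhost", port=9200, timeout=10):
    """Return (http_status, parsed_json_body) from the allocation explain API."""
    request = urllib.request.Request(explain_url(host, port), method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Error responses still carry a JSON body describing the reason.
        return err.code, json.loads(err.read().decode("utf-8"))
```

Calling `fetch_allocation_explain()` on the host (or inside the ES container) gives back the status code and the parsed body, which can then be pasted into the thread.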
I have been rerunning it since yesterday. The error persists.
I can try one more time though.
I don't store anything on the ClearML server; everything is stored in S3 and referenced by ClearML.
This seems to be something different, not connected to ES. Where do you get these logs?
Then there is possibly another reason. You need to search for it in the ES logs.
One of the most likely reasons for this issue is insufficient free disk space for Elasticsearch. This may happen if less than 10% of the space is free on the ES storage location. But there may also be other reasons.
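To check the free-space theory quickly, something like the following could be run on the server. It is a rough sketch: the 10% threshold comes from the message above, and the helper names are hypothetical. The path to pass in is whatever directory the ES data volume is mounted on (in a default ClearML docker-compose setup that is typically under /opt/clearml/data, but treat that as an assumption and verify it on your host).

```python
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path, min_free=0.10):
    """True when less than `min_free` of the filesystem is free."""
    return free_fraction(path) < min_free
```

For example, `is_low_on_space("/opt/clearml/data")` returning True would support the disk-space explanation; False points to another cause.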
Also, without ClearML, the model artifacts can be uploaded and downloaded.
That's a big context!
In general, I'm using standard functions; the script is running in a SageMaker pipeline.
The model, however, is a composite, and consists of multiple primitive ones.
task = Task.init(
    project_name="icp",
    task_name=f"model_training_{client_name}",
    task_type=Task.TaskTypes.training,
    auto_connect_frameworks={
        'matplotlib': True, 'tensorflow': False,
        'tensorboard': False,
        'pytorch': False, 'xgboost': False, 'scikit': False, 'fastai': False,
        'lightgbm': False, 'hydra': True, 'detect_repository': True, 'tfdefines': False,
        'joblib': False, 'megengine': False, 'catboost': False, 'gradio': False
    },
    output_uri=False
)
task.set_script(repository=repo_url, branch=branch_name, working_dir="./", commit=commit_id)
task.set_parameter("commit_id", commit_id)
task.connect_configuration()
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
....
task = Task.current_task()
if task is None:
    print("Warning: No ClearML task found. Metrics will not be logged to ClearML.")
    logger = None
else:
    logger = task.get_logger()
logger.report_matplotlib_figure()
logger.report_scalar()
Can you provide a standalone code snippet that reproduces this behaviour?
It depends on your usage. ES has default disk watermarks that kick in when used space exceeds 85% and 90% of the storage (these can be overridden). At some point it may switch the index to a "readonly" state.
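To make the thresholds concrete, here is a small sketch that classifies disk usage against them. The 85% and 90% values are the ones mentioned above; the 95% flood-stage value is the Elasticsearch default as far as I know and is not from this thread, so treat it as an assumption. The function name and return labels are purely illustrative.

```python
def disk_watermark_state(used_bytes, total_bytes, low=85.0, high=90.0, flood_stage=95.0):
    """Classify disk usage against watermark percentages.

    low/high follow the 85%/90% figures from the thread; flood_stage=95.0
    is an assumed default at which indices may be switched to read-only.
    """
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct > flood_stage:
        return "flood_stage"
    if used_pct > high:
        return "high"
    if used_pct > low:
        return "low"
    return "ok"
```

On a 15 GB volume, for instance, `disk_watermark_state(13, 15)` is already past the low watermark, so a few GB of events data plus the OS and docker images can be enough to trigger it.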
I need to SSH into the instance, right?
I'll check it out.
it's
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
fzd6tw0x46-algo-1-lswt4 | 2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4 | 2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4 | 2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urlli
Looks like elastic is failing to access a shard. Do you have visibility into machine utilization? How much RAM is elastic consuming?
Also, is this the entire error repeating or is there more context?
OK. Currently the EBS volume is 15 GB; is there a recommended size?
@<1722061389024989184:profile|ResponsiveKoala38> I'm looking at the logs now (used "docker logs clearml-elastic").
The status seems to have transitioned, but the error itself is not clear.
{"@timestamp":"2025-05-20T08:36:18.412Z", "log.level": "INFO", "message":"setting file [/usr/share/elasticsearch/config/operator/settings.json] not found, initializing [file_settings] as empty", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.reservedstate.service.FileSettingsService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.068Z", "log.level": "INFO", "message":"Node [{clearml}{wEMvgjW3SUSt8Y8ls7aEyw}] is selected as the current health node.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][management][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.360Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[events-plot-][0]]]).","previous.health":"RED","reason":"shards started [[events-plot-][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.250Z", "log.level": "INFO", "message":"[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05] creating index, cause [auto(bulk api)], templates [queue_metrics], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.489Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]]).","previous.health":"YELLOW","reason":"shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.852Z", "log.level": "INFO", "message":"[events-log-d1bd92a3b039400cbafc60a7a5b1e52b] creating index, cause [auto(bulk api)], templates [events_log], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.964Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).","previous.health":"YELLOW","reason":"shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
ClearML Task: created new task id=f08b012bce42420dba7cd166668f5e4b
2025-05-20 09:54:59,251 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: /projects/184c6e8651d94b9088ae60ae3a9c8ace/experiments/f08b012bce42420dba7cd166668f5e4b/output/log
2025-05-20 12:55:02
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Starting the training.
....
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2025-05-20 13:02:08
2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb6a0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:04:25
2025-05-20 10:04:25,347 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32bf0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32c20>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33040>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:06:48
2025-05-20 10:06:48,615 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32da0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33f40>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after
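The warnings above are `ConnectTimeoutError`s, meaning the TCP connection to the server was never established. One way to confirm this from inside the training container is a plain socket check. This is a sketch only: `can_connect` is a hypothetical helper, and the host and ports to test are assumptions (8008, 8080 and 8081 are the usual ClearML API, web and file server ports in a default deployment).

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False
```

If `can_connect("<clearml-server-host>", 8008)` is False from the SageMaker container but True from another machine, the problem is more likely network rules (security groups, VPC routing) than ClearML or ES.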
It's behaving very strangely.
I'm trying to provision the instance, but something is off.
It's as if some functionalities are missing.
Port 9200 is probably not mapped from the ES container in the docker compose file.
The easiest would be to run "sudo docker exec -it clearml-elastic /bin/bash" and then run the curl command from inside the ES container.
What is the status that you get for the "events-log-d1bd92a3b039400cbafc60a7a5b1e52b" index?
And are you still getting exactly this error?
<500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>
So you narrowed it down to these lines?
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
This is what causes the timeout errors? Did you define
green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b Yh4BPGmgRZKU7STdCghmtw 1 0 96 0 175.1kb 175.1kb 175.1kb
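For reading a line like the one above, a small parser can help. This is a sketch that assumes the default `_cat/indices` column order (health, status, index, uuid, primary shards, replicas, docs count, deleted docs, then size columns); the function name and dictionary keys are illustrative, and lines for indices with missing values may not parse.

```python
def parse_cat_indices_line(line):
    """Parse one whitespace-separated line of `_cat/indices` output into a dict."""
    parts = line.split()
    if len(parts) < 8:
        raise ValueError("expected at least 8 columns, got %d" % len(parts))
    health, status, index, uuid, pri, rep, docs_count, docs_deleted = parts[:8]
    return {
        "health": health,
        "status": status,
        "index": index,
        "uuid": uuid,
        "primary_shards": int(pri),
        "replicas": int(rep),
        "docs_count": int(docs_count),
        "docs_deleted": int(docs_deleted),
        # Remaining columns are size figures; kept as raw strings.
        "sizes": parts[8:],
    }
```

Applied to the line above, this reports a green, open index with 1 primary shard, 0 replicas and 96 documents, which matches the conclusion that the shard is healthy now.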