So you watered it down to these lines?
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
This is what causes the timeout errors? Did you define
no, it's something else.
I commented out the above two line and I was still facing the issue.
What is the status that you get for the "events-log-d1bd92a3b039400cbafc60a7a5b1e52b" index?
No, it says that it does not detect any problematic shards. Given that output and the absence of the errors in the logs I would expect that you will not get the error anymore
Can you provide a standalone code snippet that reproduces this behaviour?
And are you still getting exactly this error?
<500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>
@<1722061389024989184:profile|ResponsiveKoala38> It's not resolved.
so, the same ClearML monitor error, but another issue now.
btw, the task logs the configuration, artifacts, etc.
I get this error at the end.
to close this thread, file server port wasn't configured
I added
- IpProtocol: tcp
FromPort: 8081
ToPort: 8081
CidrIp: 0.0.0.0/0
to cloudformation template, and it was resolved.
Thanks a bunch, guys
@<1722061389024989184:profile|ResponsiveKoala38> @<1523701070390366208:profile|CostlyOstrich36>
I'm begining to think that there is something besides ClearML. I'll execute the training script on remote (SageMaker), instead of SageMaker local mode.
I need to ssh the instance, right?
I'll check it out.
ok. Currently the ebs is 15 GB, is there a recommended size?
@<1722061389024989184:profile|ResponsiveKoala38> I'm looking at the logs now (used "docker logs clearml-elastic").
The status seemed to had transitioned, but the it's not clear the error.
{"@timestamp":"2025-05-20T08:36:18.412Z", "log.level": "INFO", "message":"setting file [/usr/share/elasticsearch/config/operator/settings.json] not found, initializing [file_settings] as empty", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.reservedstate.service.FileSettingsService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.068Z", "log.level": "INFO", "message":"Node [{clearml}{wEMvgjW3SUSt8Y8ls7aEyw}] is selected as the current health node.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][management][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.360Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[events-plot-][0]]]).","previous.health":"RED","reason":"shards started [[events-plot-][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.250Z", "log.level": "INFO", "message":"[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05] creating index, cause [auto(bulk api)], templates [queue_metrics], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.489Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]]).","previous.health":"YELLOW","reason":"shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.852Z", "log.level": "INFO", "message":"[events-log-d1bd92a3b039400cbafc60a7a5b1e52b] creating index, cause [auto(bulk api)], templates [events_log], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.964Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).","previous.health":"YELLOW","reason":"shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
ClearML Task: created new task id=f08b012bce42420dba7cd166668f5e4b
2025-05-20 09:54:59,251 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: /projects/184c6e8651d94b9088ae60ae3a9c8ace/experiments/f08b012bce42420dba7cd166668f5e4b/output/log
2025-05-20 12:55:02
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Starting the training.
....
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2025-05-20 13:02:08
2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to ec2-13-217-109-164.compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb6a0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:04:25
2025-05-20 10:04:25,347 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32bf0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32c20>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:04:25,348 - urllib3.connectionpool - WARNING - Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33040>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 13:06:48
2025-05-20 10:06:48,615 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c32da0>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683c33f40>, 'Connection to .compute-1.amazonaws.com timed out. (connect timeout=300.0)')': /
2025-05-20 10:06:48,616 - urllib3.connectionpool - WARNING - Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See
for more information."
}
],
"type" : "illegal_argument_exception",
"reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See
for more information."
},
"status" : 400
}
this means that elasticsearch server hasn't started, right?
Also, it would be great if you could add a recommendation for EBS size in this guide ( None ),
The Elastic Search issue happened with 8 GB, and was resolved with 15 GB.
Probably the 9200 port is not mapped from the ES container in the docker compose
The easiest would be to perform "sudo docker exec -it clearml-elastic /bin/bash" and then run the curl command from inside the ES docker
curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
curl: (7) Failed to connect to localhost port 9200 after 0 ms: Couldn't connect to server
I don't store anything on clearml server; everything is being stored in S3 and referenced by ClearML.
While on the host you can run some ES commands to check the shards health and allocations. For example this:
curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
It may give more clues to the problem
One of the most likely reasons for this issue would be insufficient free disk space for Elasticsearch. This may happen if less than 10% of free space is left on ES storage location. But there may be also other reasons
Looks like elastic is failing to access a shard. Do you have visibility into machine utilization? How much RAM is elastic consuming?
Also, is this the entire error repeating or is there more context?
Hi @<1835488771542355968:profile|PerplexedShells66> , please inspect your Elasticsearch logs. Any errors or warnings there?
I tried deleting all the underlying resources: ec2 & ebs, and recreating it again.
That's a big context!
In general, I'm using standard functions; the script is running in SageMaker pipeline.
The model, however, is a composite, and consists of multiple primitive ones.
task = Task.init(
project_name="icp",
task_name=f"model_training_{client_name}",
task_type=Task.TaskTypes.training,
auto_connect_frameworks={'matplotlib': True, 'tensorflow': False,
'tensorboard': False,
'pytorch': False, 'xgboost': False, 'scikit': False, 'fastai': False,
'lightgbm': False, 'hydra': True, 'detect_repository': True, 'tfdefines': False,
'joblib': False, 'megengine': False, 'catboost': False, 'gradio': False
},
output_uri=False
)
task.set_script(repository=repo_url, branch=branch_name, working_dir="./", commit=commit_id)
task.set_parameter("commit_id", commit_id)
task.connect_configuration()
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
....
task = Task.current_task()
if task is None:
print("Warning: No ClearML task found. Metrics will not be logged to ClearML.")
logger = None
else:
logger = task.get_logger()
logger.report_matplotlib_figure()
logger.report_scalar()