While on the host you can run some ES commands to check the shards health and allocations. For example this:
curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
It may give more clues to the problem
I tested that theory before; I commented out these two lines
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
The issue, however, persisted.
green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b Yh4BPGmgRZKU7STdCghmtw 1 0 96 0 175.1kb 175.1kb 175.1kb
This seems something different not connected to ES. Where do you get these logs?
@<1523701070390366208:profile|CostlyOstrich36>
I've updated the instance type to t3a.large.
The issue persisted.
On what host did you run the curl command?
I assume that ec2-13-217-109-164.compute-1.amazonaws.com is the ec2 instance where the API is running?
Are you using the files server or S3 for storage? Can you verify on the storage itself that the artifacts are actually uploaded and are downloadable?
Can you provide a standalone code snippet that reproduces this behaviour?
And are you still getting exactly this error?
<500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>
it's
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
fzd6tw0x46-algo-1-lswt4 | 2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4 | 2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4 | 2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urlli
no, it's something else.
I commented out the above two line and I was still facing the issue.
@<1722061389024989184:profile|ResponsiveKoala38> I'm looking at the logs now (used "docker logs clearml-elastic").
The status seemed to had transitioned, but the it's not clear the error.
{"@timestamp":"2025-05-20T08:36:18.412Z", "log.level": "INFO", "message":"setting file [/usr/share/elasticsearch/config/operator/settings.json] not found, initializing [file_settings] as empty", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.reservedstate.service.FileSettingsService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.068Z", "log.level": "INFO", "message":"Node [{clearml}{wEMvgjW3SUSt8Y8ls7aEyw}] is selected as the current health node.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][management][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.360Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[events-plot-][0]]]).","previous.health":"RED","reason":"shards started [[events-plot-][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.250Z", "log.level": "INFO", "message":"[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05] creating index, cause [auto(bulk api)], templates [queue_metrics], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.489Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]]).","previous.health":"YELLOW","reason":"shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.852Z", "log.level": "INFO", "message":"[events-log-d1bd92a3b039400cbafc60a7a5b1e52b] creating index, cause [auto(bulk api)], templates [events_log], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.964Z", "log.level": "INFO", "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).","previous.health":"YELLOW","reason":"shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
That's a big context!
In general, I'm using standard functions; the script is running in SageMaker pipeline.
The model, however, is a composite, and consists of multiple primitive ones.
task = Task.init(
project_name="icp",
task_name=f"model_training_{client_name}",
task_type=Task.TaskTypes.training,
auto_connect_frameworks={'matplotlib': True, 'tensorflow': False,
'tensorboard': False,
'pytorch': False, 'xgboost': False, 'scikit': False, 'fastai': False,
'lightgbm': False, 'hydra': True, 'detect_repository': True, 'tfdefines': False,
'joblib': False, 'megengine': False, 'catboost': False, 'gradio': False
},
output_uri=False
)
task.set_script(repository=repo_url, branch=branch_name, working_dir="./", commit=commit_id)
task.set_parameter("commit_id", commit_id)
task.connect_configuration()
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
....
task = Task.current_task()
if task is None:
print("Warning: No ClearML task found. Metrics will not be logged to ClearML.")
logger = None
else:
logger = task.get_logger()
logger.report_matplotlib_figure()
logger.report_scalar()
@<1722061389024989184:profile|ResponsiveKoala38> @<1523701070390366208:profile|CostlyOstrich36>
it's ClearML, I commented out clearml lines, and it ran successfully!
I'm begining to think that there is something besides ClearML. I'll execute the training script on remote (SageMaker), instead of SageMaker local mode.
What is the status that you get for the "events-log-d1bd92a3b039400cbafc60a7a5b1e52b" index?
to close this thread, file server port wasn't configured
I added
- IpProtocol: tcp
FromPort: 8081
ToPort: 8081
CidrIp: 0.0.0.0/0
to cloudformation template, and it was resolved.
Thanks a bunch, guys
@<1722061389024989184:profile|ResponsiveKoala38> @<1523701070390366208:profile|CostlyOstrich36>
No, it says that it does not detect any problematic shards. Given that output and the absence of the errors in the logs I would expect that you will not get the error anymore
Probably the 9200 port is not mapped from the ES container in the docker compose
The easiest would be to perform "sudo docker exec -it clearml-elastic /bin/bash" and then run the curl command from inside the ES docker
it's behaving very strangely.
I'm trying to provision the instance, but something is off.
It's as if some functionalities are missing.
ok, I'm recreating the ec2 isntance to generate ssh key pair, then I'll check the elasticsearch logs.
I have been rerunning it since yesterday. The error persists.
I can try one more time though.
Also, (without CLearML) the model artifacts are uploaded/downloadable.
so, the same ClearML monitor error, but another issue now.
btw, the task logs the configuration, artifacts, etc.
I get this error at the end.
I do not see any issues in the log. Do you still get errors in the task due to the failure in events.add_batch?
ok. Currently the ebs is 15 GB, is there a recommended size?
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See
for more information."
}
],
"type" : "illegal_argument_exception",
"reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See
for more information."
},
"status" : 400
}
this means that elasticsearch server hasn't started, right?