Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
Anyone Faced An Issue With Elasticsearch Before

Anyone faced an issue with elasticsearch before

h8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:41:21,688 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=1 document(s) failed to index., extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][f3abecd0f46f4bd289e0ac39662fd850], source[{"timestamp":1747654820464,"type":"log","task":"fd3d00d99d88427bbc576cba53db062d","level":"info","worker":"b1193fbdd662","msg":"Starting the training.\nClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring","model_event":false,"@timestamp":"2025-05-19T11:40:21.919Z","metric":"","variant":""}]}] and a refresh])>)

8q2wwsd3o-algo-1-vnnpg | 2025-05-19 11:48:20,538 - urllib3.connectionpool - WARNING - Retrying (Retry(total=234, connect=234, read=240, redirect=240, status=240)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x794c188c8460>, 'Connection to "" timed out. (connect timeout=10.0)')': /v2.23/events.add_batch

I'm self-hosting clearml server on ec2 (t3.large).

  
  
Posted 5 months ago
Votes Newest

Answers 51


so, the same ClearML monitor error, but another issue now.

btw, the task logs the configuration, artifacts, etc.
I get this error at the end.

  
  
Posted 5 months ago

it's

ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
fzd6tw0x46-algo-1-lswt4  | 2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4  | 2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683cc9810>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4  | 2025-05-20 10:02:08,178 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urlli
  
  
Posted 5 months ago

no, it's something else.

I commented out the above two line and I was still facing the issue.

  
  
Posted 5 months ago

Yes

  
  
Posted 5 months ago

I tested that theory before; I commented out these two lines

output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)

The issue, however, persisted.

  
  
Posted 5 months ago

Probably the 9200 port is not mapped from the ES container in the docker compose
The easiest would be to perform "sudo docker exec -it clearml-elastic /bin/bash" and then run the curl command from inside the ES docker

  
  
Posted 5 months ago

So you watered it down to these lines?

output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)

This is what causes the timeout errors? Did you define

  
  
Posted 5 months ago

console (client).

  
  
Posted 5 months ago

This seems something different not connected to ES. Where do you get these logs?

  
  
Posted 5 months ago

Also, (without CLearML) the model artifacts are uploaded/downloadable.

  
  
Posted 5 months ago

I do not see any issues in the log. Do you still get errors in the task due to the failure in events.add_batch?

  
  
Posted 5 months ago

One of the most likely reasons for this issue would be insufficient free disk space for Elasticsearch. This may happen if less than 10% of free space is left on ES storage location. But there may be also other reasons

  
  
Posted 5 months ago

It depends on your usage. ES has some default watermarks that are activated when the amount of used space is above 85% and 90% (can be overwritten) of the storage. At some point it may transfer the index to a "readonly" state.

  
  
Posted 5 months ago

@<1722061389024989184:profile|ResponsiveKoala38> It's not resolved.

  
  
Posted 5 months ago

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See 
 for more information."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See 
 for more information."
  },
  "status" : 400
}

this means that elasticsearch server hasn't started, right?

  
  
Posted 5 months ago

it's behaving very strangely.

I'm trying to provision the instance, but something is off.
It's as if some functionalities are missing.
image

  
  
Posted 5 months ago

I assume that ec2-13-217-109-164.compute-1.amazonaws.com is the ec2 instance where the API is running?
Are you using the files server or S3 for storage? Can you verify on the storage itself that the artifacts are actually uploaded and are downloadable?

  
  
Posted 5 months ago

@<1722061389024989184:profile|ResponsiveKoala38> I'm looking at the logs now (used "docker logs clearml-elastic").

The status seemed to had transitioned, but the it's not clear the error.

{"@timestamp":"2025-05-20T08:36:18.412Z", "log.level": "INFO", "message":"setting file [/usr/share/elasticsearch/config/operator/settings.json] not found, initializing [file_settings] as empty", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.reservedstate.service.FileSettingsService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.068Z", "log.level": "INFO", "message":"Node [{clearml}{wEMvgjW3SUSt8Y8ls7aEyw}] is selected as the current health node.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][management][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:19.360Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[events-plot-][0]]]).","previous.health":"RED","reason":"shards started [[events-plot-][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.250Z", "log.level": "INFO", "message":"[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05] creating index, cause [auto(bulk api)], templates [queue_metrics], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T08:36:48.489Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]]).","previous.health":"YELLOW","reason":"shards started [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-05][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.852Z", "log.level": "INFO", "message":"[events-log-d1bd92a3b039400cbafc60a7a5b1e52b] creating index, cause [auto(bulk api)], templates [events_log], shards [1]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
{"@timestamp":"2025-05-20T09:25:56.964Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).","previous.health":"YELLOW","reason":"shards started [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][masterService#updateTask][T#11]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"lvIPB_h3RiWqbCvCA-1dbw","elasticsearch.node.id":"wEMvgjW3SUSt8Y8ls7aEyw","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml"}
  
  
Posted 5 months ago

I'm begining to think that there is something besides ClearML. I'll execute the training script on remote (SageMaker), instead of SageMaker local mode.

  
  
Posted 5 months ago

In ES container please run "curl -XGET localhost:9200/_cat/indices"

  
  
Posted 5 months ago

Can you provide the full log?

  
  
Posted 5 months ago

It's the entire error repeating.
And, this happens at the end of the script.

I'm using the recommended instance (t3.large).

  
  
Posted 5 months ago

to close this thread, file server port wasn't configured
I added

        - IpProtocol: tcp
          FromPort: 8081
          ToPort: 8081
          CidrIp: 0.0.0.0/0

to cloudformation template, and it was resolved.

Thanks a bunch, guys
@<1722061389024989184:profile|ResponsiveKoala38> @<1523701070390366208:profile|CostlyOstrich36>

  
  
Posted 5 months ago

I tried deleting all the underlying resources: ec2 & ebs, and recreating it again.

  
  
Posted 5 months ago

What is the status that you get for the "events-log-d1bd92a3b039400cbafc60a7a5b1e52b" index?

  
  
Posted 5 months ago

Looks like elastic is failing to access a shard. Do you have visibility into machine utilization? How much RAM is elastic consuming?

Also, is this the entire error repeating or is there more context?

  
  
Posted 5 months ago

ok. Currently the ebs is 15 GB, is there a recommended size?

  
  
Posted 5 months ago

Then possibly it is another reason. Need to search for in the ES logs

  
  
Posted 5 months ago

s3

  
  
Posted 5 months ago

I need to ssh the instance, right?
I'll check it out.

  
  
Posted 5 months ago