Answered
Hey, we're seeing a lot of issues with our ClearML self-hosted server these days; it seems like the API times out while talking to Elasticsearch

Hey,
We're seeing a lot of issues with our ClearML self-hosted server these days; it seems like the API times out while talking to Elasticsearch:
2022-10-22 09:13:27,520 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>)
2022-10-22 09:14:13,280 - clearml.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>
2022-10-22 09:15:13,427 - clearml.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error (TransportError(429, 'es_rejected_execution_exception', 'rejected execution of coordinating operation [coordinating_and_primary_bytes=214687537, replica_bytes=0, all_bytes=214687537, coordinating_operation_bytes=92360, max_coordinating_and_primary_bytes=214748364]')))>
2022-10-22 09:17:29,776 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>)
2022-10-22 09:19:13,760 - clearml.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>
2022-10-22 09:21:32,003 - clearml.log - WARNING - failed logging task to backend (2 lines, <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>)
2022-10-22 09:23:14,052 - clearml.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>
2022-10-22 09:25:34,208 - clearml.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>)
Any thoughts as to why, or how we can fix it?

  
  
Posted one year ago

Answers 9


We are running the latest version (WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21).
When I run docker logs clearml-elastic I get lots of logs like this one:
{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", "component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", "node.name": "clearml", "message": "successfully reloaded changed geoip database file [/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7sYM-S9Q/GeoLite2-City.mmdb]", "cluster.uuid": "vTBu4rY0QhWZV01xnswjDg", "node.id": "cX7aMqJ4SwCxqM7sYM-S9Q"}

But only INFO logs
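With this many log lines, it can be easier to confirm "only INFO" programmatically than by eye. Below is a minimal sketch that keeps only the records whose level is not INFO. It assumes every relevant line is a single JSON object with a "level" field, as in the line quoted above. The function name non_info_entries is made up for illustration, and the second sample line is invented purely to show a match; it is not from the real server.

```python
import json

def non_info_entries(lines):
    """Yield parsed log records whose level is not INFO; skip non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # stack traces and plain-text lines are ignored here
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or wrapped lines cannot be parsed
        if str(record.get("level", "")).upper() != "INFO":
            yield record

sample = [
    # Real line from the container output quoted above.
    '{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", '
    '"component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", '
    '"node.name": "clearml", "message": "successfully reloaded changed geoip database file '
    '[/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7sYM-S9Q/GeoLite2-City.mmdb]", '
    '"cluster.uuid": "vTBu4rY0QhWZV01xnswjDg", "node.id": "cX7aMqJ4SwCxqM7sYM-S9Q"}',
    # Hypothetical line, invented for this example only.
    '{"type": "server", "level": "WARN", "component": "example", "message": "hypothetical warning"}',
]

for record in non_info_entries(sample):
    print(record["level"], record.get("component"), record.get("message"))
```

To run it against the real container, the same function could be fed sys.stdin with the output of docker logs clearml-elastic piped in (with stderr redirected to stdout, since docker logs may write to both).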

  
  
Posted one year ago

CostlyOstrich36 any thoughts on how we can debug this further? It's making ClearML practically useless for us.

  
  
Posted one year ago

Were there any changes to your Elastic or your server in the past few days?

  
  
Posted one year ago

No, not at all. I reckon we started seeing errors around the middle of last week. We are using default settings for everything except some password stuff on the server.

  
  
Posted one year ago

Yeah, for this server (which is at or close to the minimum requirements), this is really a very heavy load.

  
  
Posted one year ago

Hi GiganticMole91, what version of ClearML server are you using?
Also, can you take a look inside the elastic container to see if there are any errors there?

  
  
Posted one year ago

GiganticMole91 how many experiments are you running concurrently? Are you reporting a lot of metrics/logs in each experiment?

  
  
Posted one year ago

I googled the ES error es_rejected_execution_exception, and it seems to be caused by excessive load on ES. Apparently the hardware cannot keep up with the pace at which you're sending event batches. I would recommend working with smaller batches and checking whether the error goes away.
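The numbers in the 429 message from the original post can be checked directly. The sketch below parses the key=value pairs out of that message and shows how close the in-flight indexing bytes were to the limit. The reading that the request is rejected when in-flight bytes plus the incoming operation exceed max_coordinating_and_primary_bytes is my understanding of Elasticsearch's indexing pressure check, and the heap estimate assumes the limit is left at 10% of the JVM heap; treat both as assumptions rather than something confirmed for this server. The function names are made up for illustration.

```python
import re

# Message copied from the 09:15:13 log line in the original post.
MESSAGE = (
    "rejected execution of coordinating operation "
    "[coordinating_and_primary_bytes=214687537, replica_bytes=0, "
    "all_bytes=214687537, coordinating_operation_bytes=92360, "
    "max_coordinating_and_primary_bytes=214748364]"
)

def parse_rejection(message: str) -> dict:
    """Extract every name=integer pair from the rejection message."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", message)}

def summarize(fields: dict) -> dict:
    in_flight = fields["coordinating_and_primary_bytes"]
    incoming = fields["coordinating_operation_bytes"]
    limit = fields["max_coordinating_and_primary_bytes"]
    return {
        # Fraction of the limit already used before this request arrived.
        "utilization": in_flight / limit,
        # Bytes by which accepting this request would exceed the limit.
        "overshoot_bytes": in_flight + incoming - limit,
        # ASSUMPTION: limit is 10% of the heap, so heap is about limit * 10.
        "implied_heap_bytes": limit * 10,
    }

fields = parse_rejection(MESSAGE)
summary = summarize(fields)
print(f"utilization: {summary['utilization']:.4%}")
print(f"overshoot: {summary['overshoot_bytes']} bytes")
print(f"implied heap: {summary['implied_heap_bytes'] / 2**30:.2f} GiB")
```

Under those assumptions, ES was already holding about 205 MiB of un-indexed event data when a 90 KiB batch arrived, which fits the picture of writes backing up faster than the disk can absorb them, and the limit is consistent with a heap of roughly 2 GiB.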

  
  
Posted one year ago

SuccessfulKoala55 At peak we’ve been running ~50 experiments simultaneously that have been somewhat generous in reported metrics, although not extreme. Our CML server is hosted on an Azure D2S_v3 VM (2 vCPU, 8 GB RAM, 3200 IOPS). It looks like we should upgrade, especially the disk specs. (Taking another look at our VM metrics, we reached 100% OS disk IOPS consumed a couple of times.)

  
  
Posted one year ago