Hi GiganticMole91, what version of ClearML server are you using?
Also, can you take a look inside the Elastic container to see if there are any errors there?
Googling the ES error es_rejected_execution_exception, it seems it's caused by excessive load on ES. Apparently the hardware cannot keep up with the pace at which you're sending event batches. I would recommend working with smaller batches and checking whether the error goes away.
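For example, if your training loop reports scalars every iteration, you can thin out the reporting on the client side so fewer event batches reach ES. This is just a rough sketch, not your actual code; the loop, metric names and REPORT_EVERY value are assumptions:

from clearml import Task

task = Task.init(project_name="examples", task_name="throttled-reporting")
logger = task.get_logger()

REPORT_EVERY = 50  # hypothetical: report 1 in every 50 iterations instead of all of them

for iteration in range(10_000):
    loss = train_step()  # placeholder for your real training step
    if iteration % REPORT_EVERY == 0:
        # fewer report_scalar calls -> fewer/smaller event batches hitting the server and ES
        logger.report_scalar(title="loss", series="train", value=loss, iteration=iteration)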
No, not at all. I reckon we started seeing errors around mid-last week. We are using default settings for everything except some password stuff on the server.
CostlyOstrich36 any thoughts on how we can further debug this? It's making ClearML practically useless for us.
GiganticMole91 how many experiments are you running concurrently? Are you reporting a lot of metrics/logs in each experiment?
Yeah, for this server (which is at or close to the minimum requirements), this is really a very heavy load
We are running the latest version (WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21).
When I run docker logs clearml-elastic
I get lots of logs like this one:
{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", "component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", "node
.name": "clearml", "message": "successfully reloaded changed geoip database file [/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7s
YM-S9Q/GeoLite2-City.mmdb]", "cluster.uuid": "vTBu4rY0QhWZV01xnswjDg", "node.id": "cX7aMqJ4SwCxqM7sYM-S9Q"}
But only INFO logs
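If the container log only shows INFO lines, the rejections may still be visible in ES's own write thread-pool counters, which is where es_rejected_execution_exception comes from when the write queue overflows. A minimal sketch, assuming you can reach ES on port 9200 (e.g. by exposing the port or exec-ing into the clearml-elastic container; the host/port here are assumptions):

import requests

# ES _cat API: active/queued/rejected counts for the write thread pool.
# A growing "rejected" column matches the es_rejected_execution_exception errors.
resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/write",
    params={"v": "true", "h": "node_name,name,active,queue,rejected"},
)
print(resp.text)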
SuccessfulKoala55 At peak we’ve been running ~50 experiments simultaneously that have been somewhat generous in reported metrics, although not extreme. Our ClearML server is hosted on an Azure D2S_v3 VM (2 vCPU, 8 GB RAM, 3200 IOPS). Looks like we should probably upgrade, especially the disk specs. (Taking another look at our VM metrics, we reached 100% OS disk IOPS consumed a couple of times.)
Were there any changes to your Elastic or your server in the past few days?