Hey All, We Have A Self-Hosted Clearml Server, Today We Launched ~40 Workers To Run Training Jobs On Our Queue, However We Started Getting Errors With Elasticsearch Connection Pool Being Full, And Subsequent Timeouts And Failed Tasks. Each Training Task


Hey all,
We have a self-hosted ClearML server. Today we launched ~40 workers to run training jobs on our queue, but we started getting errors about the Elasticsearch connection pool being full, followed by timeouts and failed tasks.

Each training task plots a lot of scalar metrics at every step. We noticed the server's disk usage was very high, especially reads. CPU and memory usage were also up, but they didn't seem concerning.

I'm wondering if anyone understands how Elasticsearch is used in ClearML server. Is every single scalar reported in its own API call, followed by its own Elasticsearch transaction? Or are metrics collated and processed in batches?

We noticed in the error messages that the Elasticsearch connection pool size is only 10. Is there any way to increase this?

Thanks in advance 😊

  				
Posted one year ago by StaleLeopard22


Answers 2

Additionally, I'd like to understand what is stored in Elasticsearch vs. MongoDB, Redis, etc. From my understanding, it is the metrics and console logs that are stored in Elasticsearch?

I'm thinking the solution may be to reduce the volume of metrics logged by averaging them locally and only reporting them once every 60s or so?

Or is there a way to tune the Elasticsearch config so that it can handle the high volume of requests?
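
For reference, the local-averaging idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `ScalarAverager`, its `report` callback, and the `interval_s` and `clock` parameters are hypothetical names made up for this example, not part of ClearML. The class accumulates values per (title, series) pair and emits a single averaged value once the interval has elapsed.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple

# Hypothetical callback type: (title, series, value, iteration) -> None
ReportFn = Callable[[str, str, float, int], None]


class ScalarAverager:
    """Accumulate scalars locally and emit one averaged value per interval."""

    def __init__(
        self,
        report: ReportFn,
        interval_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._report = report
        self._interval_s = interval_s
        self._clock = clock
        self._sums: Dict[Tuple[str, str], float] = defaultdict(float)
        self._counts: Dict[Tuple[str, str], int] = defaultdict(int)
        self._last_flush = clock()

    def add(self, title: str, series: str, value: float, iteration: int) -> None:
        """Record a value; flush everything if the interval has elapsed."""
        key = (title, series)
        self._sums[key] += value
        self._counts[key] += 1
        if self._clock() - self._last_flush >= self._interval_s:
            self.flush(iteration)

    def flush(self, iteration: int) -> None:
        """Report the mean of each pending series, then reset the buffers."""
        for (title, series), total in self._sums.items():
            self._report(title, series, total / self._counts[(title, series)], iteration)
        self._sums.clear()
        self._counts.clear()
        self._last_flush = self._clock()
```

In a training loop, `report` could be a small wrapper around `Logger.current_logger().report_scalar(title=..., series=..., value=..., iteration=...)` from the ClearML SDK, with `flush` called once more at the end of training so the last partial window is not lost. Whether this actually relieves the Elasticsearch pressure would still need to be measured on the server.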

  				
Posted one year ago by StaleLeopard22

I'd like to understand this as well. I moved my data and model versioning to AWS S3, so can I get rid of the fileserver? And can I use CloudWatch for logs rather than Elasticsearch (which I assume is what handles them now)?

  				
Posted one year ago by SlimyElephant79
