
Hey 👋

TensorBoard Logs Overwhelming Elasticsearch

I am running a ClearML server; however, when running experiments with TensorBoard logging I am seeing the Elasticsearch indexing time increase drastically, and in some cases I have also seen timeout errors on the tasks trying to upload metrics.

2023-12-05 23:39:35,585 - clearml.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>

In one case I was able to resolve this by turning off TensorBoard logging, setting auto_connect_frameworks={"tensorboard": False} on Task.init. Ideally, though, I would like to keep the TB logs on.
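
For reference, this is roughly what that workaround looks like (a minimal sketch; the project and task names are placeholders):

from clearml import Task

# Disable TensorBoard auto-logging for this task (placeholder project/task names)
task = Task.init(
    project_name="my-project",                      # placeholder
    task_name="train-without-tb-capture",           # placeholder
    auto_connect_frameworks={"tensorboard": False},  # skip TB event capture
)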

Has anyone else had this issue?
Thanks for any help 🙂

  
  
Posted 5 months ago

Answers 6


Currently running it on a t3.xlarge, which has 4 vCPUs, 16 GB RAM, and a 300 GB SSD.

  
  
Posted 5 months ago

... training script was set to upload every epoch. Seems like this resulted in a torrent of metrics being uploaded.

Oh, that makes sense. So basically you were bombarding the server with requests and ending up with a kind of denial of service.

  
  
Posted 5 months ago

For an update 🙂
I think we identified the issue: when moving from the training dataset to a fine-tuning dataset (which was 1/1000th the size), our training script was still set to upload metrics every epoch. It seems this resulted in a torrent of metrics being uploaded.

Since modifying this to report less frequently, we have seen the index latency drop dramatically.
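
For anyone hitting the same thing, a minimal sketch of the kind of change we made (assuming a PyTorch-style loop with TensorBoard's SummaryWriter; the interval, epoch count, and loss are placeholders):

from torch.utils.tensorboard import SummaryWriter

REPORT_EVERY = 10                # assumed reporting interval, tune to taste
NUM_EPOCHS = 100                 # placeholder

writer = SummaryWriter(log_dir="runs/finetune")
for epoch in range(NUM_EPOCHS):
    train_loss = 1.0 / (epoch + 1)        # placeholder for the real epoch loss
    if epoch % REPORT_EVERY == 0:         # report every N epochs instead of every epoch
        writer.add_scalar("loss/train", train_loss, epoch)
writer.close()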

  
  
Posted 5 months ago

Yep, almost a self-DDoS.

  
  
Posted 5 months ago

Hi @<1590152178218045440:profile|HarebrainedToad56>
Yes, you are correct: all TB logs are stored in Elasticsearch in the ClearML backend. This usually scales well and rarely has issues, as long, of course, as the clearml-server is running on a strong enough machine. How much RAM / HD do you have on the clearml-server?

  
  
Posted 5 months ago

How much free RAM / disk do you have there now? How's the CPU utilization? How many Tasks are working with this machine at the same time?

  
  
Posted 5 months ago