Hello, We Are Getting Following Timeout Errors During The Task Run:

Answered

Hello,
We are getting the following timeout errors during the task run:

2023-08-10 13:53:36,361 - clearml.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>

In API container logs we see:

[2023-08-18 14:58:53,255] [8] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 241121ms, msg=General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)))

Is there a way to increase the timeout? If I understand correctly, this is a read timeout on the connection from the ClearML API server to Elasticsearch, so it should be configurable somewhere in the ClearML client or server settings.

WebApp: 1.3.0-165 • Server: 1.3.0-165 • API: 2.17

  				
Posted one year ago by LackadaisicalHedgehong78


Answers

The timeout can be changed with this environment variable on the apiserver:
CLEARML__hosts__elastic__events__args__timeout=<new number>
That said, a better fix would be to either increase the Elasticsearch capacity (memory and CPU) or reduce the load on it (send events in smaller batches).
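To make the variable name easier to read, here is a minimal Python sketch of how such an override could map onto a nested setting. It assumes the double underscores separate nested configuration keys and that the value is a number of seconds (the logs above show `read timeout=60`). The helper names `env_override_to_path` and `apply_override` are hypothetical and only for illustration; this is not the apiserver's actual code.

```python
from typing import Any, Dict, List

PREFIX = "CLEARML__"  # prefix taken from the variable name in the answer above


def env_override_to_path(name: str) -> List[str]:
    """Split an override variable name into its nested config keys.

    Assumption: "__" is the separator between nesting levels.
    """
    if not name.startswith(PREFIX):
        raise ValueError(f"not a ClearML override variable: {name}")
    return name[len(PREFIX):].split("__")


def apply_override(config: Dict[str, Any], name: str, value: str) -> Dict[str, Any]:
    """Write the value into a nested dict at the path encoded in the name."""
    keys = env_override_to_path(name)
    node = config
    for key in keys[:-1]:
        # Create intermediate levels on demand.
        node = node.setdefault(key, {})
    node[keys[-1]] = int(value)  # assumed to be an integer number of seconds
    return config


if __name__ == "__main__":
    name = "CLEARML__hosts__elastic__events__args__timeout"
    print(".".join(env_override_to_path(name)))
    print(apply_override({}, name, "120"))
```

Read this way, the variable targets the `timeout` argument of the Elasticsearch `events` connection. The value 120 is only an example, not a recommended setting.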

  				
Posted one year ago by CostlyOstrich36
