Hi, we are currently seeing the following error in the logs of our ClearML apiserver pod:
[2024-03-20 15:33:32,089] [8] [WARNING] [elasticsearch] POST None [status:429 request:0.001s]
[2024-03-20 15:33:32,089] [8] [ERROR] [clearml.__init__] Failed processing worker status report
Traceback (most recent call last):
File "/opt/clearml/apiserver/bll/workers/__init__.py", line 153, in status_report
self.log_stats_to_es(
File "/opt/clearml/apiserver/bll/workers/__init__.py", line 557, in log_stats_to_es
es_res = elasticsearch.helpers.bulk(
self.es_client, actions)
File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 410, in bulk
for ok, item in streaming_bulk(
File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 329, in streaming_bulk
for data, (ok, info) in zip(
File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 256, in _process_bulk_chunk
for item in gen:
File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 195, in _process_bulk_chunk_error
raise error
File "/usr/local/lib/python3.9/site-packages/elasticsearch/helpers/actions.py", line 240, in _process_bulk_chunk
resp = client.bulk(*args, body="\n".join(bulk_actions) + "\n", **kwargs)
File "/usr/local/lib/python3.9/site-packages/elasticsearch/client/utils.py", line 347, in _wrapped
return func(*args, params=params, headers=headers, **kwargs)
File "/usr/local/lib/python3.9/site-packages/elasticsearch/client/__init__.py", line 472, in bulk
return self.transport.perform_request(
File "/usr/local/lib/python3.9/site-packages/elasticsearch/transport.py", line 466, in perform_request
raise e
File "/usr/local/lib/python3.9/site-packages/elasticsearch/transport.py", line 427, in perform_request
status, headers_response, data = connection.perform_request(
File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/http_urllib3.py", line 291, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/base.py", line 328, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.TransportError: TransportError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [<http_request>] would be [1057463944/1008.4mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1057460904/1008.4mb], new bytes reserved: [3040/2.9kb], usages [inflight_requests=3040/2.9kb, request=0/0b, fielddata=9261/9kb, eql_sequence=0/0b, model_inference=0/0b]')
[2024-03-20 15:33:32,090] [8] [ERROR] [clearml.service_repo] Returned 500 for workers.status_report in 5ms, msg=General data error (Failed processing worker status report): err=429
I am not sure what to make of this message: is ClearML actually attempting an HTTP request with nearly a GB of data?
I suspect it has something to do with an agent machine we recently added as a worker to the ClearML server, but I do not understand where such a large amount of data would come from: we currently have no tasks in the queue and only ever had one task queued (which was processed successfully), with around 1 MB of data.
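As a sanity check I tried to break down the numbers in the breaker message myself (rough sketch; I am assuming the 972.7mb limit is Elasticsearch's default parent breaker setting of 95% of the JVM heap, that is my guess and not something I verified in our config):

# Sanity check on the figures from the circuit_breaking_exception above.
# Assumption (mine, not verified against our deployment): the limit is the
# default parent breaker limit of 95% of the JVM heap.

real_usage = 1_057_460_904   # "real usage" in bytes
new_bytes = 3_040            # "new bytes reserved" for this bulk request
limit = 1_020_054_732        # parent breaker limit in bytes

print(real_usage + new_bytes)   # 1057463944 -> matches the "would be" figure
print(limit / 0.95 / 2**30)     # ~1.0 -> implies a ~1 GiB JVM heap
print(new_bytes / 1024)         # ~3 -> the bulk request itself is only ~3 KiB
print(real_usage / 2**20)       # ~1008.5 MiB -> the heap is already almost full

If that reading is correct, the request ClearML sends is tiny (the worker status report, ~3 KB), and the 1008.4mb figure is the heap Elasticsearch is already using, so essentially any request would trip the parent breaker. Does that interpretation sound right, and would the fix then be to give the Elasticsearch container more heap?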