I Finally Got The Cleanup_Service.Py To Run. However, Now I Get Errors When Trying To Load Scalars. This Is What I Found In The Logs

Answered

I finally got the cleanup_service.py to run. However, now I get errors when trying to load scalars.
This is what I found in the logs
[ERROR] [clearml.service_repo] Returned 500 for events.scalar_metrics_iter_histogram in 43ms, msg=General data error (TransportError(503, 'search_phase_execution_exception'))😕
EDIT: When running the cleanup_service I get a lot of the following warnings:
WARNING:root:Could not delete Task ID=52af6f91081b41b9b6781eaad2a22d8b, General data error (TransportError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [<http_request>] would be [2059419100/1.9gb], which is larger than the limit of [1972122419/1.8gb], real usage: [2059418976/1.9gb], new bytes reserved: [124/124b], usages [request=0/0b, fielddata=106526/104kb, in_flight_requests=3055516/2.9mb, accounting=183194406/174.7mb]'))

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Votes Newest

Answers 30

1.9gb?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Seems to happen only while the cleanup_service is running!

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

This might be caused by the service trying to delete too many tasks?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

So I just tried again, but with manual deleting via Web UI.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Seems like some experiments cannot be deleted

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

but you did see the data there after you upgraded the server, right?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

And if you shut down the server and start it up again? still no data?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Yes.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Perhaps more memory to the ES service will solve the issue?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

8GB

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

[2021-05-07 10:52:00,282] [9] [WARNING] [elasticsearch] POST ` [status:N/A request:60.058s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
response.begin()
File "/usr/lib64/python3.6/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib64/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 507, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)
[2021-05-07 10:52:00,320] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
[2021-05-07 10:52:00,350] [9] [WARNING] [elasticsearch] POST [status:409 request:0.067s] `

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

In the WebUI it just shows that an error happened after the loading bar has been running for a while.
I tried to delete the same tasks again and this time, it instantly confirmed deletion and the tasks are gone.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

What do you mean everything is gone?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

ReassuredTiger98 What are the memory settings for Elasticsearch in your docker compose? If it is 2 Gb and you have enough memory on your server then you can try to increase it to 4gb like this: ES_JAVA_OPTS: -Xms4g -Xmx4g

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AppetizingMouse58
				
					0

But the problems seem to be reoccuring

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

No idea what's happening there.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Okay, I will increase it and try again.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

I have no idea whether it is a user error or because of the clearml-server update...

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

[2021-05-07 10:53:00,566] [9] [WARNING] [elasticsearch] POST ` [status:N/A request:60.061s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
response.begin()
File "/usr/lib64/python3.6/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib64/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 507, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)
[2021-05-07 10:53:00,621] [9] [WARNING] [elasticsearch] POST [status:409 request:0.054s] `

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

2Gb

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

It's strange since your issues seem to be with the ES service, but the experiments themselves are stored in mongodb

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

How much memory did you use?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

I am not sure what happened, but my experiments are gone. However, the data directory is still filled.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Hard to answer now. I just wiped everything and reinstalled. If I encounter this problem again, I will investigate further.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

What's ES?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

No problem in my case at least.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

I got the error again. Seems to happen only when I try to delete "large" experiments.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

I guess it started with the usage of the cleanup_service.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

When I select many experiments it will only delete some and show an error message, that some could not be deleted. But if I only select a few, everything works fine.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Now for some reason everything is gone .. 😕

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Write your answer

2K Views

30 Answers

4 years ago

2 years ago