I am not sure what happened, but my experiments are gone. However, the data directory is still filled.
In the WebUI it just shows that an error happened after the loading bar has been running for a while.
I tried to delete the same tasks again and this time, it instantly confirmed deletion and the tasks are gone.
Now for some reason everything is gone .. 😕
ReassuredTiger98 What are the memory settings for Elasticsearch in your docker compose? If it is 2 Gb and you have enough memory on your server then you can try to increase it to 4gb like this: ES_JAVA_OPTS: -Xms4g -Xmx4g
I have no idea whether it is a user error or because of the clearml-server update...
It's strange since your issues seem to be with the ES service, but the experiments themselves are stored in mongodb
I guess it started with the usage of the cleanup_service.
Hard to answer now. I just wiped everything and reinstalled. If I encounter this problem again, I will investigate further.
Seems to happen only while the cleanup_service is running!
I got the error again. Seems to happen only when I try to delete "large" experiments.
And if you shut down the server and start it up again? still no data?
[2021-05-07 10:53:00,566] [9] [WARNING] [elasticsearch] POST
` [status:N/A request:60.061s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
response.begin()
File "/usr/lib64/python3.6/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib64/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 507, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)
[2021-05-07 10:53:00,621] [9] [WARNING] [elasticsearch] POST [status:409 request:0.054s] `
but you did see the data there after you upgraded the server, right?
So I just tried again, but with manual deleting via Web UI.
This might be caused by the service trying to delete too many tasks?
Perhaps more memory to the ES service will solve the issue?
Okay, I will increase it and try again.
[2021-05-07 10:52:00,282] [9] [WARNING] [elasticsearch] POST
` [status:N/A request:60.058s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
response.begin()
File "/usr/lib64/python3.6/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib64/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 507, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)
[2021-05-07 10:52:00,320] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 2ms
[2021-05-07 10:52:00,350] [9] [WARNING] [elasticsearch] POST [status:409 request:0.067s] `
When I select many experiments it will only delete some and show an error message, that some could not be deleted. But if I only select a few, everything works fine.
Seems like some experiments cannot be deleted