Getting errors in Elasticsearch when deleting tasks, get returned "cant delete experiment"
Hi, thanks for reaching out. Getting desperate here.
Yes, it's self-hosted.
No, only currently running experiments are deleted (the task itself is gone, but debug images and models are still present in the fileserver folder).
What I do see is some random Elasticsearch errors popping up from time to time:
[2024-01-05 09:16:47,707] [9] [WARNING] [elasticsearch] POST
None [status:N/A request:60.064s]
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
response = self.pool.urlopen(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 386, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)
Another thing I noticed is that the Elasticsearch folder has grown to a gigantic size. Is that normal? Can I clear it up somehow without causing problems?
It's 50GB currently.
Elasticsearch also takes around 15GB of RAM.
Here are my ClearML versions and Elasticsearch taking up 50GB.
Regarding the missing tasks, this is the first I'm hearing of such an issue - is it possible you're somehow reusing the task for a ClearML Dataset, causing it to be marked as hidden?
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , can you attach the text logs of ES? It seems that for some reason all of your shards fail, I'm not sure why. Also, the size is usually a by-product of the number of tasks in the system and the amount of data you're storing for them (console logs, events, etc.).
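To go with the ES logs, a quick look at cluster health can show whether shards are unassigned. This is only a minimal sketch: it assumes the host and port from the traceback above (`elasticsearch:9200`) are reachable from where you run it, and it uses the standard `_cluster/health` endpoint. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: host and port copied from the traceback; adjust for your deployment.
ES_URL = "http://elasticsearch:9200"


def summarize_health(health):
    """Turn a parsed _cluster/health response into a one-line summary."""
    return "status={} active_shards={} unassigned_shards={}".format(
        health.get("status", "unknown"),
        health.get("active_shards", 0),
        health.get("unassigned_shards", 0),
    )


def fetch_health(base_url=ES_URL, timeout=10.0):
    """Fetch cluster health as a dict (needs network access to Elasticsearch)."""
    with urlopen(base_url + "/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def main():
    print(summarize_health(fetch_health()))
```

Calling `main()` from inside the apiserver container (or anywhere that can resolve the `elasticsearch` host) should print one line; a `red` status or a non-zero unassigned count would be worth including with the logs.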
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I'm guessing it's a self-deployed server. What version are you on? Did you ever see any errors/issues in MongoDB/Elasticsearch?
Do you mean that ALL experiments are being deleted from all projects?
What do you mean by reusing the task for a ClearML Dataset? Do you have a code example?
We have multiple different projects with multiple people working on each project.
This is our most used code on dataset uploading
- Is a 50GB Elasticsearch folder normal? Have you seen it elsewhere, or are we doing something wrong? One thing I suspect is that we are logging too frequently.
- Is it possible to clean this up somehow?
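Before cleaning anything, it may help to see which indices actually hold the 50GB. A rough sketch, assuming Elasticsearch is reachable at `elasticsearch:9200` (the host and port from the traceback) and using the standard `_cat/indices` endpoint with JSON output; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: host and port copied from the traceback; adjust for your deployment.
ES_URL = "http://elasticsearch:9200"


def largest_indices(rows, top=10):
    """Return (index_name, size_in_bytes) pairs, largest first."""
    # "store.size" can be null (e.g. for closed indices), so fall back to 0.
    sizes = [(row["index"], int(row.get("store.size") or 0)) for row in rows]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:top]


def fetch_indices(base_url=ES_URL, timeout=30.0):
    """Fetch index stats as a list of dicts, with sizes reported in bytes."""
    url = base_url + "/_cat/indices?format=json&bytes=b"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def main():
    for name, size in largest_indices(fetch_indices()):
        print("{:8.2f} GiB  {}".format(size / 1024 ** 3, name))
```

Calling `main()` lists the ten largest indices. This only reads statistics; it does not delete anything.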
The incident happened last Friday (5 January).
I'm giving you logs from around that time.
I have also noticed that this incident usually happens in the morning, at around 6-7 AM.
Are there maybe some cleanup tasks or backups running on the ClearML server at those times?
@<1523701087100473344:profile|SuccessfulKoala55> Anything on this?
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , size can grow, of course, depending on your usage. 50GB is a lot, which is probably a good reason to clean up unused/old tasks. A 50GB index/shard on a single-node ES can certainly cause delays or slowness (depending on the machine's processing power and memory).
We are cleaning up, but there is a major problem.
When deleting a task from the web UI, nothing is deleted elsewhere.
Debug images are not deleted, models are not deleted. And I suspect that scalars and logs are not deleted either.
I'm not sure why that is.
I can still see the debug images in the fileserver folder.
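To put numbers on what is left behind, something like the following could report disk usage per top-level folder of the fileserver. It is a sketch only: the root path below is an assumption (the usual data directory of a docker-compose deployment) and may differ on your machine, and the function names are made up.

```python
import os

# Assumption: typical fileserver data path for a docker-compose install; adjust.
FILESERVER_ROOT = "/opt/clearml/data/fileserver"


def dir_size(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def usage_by_subdir(root):
    """Map each top-level directory under root to its total size in bytes."""
    with os.scandir(root) as entries:
        return {
            entry.name: dir_size(entry.path)
            for entry in entries
            if entry.is_dir(follow_symlinks=False)
        }


def main():
    usage = usage_by_subdir(FILESERVER_ROOT)
    for name, size in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        print("{:10.1f} MiB  {}".format(size / 1024 ** 2, name))
```

Running `main()` before and after deleting a task from the UI would show whether any files were actually removed.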
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , what server version are you using? In general, scalars and logs are always deleted. Files and models are deleted by the server in recent versions, and were deleted by the UI directly in previous versions - not sure what version you're using
from docker inspect I can see that allegorai/clearml uses:
"CLEARML_SERVER_VERSION=1.11.0",
"CLEARML_SERVER_BUILD=373"
Image hash:ed05631045c4237f59ad48f477e06dd72274ab67e70d2f9adc489431d1ce75d7
I do notice another strange thing
Agent-services is down because it has no API key for ClearML.
I would upgrade the server.
Regarding the agent, you do need to set it up with key and secret as part of the installation process
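A small check like this can confirm whether credentials are visible to the agent-services container. It is a sketch under an assumption: that the agent picks up its key and secret from the `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY` environment variables. Please verify those names against your own docker-compose file; `missing_credentials` is a hypothetical helper.

```python
import os

# Assumption: environment variable names the agent reads its credentials from.
REQUIRED_VARS = ("CLEARML_API_ACCESS_KEY", "CLEARML_API_SECRET_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def main():
    missing = missing_credentials()
    if missing:
        print("Missing credentials: " + ", ".join(missing))
    else:
        print("Both credential variables are set.")
```

Run `main()` inside the agent-services container. If either variable is reported missing, generate a key and secret in the web UI and pass them to the container, then restart it.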
Is that supposed to be so? How do I fix it?