Here are my ClearML versions. Elasticsearch is taking up 50GB.
From docker inspect I can see that allegroai/clearml uses:
"CLEARML_SERVER_VERSION=1.11.0",
"CLEARML_SERVER_BUILD=373"
Image hash: ed05631045c4237f59ad48f477e06dd72274ab67e70d2f9adc489431d1ce75d7
Regarding the missing tasks, this is the first I'm hearing of such an issue - is it possible you're somehow reusing the task for a ClearML Dataset, causing it to be marked as hidden?
I also noticed another strange thing:
Agent-services is down because it has no API key for ClearML
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , size can grow, of course, depending on your usage. 50GB is a lot, which is probably a good reason to clean up unused/old tasks. A 50GB index/shard on a single-node ES can certainly cause delays or slowness (depending on the machine's processing power and memory).
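For the "clean up unused/old tasks" part, here is a minimal sketch using the ClearML SDK. It is an illustration under assumptions, not an official procedure: the helper names, the 30-day default, and the choice to target only archived tasks are mine, and the `status_changed` filter string is modeled on the pattern used in ClearML's cleanup service example. Try it with `dry_run=True` first.

```python
from datetime import datetime, timedelta, timezone


def archived_older_than(days, now=None):
    """Build a task filter for archived tasks whose status last changed
    more than `days` days ago (hypothetical helper, not a ClearML API)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return {
        "system_tags": ["archived"],
        # "<timestamp" means "status changed before this point in time"
        "status_changed": ["<" + cutoff.strftime("%Y-%m-%dT%H:%M:%S")],
    }


def delete_old_archived_tasks(days=30, dry_run=True):
    """List (and optionally delete) old archived tasks; returns how many matched."""
    # Imported lazily so the filter helper can be used without clearml installed.
    from clearml import Task

    tasks = Task.get_tasks(task_filter=archived_older_than(days))
    for task in tasks:
        print(task.id, task.name)
        if not dry_run:
            # Also asks the server to remove artifacts and models owned by the task.
            task.delete(delete_artifacts_and_models=True, raise_on_error=False)
    return len(tasks)
```

Calling `delete_old_archived_tasks(days=30)` only prints the matching tasks; pass `dry_run=False` once the list looks right.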
@<1523701087100473344:profile|SuccessfulKoala55> Anything on this?
Elasticsearch also takes about 15GB of RAM
Hi, thanks for reaching out. We're getting desperate here.
Yes, it's self-hosted
No, only currently running experiments are deleted (the task itself is gone, but debug images and models are still present in the fileserver folder)
What I do see is some random Elasticsearch errors popping up from time to time
[2024-01-05 09:16:47,707] [9] [WARNING] [elasticsearch] POST
None [status:N/A request:60.064s]
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
response = self.pool.urlopen(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 386, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)
Another thing I noticed is that the Elasticsearch folder has grown to a gigantic size. Is that normal? Can I clear it up somehow without causing problems?
It's 50GB currently
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , can you attach the text logs of ES? It seems that for some reason all of your shards fail, I'm not sure why. Also, the size is usually a byproduct of the number of tasks in the system and the amount of data you're storing (console logs, events, etc.).
- Is 50GB for Elasticsearch normal? Have you seen it elsewhere, or are we doing something wrong? One thing I suspect is that we are logging too frequently
- Is it possible to somehow clean this up?
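To see where the 50GB actually goes, one option is to ask Elasticsearch for per-index sizes through its `_cat/indices` API. The sketch below assumes ES is reachable at `http://localhost:9200` (in a default docker-compose setup the port may not be published to the host, so adjust the URL); the function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def largest_indices(cat_rows, top=10):
    """Sort rows from GET /_cat/indices?format=json&bytes=b by store size,
    largest first. Returns (index name, size in bytes, doc count) tuples."""
    sized = [
        (
            row["index"],
            int(row.get("store.size") or 0),  # may be null for closed indices
            int(row.get("docs.count") or 0),
        )
        for row in cat_rows
    ]
    return sorted(sized, key=lambda r: r[1], reverse=True)[:top]


def fetch_cat_indices(es_url="http://localhost:9200"):
    """Fetch index stats as JSON; es_url is an assumption about your setup."""
    with urlopen(es_url + "/_cat/indices?format=json&bytes=b", timeout=30) as resp:
        return json.load(resp)


def main():
    for name, size_bytes, docs in largest_indices(fetch_cat_indices()):
        print(f"{size_bytes / 1e9:8.2f} GB  {docs:>12} docs  {name}")
```

Running `main()` prints the ten largest indices, which should show whether scalars, console logs, or other events dominate the disk usage.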
I have also noticed that this incident usually happens in the morning, at around 6-7 AM
Are there maybe some cleanup tasks or backups running on the ClearML server at those times?
Is that supposed to be so? How do we fix it?
The incident happened last Friday (5 January)
I'm giving you logs from around that time
I'm getting errors in Elasticsearch when deleting tasks, and "cant delete experiment" is returned
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , what server version are you using? In general, scalars and logs are always deleted. Files and models are deleted by the server in recent versions, and were deleted by the UI directly in previous versions - not sure what version you're using
I would upgrade the server.
Regarding the agent, you do need to set it up with key and secret as part of the installation process
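As a quick sanity check for the key/secret setup, a small script like the one below can be run inside the agent-services container. `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY` are the environment variables the ClearML SDK reads for credentials; how they reach the container depends on your docker-compose setup, and the helper name is hypothetical.

```python
import os

# Environment variables the ClearML SDK reads for API credentials.
REQUIRED = ("CLEARML_API_ACCESS_KEY", "CLEARML_API_SECRET_KEY")


def missing_credentials(env=None):
    """Return the names of credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

If `missing_credentials()` returns a non-empty list, the agent has no usable credentials and will not be able to talk to the API server.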
I see the debug images in the fileserver folder
We are cleaning up, but there is a major problem
When deleting a task from the web UI, nothing is deleted elsewhere
Debug images are not deleted, and models are not deleted. I suspect that scalars and logs are not deleted either
I'm not sure why that is
What do you mean by reusing the task for a ClearML Dataset? Do you have a code example?
We have multiple different projects with multiple people working on each project.
This is our most-used code for uploading datasets
Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I'm guessing it's a self-deployed server. What version are you on? Did you ever see any errors/issues in mongodb/elastic?
Do you mean that ALL experiments are being deleted from all projects?