Answered
I'Ve Had This Bug Where Every Few Weeks All My Current Running Experiments Are Stopped And Then Deleted. This Has Now Happend Like 3-4 Times. I Dont Understand What Is Causing It. Model Files, Debug Images Are Saved In Fileserver Folder, But The Task Itse

I've had this bug where every few weeks all my currently running experiments are stopped and then deleted. This has now happened 3-4 times. I don't understand what is causing it. Model files and debug images are saved in the fileserver folder, but the task itself is nowhere to be seen in the web GUI or the SDK. I recently even updated clearml to the latest version, thinking it was a bug. Has anything like this occurred to other people?

  
  
Posted 11 months ago

Answers 25


Here are my clearml versions, and Elasticsearch is taking up 50GB

  
  
Posted 11 months ago

from docker inspect I can see that allegorai/clearml uses:
"CLEARML_SERVER_VERSION=1.11.0",
"CLEARML_SERVER_BUILD=373"

Image hash: ed05631045c4237f59ad48f477e06dd72274ab67e70d2f9adc489431d1ce75d7

  
  
Posted 11 months ago

Regarding the missing tasks, this is the first I'm hearing of such an issue - is it possible you're somehow reusing the task for a ClearML Dataset, causing it to be marked as hidden?

  
  
Posted 11 months ago

image

  
  
Posted 11 months ago

I do notice another strange thing
Agent-services is down because it has no API key for clearml

  
  
Posted 11 months ago

Hi @<1590514584836378624:profile|AmiableSeaturtle81> , size can grow, of course, depending on your usage. 50GB is a lot, which is probably a good reason to clean up unused/old tasks. A 50GB index/shard on a single-node ES can certainly cause delays or slowness (depending on the machine's processing power and memory).

  
  
Posted 11 months ago

@<1523701087100473344:profile|SuccessfulKoala55> Anything on this?

  
  
Posted 11 months ago

Elasticsearch also takes like 15GB of RAM

  
  
Posted 11 months ago

Elasticsearch spazzing out
image

  
  
Posted 11 months ago

Hi, thanks for reaching out. Getting desperate here.
Yes, it's self-hosted
No, only currently running experiments are deleted (the task itself is gone, but debug images and models are present in the fileserver folder)

What I do see is some random Elasticsearch errors popping up from time to time

[2024-01-05 09:16:47,707] [9] [WARNING] [elasticsearch] POST None [status:N/A request:60.064s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
    response = self.pool.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 386, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)

Another thing I noticed is that the Elasticsearch folder has grown to a gigantic size. Is that normal? Can I clear it up somehow without causing problems?
It's 50GB currently

  
  
Posted 11 months ago

Hi @<1590514584836378624:profile|AmiableSeaturtle81> , can you attach the text logs of ES? It seems that for some reason all of your shards fail, I'm not sure why. Also, the size is usually a by-product of the number of tasks in the system and of how much data you're storing for them (console logs, events, etc.).
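
To see which indices account for the size, something like the sketch below may help. It queries Elasticsearch's `_cat/indices` API (with `format=json` and `bytes=b`) and ranks the indices by on-disk size. The base URL `http://localhost:9200` is an assumption and depends on how the ES container's port is exposed in your deployment; the helper names `fetch_indices` and `largest_indices` are made up for this example.

```python
import json
from urllib.request import urlopen


def fetch_indices(base_url="http://localhost:9200"):
    """Return one dict per index from the _cat/indices API, with sizes in bytes.

    base_url is an assumption: point it at wherever your ES container is reachable.
    """
    url = f"{base_url}/_cat/indices?format=json&bytes=b"
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def largest_indices(rows, top=5):
    """Sort _cat/indices rows by on-disk size, largest first.

    Values arrive as strings (or None for closed indices), so convert defensively.
    """
    sized = [
        (row["index"], int(row.get("store.size") or 0), int(row.get("docs.count") or 0))
        for row in rows
    ]
    return sorted(sized, key=lambda item: item[1], reverse=True)[:top]
```

Calling `largest_indices(fetch_indices())` returns `(index name, size in bytes, document count)` tuples, which should show whether the space is going mostly to logs, scalars, or something else.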

  
  
Posted 11 months ago

  1. Is 50GB for Elasticsearch normal? Have you seen it elsewhere, or are we doing something wrong? One thing I suspect is that we are logging too frequently.
  2. Is it possible to clean this up somehow?
  
  
Posted 11 months ago

I have also noticed that this incident usually happens in the morning at around 6-7AM
Are there maybe some cleanup tasks or backups running on the clearml server at those times?
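
One way to check the time-of-day pattern is to count warnings and errors per hour in the apiserver log. This is only a rough sketch: it assumes every log line starts like the `[2024-01-05 09:16:47,707] [9] [WARNING] [elasticsearch] ...` line quoted earlier in this thread, and the function name `problems_per_hour` is made up for the example.

```python
import re
from collections import Counter

# Matches the line prefix seen in the log above:
# [YYYY-MM-DD HH:MM:SS,mmm] [pid] [LEVEL]
LINE_RE = re.compile(
    r"^\[\d{4}-\d{2}-\d{2} (\d{2}):\d{2}:\d{2},\d+\] \[\d+\] \[(WARNING|ERROR)\]"
)


def problems_per_hour(lines):
    """Count WARNING/ERROR log lines per hour of day (0-23).

    Lines that do not match the prefix (traceback bodies, INFO lines) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feeding it the lines of a log file (for example `problems_per_hour(open(path))`, where `path` is wherever your apiserver log lives) gives a per-hour count, so a spike at hours 6 or 7 would stand out.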

  
  
Posted 11 months ago

image

  
  
Posted 11 months ago

Is that supposed to be so? How do I fix it?

  
  
Posted 11 months ago

The incident happened last Friday (5 January)
I'm giving you logs from around that time
image

  
  
Posted 11 months ago

Getting errors in Elasticsearch when deleting tasks, and I get "cant delete experiment" returned
image

  
  
Posted 11 months ago

Hi @<1590514584836378624:profile|AmiableSeaturtle81> , what server version are you using? In general, scalars and logs are always deleted. Files and models are deleted by the server in recent versions, and were deleted by the UI directly in previous versions - not sure what version you're using

  
  
Posted 11 months ago

I would upgrade the server.
Regarding the agent, you do need to set it up with key and secret as part of the installation process
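
As a quick sanity check, a sketch like the one below can tell you whether the credentials are visible in the environment the agent runs in. It assumes the credentials are passed via the `CLEARML_API_HOST`, `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY` environment variables rather than a `clearml.conf` file; the helper name `missing_credentials` is made up for this example.

```python
import os

# Environment variables the ClearML SDK and agent read for the server address
# and credentials (assumption: env vars are used instead of clearml.conf).
REQUIRED_VARS = (
    "CLEARML_API_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running `missing_credentials()` inside the agent-services container should return an empty list once the key and secret are configured.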

  
  
Posted 11 months ago

image

  
  
Posted 11 months ago

I see the debug images in fileserver folder

  
  
Posted 11 months ago

after task is deleted

  
  
Posted 11 months ago

We are cleaning, but there is a major problem
When deleting a task from the web UI, nothing is deleted elsewhere
Debug images are not deleted, models are not deleted. And I suspect that scalars and logs are not deleted either
I'm not sure why that is

  
  
Posted 11 months ago

What do you mean by reusing the task for a ClearML Dataset? Got a code example?
We have multiple different projects with multiple people working on each project.
This is our most used code for dataset uploading
image

  
  
Posted 11 months ago

Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I'm guessing it's a self deployed server. What version are you on? Did you ever see any errors/issues in mongodb/elastic?

Do you mean that ALL experiments are being deleted from all projects?

  
  
Posted 11 months ago