I've Had This Bug Where Every Few Weeks All My Currently Running Experiments Are Stopped And Then Deleted. This Has Now Happened Like 3-4 Times. I Don't Understand What Is Causing It. Model Files, Debug Images Are Saved In Fileserver Folder, But The Task Itse

Answered

I've had this bug where every few weeks all my currently running experiments are stopped and then deleted. This has now happened 3-4 times, and I don't understand what is causing it. Model files and debug images are saved in the fileserver folder, but the task itself is nowhere to be seen in the web GUI or the SDK. I recently even updated ClearML to the latest version, thinking it was a bug. Has anything like this occurred to other people?

  				
Posted one year ago by AmiableSeaturtle81


Answers 25

Hi @<1590514584836378624:profile|AmiableSeaturtle81> , can you attach the text logs of ES? It seems that for some reason all of your shards fail, and I'm not sure why. Also, the size is usually a by-product of the amount of data (and tasks) in the system, and depends on how much you're storing (console logs, events, etc.).

  				
Posted one year ago by SuccessfulKoala55
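For anyone wanting to check the shard state mentioned above before digging through the logs, here is a minimal sketch that reads the Elasticsearch cluster health endpoint. It is not taken from the ClearML docs: the helper names `fetch_json` and `summarize_health` are made up for illustration, and the address in the comment is an assumption (in a docker-compose deployment Elasticsearch may only be reachable from inside the Docker network).

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (raises on timeouts or HTTP errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_health(health):
    """Turn a _cluster/health response (a dict) into a one-line summary."""
    status = health.get("status", "unknown")
    active = health.get("active_shards", 0)
    initializing = health.get("initializing_shards", 0)
    unassigned = health.get("unassigned_shards", 0)
    summary = (
        f"status={status} active_shards={active} "
        f"initializing_shards={initializing} unassigned_shards={unassigned}"
    )
    # "red" means at least one primary shard is not allocated.
    if status == "red":
        summary += " (some primary shards are unavailable)"
    return summary


# Example call (assumed address, adjust to your deployment):
# print(summarize_health(fetch_json("http://localhost:9200/_cluster/health")))
```

On a single-node setup a `yellow` status is usually just unassigned replica shards; `red` is the state that would match primary shards failing.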

Hi @<1590514584836378624:profile|AmiableSeaturtle81> , size can grow, of course, depending on your usage. 50GB is a lot, which is probably a good reason to clean up unused/old tasks. A 50GB index/shard on a single-node ES can certainly cause delays or slowness (depending on the machine's processing power and memory).

  				
Posted one year ago by SuccessfulKoala55
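To see which indices account for the 50GB, the `_cat/indices` API can list per-index sizes. The following is a rough sketch under assumptions, not an official tool: `fetch_indices` and `largest_indices` are hypothetical helper names, and the Elasticsearch address you pass in depends on your deployment.

```python
import json
import urllib.request


def fetch_indices(es_url, timeout=30):
    """Fetch per-index stats from the _cat/indices API as a list of dicts."""
    url = es_url + "/_cat/indices?format=json&bytes=b"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def largest_indices(rows, top=5):
    """Return (index name, size in GiB) pairs, biggest first."""
    sized = []
    for row in rows:
        # With bytes=b the size is a plain byte count; it can be missing
        # (None) for indices that are closed or unavailable.
        size_bytes = int(row.get("store.size") or 0)
        sized.append((row["index"], size_bytes / 1024 ** 3))
    sized.sort(key=lambda pair: pair[1], reverse=True)
    return sized[:top]


# Example call (assumed address, adjust to your deployment):
# for name, gib in largest_indices(fetch_indices("http://localhost:9200")):
#     print(f"{gib:8.2f} GiB  {name}")
```

If the biggest entries turn out to be the console log or event indices, that would fit the explanation above that the size follows the amount of logged data.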

Hi, thanks for reaching out. Getting desperate here.
Yes, it's self-hosted.
No, only currently running experiments are deleted (the task itself is gone, but debug images and models are still present in the fileserver folder).

What I do see is some random Elasticsearch errors popping up from time to time:

[2024-01-05 09:16:47,707] [9] [WARNING] [elasticsearch] POST None [status:N/A request:60.064s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
    response = self.pool.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 386, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)

Another thing I noticed is that the Elasticsearch folder has grown to a gigantic size. Is that normal? Can I clear it up somehow without problems?
It's 50GB currently.

  				
Posted one year ago by AmiableSeaturtle81

I do notice another strange thing:
Agent-services is down because it has no API key for ClearML.

  				
Posted one year ago by AmiableSeaturtle81

Hi @<1590514584836378624:profile|AmiableSeaturtle81> , what server version are you using? In general, scalars and logs are always deleted. Files and models are deleted by the server in recent versions, and were deleted by the UI directly in previous versions - not sure what version you're using

  				
Posted one year ago by SuccessfulKoala55

Hi @<1590514584836378624:profile|AmiableSeaturtle81> , I'm guessing it's a self deployed server. What version are you on? Did you ever see any errors/issues in mongodb/elastic?

Do you mean that ALL experiments are being deleted from all projects?

  				
Posted one year ago by CostlyOstrich36

I see the debug images in fileserver folder

  				
Posted one year ago by AmiableSeaturtle81

I'm getting errors in Elasticsearch when deleting tasks, and get "cant delete experiment" returned.

  				
Posted one year ago by AmiableSeaturtle81

Elasticsearch also takes around 15GB of RAM.

  				
Posted one year ago by AmiableSeaturtle81

I would upgrade the server.
Regarding the agent, you do need to set it up with key and secret as part of the installation process

  				
Posted one year ago by SuccessfulKoala55

after task is deleted

  				
Posted one year ago by AmiableSeaturtle81

Is a 50GB Elasticsearch folder normal? Have you seen it elsewhere, or are we doing something wrong? One thing I suspect is that we are logging too frequently.
Is it possible to clean this up somehow?

  				
Posted one year ago by AmiableSeaturtle81
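One possible way to clean up is to delete old archived tasks through the ClearML SDK, roughly following the pattern of ClearML's cleanup service example. Treat the sketch below as a starting point under assumptions rather than a verified recipe: `status_changed_filter` and `delete_old_archived_tasks` are hypothetical helper names, the filter keys (`status_changed`, `system_tags`) are assumed to be accepted by `Task.get_tasks` in your SDK version, and the 30-day cutoff is arbitrary. It defaults to a dry run, so nothing is deleted until you opt in.

```python
from datetime import datetime, timedelta, timezone


def status_changed_filter(days_old, now=None):
    """Build a filter matching tasks whose status last changed before the cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_old)
    return {"status_changed": ["<{}".format(cutoff.strftime("%Y-%m-%dT%H:%M:%S"))]}


def delete_old_archived_tasks(project_name, days_old=30, dry_run=True):
    """Delete archived tasks in a project that have not changed for `days_old` days."""
    # Imported here so the helper above can be used without clearml installed.
    from clearml import Task

    task_filter = status_changed_filter(days_old)
    task_filter["system_tags"] = ["archived"]  # assumption: only touch archived tasks
    tasks = Task.get_tasks(project_name=project_name, task_filter=task_filter)
    for task in tasks:
        print("would delete" if dry_run else "deleting", task.id, task.name)
        if not dry_run:
            task.delete(
                delete_artifacts_and_models=True,
                skip_models_used_by_other_tasks=True,
                raise_on_error=False,
            )
    return [task.id for task in tasks]
```

Running it once with `dry_run=True` and reading the printed list is a cheap safeguard, given that this thread is about tasks disappearing unexpectedly.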

@<1523701087100473344:profile|SuccessfulKoala55> Anything on this?

  				
Posted one year ago by AmiableSeaturtle81

Regarding the missing tasks, this is the first I'm hearing of such an issue - is it possible you're somehow reusing the task for a ClearML Dataset, causing it to be marked as hidden?

  				
Posted one year ago by SuccessfulKoala55

  				

From docker inspect I can see that allegroai/clearml uses:
"CLEARML_SERVER_VERSION=1.11.0",
"CLEARML_SERVER_BUILD=373"

Image hash: ed05631045c4237f59ad48f477e06dd72274ab67e70d2f9adc489431d1ce75d7

  				
Posted one year ago by AmiableSeaturtle81

Elasticsearch spazzing out

  				
Posted one year ago by AmiableSeaturtle81

Here are my ClearML versions, and Elasticsearch taking up 50GB

  				
Posted one year ago by AmiableSeaturtle81

The incident happened last Friday (5 January).
I'm giving you logs from around that time.

  				
Posted one year ago by AmiableSeaturtle81

We are cleaning, but there is a major problem:
When deleting a task from the web UI, nothing is deleted elsewhere.
Debug images are not deleted, models are not deleted. And I suspect that scalars and logs are not deleted either.
I'm not sure why that is.

  				
Posted one year ago by AmiableSeaturtle81
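To quantify how much is left behind in the fileserver folder after tasks are deleted, one could scan it for task folders whose ID no longer exists on the server. This is only a sketch under assumptions: it assumes task folders are named `<task name>.<task id>` with a 32-character hexadecimal ID, `find_orphan_task_dirs` and `dir_size_bytes` are hypothetical helpers, and the set of known task IDs has to be collected separately (for example through the SDK). It only reports and never deletes anything.

```python
import os
import re

# Assumption: task folders end with ".<32 hex characters>" (the task ID).
TASK_DIR_PATTERN = re.compile(r"\.([0-9a-f]{32})$")


def dir_size_bytes(path):
    """Sum the sizes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total


def find_orphan_task_dirs(fileserver_root, known_task_ids):
    """Yield (path, task_id, size) for task folders whose ID is not in known_task_ids."""
    for root, dirs, _files in os.walk(fileserver_root):
        for name in list(dirs):
            match = TASK_DIR_PATTERN.search(name)
            if not match:
                continue
            dirs.remove(name)  # do not descend into task folders
            task_id = match.group(1)
            if task_id not in known_task_ids:
                path = os.path.join(root, name)
                yield path, task_id, dir_size_bytes(path)
```

Comparing the reported folders against the web UI before removing anything by hand seems prudent, since the same folders also hold the debug images and models of the tasks that vanished.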

What do you mean by reusing the task for a ClearML Dataset? Do you have a code example?
We have multiple different projects with multiple people working on each project.
This is our most used code for dataset uploading:

  				
Posted one year ago by AmiableSeaturtle81

  				

I have also noticed that this incident usually happens in the morning, at around 6-7 AM.
Are there maybe some cleanup tasks or backups running on the ClearML server at those times?

  				
Posted one year ago by AmiableSeaturtle81

Is it supposed to be like that? How can I fix it?

  				
Posted one year ago by AmiableSeaturtle81

  				
