Hi, I Am Getting The Following Errors In The Experiments I Am Currently Running:

Answered

Hi, I am getting the following errors in the experiments I am currently running:
`2021-06-25 17:11:47,911 - clearml.Metrics - ERROR - Action failed <504/0: events.add_batch (<html> <head><title>504 Gateway Time-out</title></head> <body> <center><h1>504 Gateway Time-out</h1></center> </body> </html>)>`
I haven’t changed anything on the server side. Do you have any idea when such an error can appear?

  				
Posted 3 years ago by JitteryCoyote63

Answers 30

Ah, sorry, it’s actually the number of shards that increased

  				
Posted 3 years ago by JitteryCoyote63

from 10 to 11

  				
Posted 3 years ago by JitteryCoyote63

4 CPUs, 8 GB

  				
Posted 3 years ago by JitteryCoyote63

Number of shards?

  				
Posted 3 years ago by SuccessfulKoala55

  				

SuccessfulKoala55 For the past 2 hours I have been getting 504 errors and I cannot SSH into the machine. AWS reports that the instance health checks are failing. Is it safe to restart the instance?

  				
Posted 3 years ago by JitteryCoyote63

I cannot ssh into the machine

That's very strange. Since the server runs in Docker, I don't see how it could make the EC2 instance unavailable. Can you check the EC2 panel to see what the problem might be?

  				
Posted 3 years ago by SuccessfulKoala55

In any case, restarting the instance without shutting down the server in an orderly fashion always carries the risk of damaging the database storage (mongo/elastic etc.)

  				
Posted 3 years ago by SuccessfulKoala55

Seems like it just went unresponsive at some point

  				
Posted 3 years ago by JitteryCoyote63

Something was triggered, you can see the CPU usage starting right when the instance became unresponsive - maybe a merge operation from ES?

  				
Posted 3 years ago by JitteryCoyote63

Here are the disk graphs: the data disk (/opt/clearml) on the left and the OS disk on the right

  				
Posted 3 years ago by JitteryCoyote63

maybe a merge operation from ES

Might be

  				
Posted 3 years ago by SuccessfulKoala55

But according to the disk graphs, the OS disk is being used, not the data disk

  				
Posted 3 years ago by JitteryCoyote63

Can it be that the merge op takes up so much filesystem cache that the rest of the system becomes unresponsive?

  				
Posted 3 years ago by JitteryCoyote63

Again, it's possible - it's also possible that it takes so much memory the system is relying heavily on the swap

  				
Posted 3 years ago by SuccessfulKoala55

There's a reason for the ES index max size 😞

  				
Posted 3 years ago by SuccessfulKoala55

There’s a reason for the ES index max size

Does ClearML enforce a max index size? What typically happens when that limit is reached?

  				
Posted 3 years ago by JitteryCoyote63

The open-source version doesn't enforce any max size - it just keeps indexing

  				
Posted 3 years ago by SuccessfulKoala55

And the ES behavior always depends on the machine and memory - I've seen machines that could handle 100GB indices 🙂

  				
Posted 3 years ago by SuccessfulKoala55

Would adding an ILM (index lifecycle management) policy be an appropriate solution?

  				
Posted 3 years ago by JitteryCoyote63

How would it interact with the clearml-server API service? Would it be completely transparent?

  				
Posted 3 years ago by JitteryCoyote63

Well, currently the open source clearml apiserver depends on the index names in order to retrieve information for experiments, so manipulating these indices externally won't work. What you can do is use an ES cluster and change the index mappings to more than one shard, which would split the index into multiple parts (the size limit is actually per shard, not index). This is all stuff addressed by the paid version, I think

  				
Posted 3 years ago by SuccessfulKoala55

SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This answer, https://stackoverflow.com/a/32256100, seems to say that it’s not possible to change this value after the index is created. Is that true?

  				
Posted 3 years ago by JitteryCoyote63

Thanks for the hint, I’ll check the paid version, but first I’d like to understand how much effort it would take to fix the current situation myself 🙂

  				
Posted 3 years ago by JitteryCoyote63

Also, what is the benefit of defaulting to index.number_of_shards = 1 for the metrics and logs indices? Having more shards makes it possible to scale and to move them to separate nodes later if needed. With the default heap size of 2 GB, that should be possible, shouldn't it?

  				
Posted 3 years ago by JitteryCoyote63

If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it into two shards? This https://stackoverflow.com/a/32256100 seems to say that it’s not possible to change this value after the index creation, is it true?

Well, as long as you're using a single node, it should indeed alleviate the shard disk size limit, but I'm not sure ES will handle that too well. In any case, you can't change that for existing indices. What you can do is modify the mapping template and reindex the existing index: since an index can't be renamed, you'll need to reindex to another name, delete the original, and create an alias from the original name to the new index.
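The reindex-and-alias procedure described above can be sketched as an ordered list of Elasticsearch REST calls. This is only a minimal sketch: the index names are hypothetical placeholders, and it assumes the new index name still matches the (already modified) mapping template so that it picks up the right mappings. The function only builds the requests; sending them, and checking that the reindex finished and the document counts match before deleting anything, is left to you.

```python
import json


def reindex_plan(old_index, new_index, shards):
    """Return the ordered REST calls as (method, path, body) tuples.

    Nothing is sent to Elasticsearch here; the caller decides how and
    when to execute each step.
    """
    return [
        # 1. Create the target index with the desired number of shards.
        ("PUT", f"/{new_index}",
         {"settings": {"index": {"number_of_shards": shards}}}),
        # 2. Copy all documents from the old index into the new one.
        ("POST", "/_reindex",
         {"source": {"index": old_index}, "dest": {"index": new_index}}),
        # 3. Delete the original index (only after step 2 has completed
        #    and the document counts have been verified).
        ("DELETE", f"/{old_index}", None),
        # 4. Point the original name at the new index through an alias.
        ("POST", "/_aliases",
         {"actions": [{"add": {"index": new_index, "alias": old_index}}]}),
    ]


if __name__ == "__main__":
    # Hypothetical index names, for illustration only.
    for method, path, body in reindex_plan("events-example", "events-example-v2", 2):
        print(method, path, json.dumps(body) if body is not None else "")
```

Stopping the apiserver while these steps run is probably wise, since documents indexed between steps 2 and 3 would otherwise be lost.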

Also what is the benefit of having by default index.number_of_shards = 1 for the metrics and the logs indices? Having more allows to scale and later move them in separate nodes if needed - the default heap size being 2Gb, it should be possible, or?

Well, as long as you use a single node, multiple shards offer no scale improvement

  				
Posted 3 years ago by SuccessfulKoala55

Well, as long as you’re using a single node, it should indeed alleviate the shard disk size limit, but I’m not sure ES will handle that too well. In any case, you can’t change that for existing indices, you can modify the mapping template and reindex the existing index (you’ll need to index to another name, delete the original and create an alias to the original name as the new index can’t be renamed...)

Ok thanks!

Well, as long as you use a single node, multiple shards offer no scale improvement

But you can move shards from one node to another much more easily than changing the number of shards of an index, this is an advantage

  				
Posted 3 years ago by JitteryCoyote63

SuccessfulKoala55 I am looking for ways to free some space and I have the following questions:
Is there a way to break down all the documents to identify the biggest ones?
Is there a way to delete several :monitor:gpu and :monitor:machine time series?
Is there a way to downsample some time series (e.g. loss)?

  				
Posted 3 years ago by JitteryCoyote63

Is there a way to break-down all the document to identify the biggest ones?

In case of scalars, they're all roughly the same, it's only a matter of which task reported more, so an aggregation by task_id would help you in figuring out which tasks are more costly
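A per-task aggregation like the one suggested above could look roughly like this. It is a sketch under assumptions: the name of the field holding the task ID is not given in this thread, so it is a required parameter, and the value used in the example should be checked against a real document first. The body is meant for a `_search` request on the scalar events index.

```python
def docs_per_task_query(task_field, top_n=20):
    """Build a _search body that counts documents per task.

    size=0 suppresses the hits themselves; a terms aggregation returns
    its buckets ordered by document count (largest first) by default.
    """
    return {
        "size": 0,
        "aggs": {"by_task": {"terms": {"field": task_field, "size": top_n}}},
    }


def biggest_tasks(response):
    """Extract (task, document count) pairs from the _search response."""
    buckets = response["aggregations"]["by_task"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


if __name__ == "__main__":
    # "task" is an assumed field name; inspect one document to confirm it.
    print(docs_per_task_query("task", top_n=10))
```

Since scalar documents are roughly the same size, the document count per task is a reasonable proxy for how much space each task takes.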

Is there a way to delete several :monitor:gpu and :monitor:machine time series?

Yes, these contain specific metric and variant document fields (you can look at a single document to figure out what they are), so an ES _delete_by_query request can be used to remove all documents containing these scalars. Remember, however, that _delete_by_query is performance-intensive, so it will probably take much more time than simply deleting documents.
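As a rough illustration of such a `_delete_by_query` request, the sketch below builds the method, path and body without sending anything. The field name `metric` follows the description above but should be confirmed on a real document (a `terms` query also assumes it is indexed as an exact-match keyword field), and the index name in the example is a hypothetical placeholder.

```python
def delete_metrics_request(index, metrics, metric_field="metric"):
    """Build a _delete_by_query call removing all documents whose
    metric field equals one of the given metric names."""
    body = {"query": {"terms": {metric_field: list(metrics)}}}
    # conflicts=proceed keeps the operation going if a document changes
    # while the delete is running, instead of aborting on the first
    # version conflict.
    path = f"/{index}/_delete_by_query?conflicts=proceed"
    return "POST", path, body


if __name__ == "__main__":
    # Hypothetical index name, for illustration only.
    print(delete_metrics_request("events-example",
                                 [":monitor:gpu", ":monitor:machine"]))
```

Running the same query body against `_count` first is a cheap way to see how many documents would be removed.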

Is there a way to downsample some time series (eg. loss)?

Well, in this context, down-sampling a specific time-series is either:
Removing specific documents from that series, OR
Reading all series documents in a script, down-sampling in memory, writing new documents for the new values and deleting old documents (either by query or by ID)
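The in-memory part of both options can be sketched in plain Python. This is only an illustration: reading the documents from ES and writing or deleting them is left out, the function names are made up for this example, and averaging fixed-size buckets is just one possible down-sampling strategy.

```python
from statistics import fmean


def split_keep_delete(doc_ids, keep_every):
    """Option 1: keep every `keep_every`-th document and delete the rest.

    `doc_ids` must be ordered by iteration. Returns (keep, delete).
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    keep = doc_ids[::keep_every]
    delete = [d for i, d in enumerate(doc_ids) if i % keep_every != 0]
    return keep, delete


def bucket_average(points, bucket_size):
    """Option 2: replace each run of `bucket_size` points with one point.

    `points` is a list of (iteration, value) pairs ordered by iteration.
    Each output point carries the last iteration of its bucket and the
    mean of the bucket's values.
    """
    if bucket_size < 1:
        raise ValueError("bucket_size must be >= 1")
    result = []
    for start in range(0, len(points), bucket_size):
        bucket = points[start:start + bucket_size]
        result.append((bucket[-1][0], fmean([value for _, value in bucket])))
    return result


if __name__ == "__main__":
    loss = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0)]
    print(bucket_average(loss, 2))
    print(split_keep_delete(["a", "b", "c", "d", "e"], 2))
```

Option 1 only needs deletes, so it is the cheaper of the two; option 2 smooths the series but requires writing new documents before the old ones are removed.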

  				
Posted 3 years ago by SuccessfulKoala55

Thanks! With this I’ll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)

  				
Posted 3 years ago by JitteryCoyote63
