I Have a Question Regarding the Deletion of Archived Experiments. Some of Them Can't Be Deleted and the Error Message Is

Answered

I have a question regarding the deletion of archived experiments. Some of them can't be deleted, and the error message is:
General data error (TransportError(503, 'search_phase_execution_exception', 'Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.'))
As I understand it, this is an error on the Elasticsearch side. Do you have any idea how to remove the experiment, if possible without changing the search.max_buckets setting on the Elasticsearch container?
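For reference only, since the question asks to avoid it: the error message names `search.max_buckets` as a cluster-level setting, and Elasticsearch exposes such settings through its `/_cluster/settings` endpoint. The sketch below only builds the request and does not send it. The URL `http://localhost:9200`, the value `20000`, and the helper name `build_max_buckets_request` are assumptions for illustration, not values from this thread.

```python
import json
import urllib.request


def build_max_buckets_request(es_url, max_buckets):
    """Build (but do not send) a PUT request that would raise search.max_buckets."""
    # "persistent" settings survive a cluster restart; the value is an assumption.
    body = json.dumps({"persistent": {"search.max_buckets": max_buckets}})
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/_cluster/settings",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and limit, chosen only for the example.
req = build_max_buckets_request("http://localhost:9200", 20000)
print(req.get_method(), req.full_url)
# To actually apply it, one would call urllib.request.urlopen(req).
```

Raising the limit only hides the symptom; it does not change how many buckets the failing query tries to create.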

  				
Posted 3 years ago
SteadyFox10


Answers 30

Hi SteadyFox10, how many unique metrics and variants do you have in this task? We may be hitting some limit here.

  				
AppetizingMouse58

I have 6 plots with one or two metrics each, but I have a lot of debug samples.

  				
SteadyFox10

Something like 100 epochs, with at least 100 images reported per epoch.

  				
SteadyFox10

What do you use as title and for the series for each image?

  				
SuccessfulKoala55

I call it like this:
`logger.clearml_logger.report_image(self.tag, f"{self.tag}_{iteration:0{pad}d}", epoch, image=image)`
`self.tag` is `train` or `valid`, and `iteration` is an int identifying the minibatch within the epoch.

  				
SteadyFox10

That's the issue...

  				
SuccessfulKoala55

You're generating a huge number of variants (series) by using the iteration number in the series name.

  				
SuccessfulKoala55

That's really hard to support using ES as it inflates the number of buckets in the aggregation used when trying to locate unique debug images
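To get a feel for the numbers involved, here is a rough sketch that counts how many distinct title/series pairs, and how many distinct title/series/iteration combinations, the naming scheme quoted earlier in the thread would produce. The figures of 100 epochs and 100 minibatches are the approximate ones given above, and the helper `summarize` is hypothetical. How the server's aggregation maps these onto Elasticsearch buckets is not stated in this thread, so the counts are only indicative.

```python
def summarize(tags, n_epochs, n_minibatches, pad=3):
    """Count distinct (title, series) pairs and (title, series, iteration) triples."""
    variants = set()
    events = set()
    for tag in tags:
        for epoch in range(n_epochs):
            for minibatch in range(n_minibatches):
                # Same naming as in the thread: the series embeds the minibatch index,
                # and the epoch is passed as the iteration.
                series = f"{tag}_{minibatch:0{pad}d}"
                variants.add((tag, series))
                events.add((tag, series, epoch))
    return len(variants), len(events)


n_variants, n_events = summarize(("train", "valid"), n_epochs=100, n_minibatches=100)
print(n_variants, n_events)  # 200 20000
```

Either count grows with the number of minibatches, which is consistent with the suggestion to keep the series names to a small, stable set.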

  				
SuccessfulKoala55

We're planning to optimize the server code for these cases, but I would suggest using a more fixed set of title/series for your debug images

  				
SuccessfulKoala55

So I see two options:
1. Reduce the number of images reported (already in our plan).
2. Make one big image per epoch.

  				
SteadyFox10

"Reducing the number of images reported (already in our plan)"

You don't actually need to reduce the number of images. Just make sure the series parameter is consistent: in every report (i.e. every iteration in which you're reporting), use the same fixed set of title/series values.
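One possible reading of this advice, as a minimal sketch: reuse the same small set of series names at every report and let only the `iteration` argument change. The helper `report_debug_images`, the `max_slots` cap, and the `sample_N` naming are assumptions made for illustration; `RecordingLogger` is a stand-in that mimics only the `report_image(title, series, iteration=..., image=...)` call shape used in this thread, so the example runs without a ClearML server. Whether this avoids the error is not confirmed in this thread.

```python
class RecordingLogger:
    """Stand-in for the real logger; records calls instead of uploading images."""

    def __init__(self):
        self.calls = []

    def report_image(self, title, series, iteration=None, image=None):
        self.calls.append((title, series, iteration))


def report_debug_images(logger, tag, epoch, images, max_slots=8):
    """Report images under series names that are identical at every epoch."""
    for slot, image in enumerate(images[:max_slots]):
        # The series depends only on the slot, never on the epoch or minibatch index.
        series = f"{tag}_sample_{slot}"
        logger.report_image(tag, series, iteration=epoch, image=image)


logger = RecordingLogger()
for epoch in range(3):
    # Placeholder objects stand in for real image arrays.
    report_debug_images(logger, "train", epoch, images=[object()] * 5, max_slots=4)

unique_series = {(title, series) for title, series, _ in logger.calls}
print(len(logger.calls), len(unique_series))  # 12 4
```

With the real logger, `logger.clearml_logger` from the snippet above would be passed in place of `RecordingLogger()`.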

  				
SuccessfulKoala55

Thanks a lot, I'll check how to do this correctly.

  				
SteadyFox10

Sure, let me know if I can help 🙂

  				
SuccessfulKoala55

I have made some changes in the code:
`logger.clearml_logger.report_image(self.tag, f"{self.tag}_{epoch:0{pad}d}", iteration=iteration, image=image)`
The epoch range is 0-150 and the iteration range is 0-100. The error is still there:
General data error (TransportError(503, 'search_phase_execution_exception', 'Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.'))
Could it be caused by the combination of the scalar graphs and the debug samples?
I have 8 scalar graphs:
2 :monitor:{gpu|machine}: with 15k iterations
2 training_{metrics|loss} with 15k iterations
the others with between 40 and 150 iterations each
SuccessfulKoala55, do you have any other suggestions? Did I do something wrong with my changes?

  				
SteadyFox10

150 x 100 is still larger than 10,000

  				
SuccessfulKoala55

It's a matter of scale for the query that retrieves the data, not related to the amount of data.

  				
SuccessfulKoala55

Oh, sorry

  				
SuccessfulKoala55

iteration has nothing to do with it

  				
SuccessfulKoala55

Are you using a fixed self.tag?

  				
SuccessfulKoala55

Yes, the tag is fixed.

  				
SteadyFox10

This is a run I made with the changes. As you can see, the iterations now go from 0 to 111, and in each of them I have images named train_{001|150}.

  				
SteadyFox10

That's strange... 😕

  				
SuccessfulKoala55

I'll try to write some code that reproduces this behavior and post it on GitHub. Is that fine? That way you can check whether I'm the problem (which is really likely) 😛

  				
SteadyFox10

I'd appreciate that 🙂

  				
SuccessfulKoala55

Even simpler than a GitHub repo: this code reproduces the issue I have.

  				
SteadyFox10

SuccessfulKoala55 feel free to roast my errors.

  				
SteadyFox10

Can I still ask you to open a GitHub issue? Stuff tends to get lost here, and I can't get to it today 😞

  				
SuccessfulKoala55

Ok fine.

  				
SteadyFox10

Is it better to open it on clearml or clearml-server?

  				
SteadyFox10

Issue opened on the clearml-server GitHub: https://github.com/allegroai/clearml-server/issues/89
Thanks for your help.

  				
SteadyFox10
