So I see two options:
- Reducing the number of images reported (already in our plan)
- Making one big image per epoch
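For the "one big image per epoch" option, here is a minimal sketch of tiling several equally sized images into a single grid. It uses plain nested lists so it runs without dependencies; `tile_images`, the `Image` alias and the `fill` value are hypothetical names for illustration, and real code would more likely do this with an array library.

```python
from typing import List

Image = List[List[int]]  # hypothetical grayscale image: rows of pixel values


def tile_images(images: List[Image], cols: int, fill: int = 0) -> Image:
    """Arrange equally sized images into one grid image, in row-major order."""
    if not images:
        return []
    height, width = len(images[0]), len(images[0][0])
    blank = [[fill] * width for _ in range(height)]
    # Pad with blank tiles so the last grid row is complete.
    padded = images + [blank] * (-len(images) % cols)
    grid: Image = []
    for start in range(0, len(padded), cols):
        row_of_images = padded[start:start + cols]
        for y in range(height):
            # Concatenate pixel row y of every image in this grid row.
            grid.append([px for img in row_of_images for px in img[y]])
    return grid


# Three 2x2 images tiled two per row; the fourth cell is filled with zeros.
tiles = [[[1, 1], [1, 1]], [[2, 2], [2, 2]], [[3, 3], [3, 3]]]
big_image = tile_images(tiles, cols=2)
```

The resulting grid could then be reported once per epoch under a single title/series pair instead of once per minibatch.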
I call it like this: `logger.clearml_logger.report_image(self.tag, f"{self.tag}_{iteration:0{pad}d}", epoch, image=image)`. `self.tag` is `train` or `valid`, and `iteration` is an int: the index of the minibatch within the epoch.
Issue opened on the clearml-server GitHub: https://github.com/allegroai/clearml-server/issues/89 . Thanks for your help.
I have 6 plots with one or two metrics each, but I have a lot of debug samples.
Hi SteadyFox10 , how many unique metrics and variants do you have in this task? We may be hitting some limit here
What do you use as the `title` and as the `series` for each image?
You're generating a huge number of variants (`series`) by using the iteration number
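To make the point about variants concrete, here is a rough sketch that counts the distinct title/series pairs produced by a naming pattern like the one in the thread. `count_variants` is a hypothetical helper, and the sizes (2 tags, 100 epochs, 100 minibatches) are assumptions loosely based on the numbers mentioned here; it does not model how the server actually counts buckets.

```python
from itertools import product


def count_variants(tags, n_epochs, n_batches, pad=3):
    """Count distinct (title, series) pairs when series is named after the minibatch index."""
    seen = set()
    for tag, _epoch, batch in product(tags, range(n_epochs), range(n_batches)):
        title = tag
        series = f"{tag}_{batch:0{pad}d}"  # same pattern as the call in the thread
        seen.add((title, series))
    return len(seen)


# Hypothetical sizes, roughly matching the numbers mentioned in the thread.
n_variants = count_variants(["train", "valid"], n_epochs=100, n_batches=100)
print(n_variants)  # prints 200
```

Under these assumptions the series count grows with the number of minibatches per epoch, which is the growth the advice above is trying to avoid.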
Something like 100 epochs, with at least 100 images reported per epoch.
We're planning to optimize the server code for these cases, but I would suggest using a more fixed set of title/series for your debug images
That's really hard to support using ES as it inflates the number of buckets in the aggregation used when trying to locate unique debug images
Thanks a lot, I'll check how to do this correctly.
I have made some changes in the code: `logger.clearml_logger.report_image(self.tag, f"{self.tag}_{epoch:0{pad}d}", iteration=iteration, image=image)`. The `epoch` range is 0-150 and the `iteration` range is 0-100. The error is still there:
General data error (TransportError(503, 'search_phase_execution_exception', 'Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.'))
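The error message itself points at the `search.max_buckets` cluster setting. As a stopgap only (it treats the symptom, not the naming problem), here is a hedged sketch that builds the Elasticsearch cluster-settings request without sending it. The URL `http://localhost:9200`, the value 20000 and the helper name `build_max_buckets_request` are assumptions; where the ClearML server's Elasticsearch actually listens depends on your deployment.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your Elasticsearch address


def build_max_buckets_request(limit, es_url=ES_URL):
    """Build (but do not send) a PUT request that raises search.max_buckets."""
    body = json.dumps({"persistent": {"search.max_buckets": limit}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_max_buckets_request(20000)  # 20000 is an arbitrary example value
# To actually apply the change you would call: urllib.request.urlopen(request)
```

Raising the limit makes the aggregation heavier on the server, so a smaller, fixed set of series is still the preferable fix.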
Could it be because of the combination of the scalar graphs and the debug samples?
I have 8 scalar graphs:
- 2 :monitor:{gpu|machine}: with 15k iterations
- 2 training_{metrics|loss} with 15k iterations
- the others with between 40 and 150 iterations each
SuccessfulKoala55 do you have any other suggestions? Did I do something wrong with my changes?
it's a matter of scale for the query that retrieves the data, not related to the amount of data
This is a run I made with the changes. As you can see, the iterations now go from 0 to 111, and in each of them I have images named train_{001|150}
Even simpler than a GitHub repo: this code reproduces the issue I have.
SuccessfulKoala55 feel free to roast my errors.
I'll try to write some code that reproduces this behavior and post it on GitHub, is that fine? That way you could check if I'm the problem (which is really likely) 😛
Can I still ask you to open a GitHub issue? stuff tends to get lost here, and I can't get to it today 😞
Reducing the number of images reported (already in our plan)
You don't actually need to reduce the number of images, just make sure the `series` parameter is consistent. Basically, you want to make sure that in every report (i.e. every iteration in which you're reporting) you have a fixed set of title/series values
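One possible reading of this advice, as a minimal sketch: keep every image but map the minibatch index onto a small fixed set of series names, and fold the rest of the index into the iteration number. `debug_image_key`, the `sample_NN` naming and `n_slots=8` are all hypothetical choices, not something the thread or the ClearML docs prescribe, and whether this alone clears the bucket error is untested here.

```python
import math


def debug_image_key(tag, epoch, batch_idx, n_batches, n_slots=8):
    """Map (epoch, minibatch) to (title, series, iteration) with at most n_slots series."""
    steps_per_epoch = math.ceil(n_batches / n_slots)
    slot = batch_idx % n_slots  # which of the fixed series this image goes to
    step = epoch * steps_per_epoch + batch_idx // n_slots  # unique per (epoch, group)
    return tag, f"sample_{slot:02d}", step


# Two hypothetical epochs of 100 minibatches each.
keys = [
    debug_image_key("train", epoch, batch, n_batches=100)
    for epoch in range(2)
    for batch in range(100)
]
unique_series = {series for _title, series, _step in keys}
```

The returned tuple would be passed as the title, series and iteration arguments of the same `report_image(...)` call shown earlier in the thread. The trade-off is that the iteration axis no longer equals the epoch number.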