Ok, I guess I'll just delete the whole loss series. Thanks!
Just do a sub-aggregation for the metric field (and if you'd like more detail, a sub-sub-aggregation for the variant field)
Something like that?

curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } },
        { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } }
      ]
    }
  },
  "aggs": {
    "series": {
      "terms": { "field": "iter" }
    }
  }
}'
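For reference, a minimal sketch of the metric/variant sub-aggregation described above, reusing the index and task id from the query above; the "metrics" and "variants" bucket names are arbitrary placeholders:

# sketch only: index/task id copied from the query above; "metrics"/"variants" bucket names are arbitrary
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" }
  },
  "aggs": {
    "metrics": {
      "terms": { "field": "metric" },
      "aggs": {
        "variants": {
          "terms": { "field": "variant" }
        }
      }
    }
  }
}'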
Here I have to do it for each task; is there a way to do it for all tasks at once?
Why not do:

curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } }
      ]
    }
  },
  "aggs": {
    "terms": { "field": "task" }
  }
}'
For all tasks?
From what I can find, there's a prefix query but not a suffix query - this can be done using a regex or a wildcard, but that's relatively expensive
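For example, a leading-wildcard query like the sketch below would match values ending in "0"; the index and field names here are placeholders, and it only works on keyword/text fields. The leading wildcard forces ES to examine many terms, which is why it's expensive:

# sketch: leading wildcards work only on keyword/text fields; index and field names are placeholders
curl "localhost:9200/events-training_stats_scalar-xyz/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "some_keyword_field": { "value": "*0" }
    }
  }
}'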
But I would need to reindex everything, right? Is that an expensive operation?
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
Well, I still see some ES errors in the logs:

clearml-apiserver | [2021-07-07 14:02:17,009] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 65750ms, msg=General data error: err=('500 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'c2068648d2fe5da975665985f44c20b6', 'status':..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [500] requests and a refresh]
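The "primary shard is not active" part suggests looking at shard allocation; a couple of standard diagnostics (index name taken from the error above):

# list the shards and their state for the index named in the error above
curl "localhost:9200/_cat/shards/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b?v"

# ask ES why an unassigned shard is not being allocated
curl "localhost:9200/_cluster/allocation/explain?pretty"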
Why do you do aggs on the "iter" field?
However, a 504 is very extreme; I'm not sure it's related to the timeout on the server side. You might want to increase the ELB timeout
Hmm, that's something I don't know
I guess using a delete by query with a match on the field value suffix or something similar?
Maybe the agent could be adapted to have a max_batch_size parameter?
Yeah, should be this:

GET /_search
{
  "aggs": {
    "tasks": {
      "terms": { "field": "task" }
    }
  }
}
See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
There is no way to filter on long types? I can't believe it
So it looks like it tries to register a batch of 500 documents
Now I know which experiments have the most metrics. I want to downsample these metrics by 10, i.e. only keep iterations that are multiples of 10. How can I query (to delete) only the documents ending with 0?
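One possible approach (just a sketch, not something tested here) is a delete-by-query with a script filter on the iter value; the index name, task id and variant below are placeholders, and the modulo condition should be flipped depending on whether the goal is to delete or to keep the iterations that are multiples of 10:

# sketch: index name, task id and variant are placeholders; adjust the modulo condition to match
# exactly what you want to delete
curl -X POST "localhost:9200/events-training_stats_scalar-xyz/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "task": "<task_id>" } },
        { "match": { "variant": "loss_model" } }
      ],
      "filter": {
        "script": {
          "script": "doc[\"iter\"].value % 10 != 0"
        }
      }
    }
  }
}'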
SuccessfulKoala55 Thanks to that I was able to identify the most expensive experiments. How can I count the number of documents for a specific series? I.e. I suspect that the loss, which is logged every iteration, is responsible for most of the documents logged, and I want to make sure of that
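A count query along these lines could confirm it (sketch only; the index name, task id and variant value are placeholders):

# sketch: counts documents for a single task/variant combination; names below are placeholders
curl "localhost:9200/events-training_stats_scalar-xyz/_count?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "task": "<task_id>" } },
        { "match": { "variant": "loss" } }
      ]
    }
  }
}'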
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only deleted ~20M documents out of 320M documents in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents, so that I can delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
You might need to specify the number of buckets if you don't get all of the experiments, but since it's a single shard, I think it'll be ordered by descending bucket size anyway
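For example (a sketch; the size value of 100 and the "tasks" bucket name are arbitrary):

GET /_search
{
  "size": 0,
  "aggs": {
    "tasks": {
      "terms": { "field": "task", "size": 100 }
    }
  }
}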
But the issue was the shard not being active, not the number of documents
This https://stackoverflow.com/questions/65109764/wildcard-search-issue-with-long-datatype-in-elasticsearch says long values can be converted to strings to do the search
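In that spirit, a script filter can convert the long value to a string at query time; this is only a sketch (it assumes String.valueOf is available in the Painless whitelist, and the index name is a placeholder), and a script filter over hundreds of millions of documents will be slow:

# sketch: converts the long iter value to a string at query time; slow on large indices
curl "localhost:9200/events-training_stats_scalar-xyz/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": "String.valueOf(doc[\"iter\"].value).endsWith(\"0\")"
        }
      }
    }
  }
}'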