Ok, I guess I'll just delete the whole loss series. Thanks!
But I would need to reindex everything, right? Is that an expensive operation?
This https://stackoverflow.com/questions/65109764/wildcard-search-issue-with-long-datatype-in-elasticsearch says long types can be converted to string to do the search
There is no way to filter on long types? I can't believe it
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
From what I can find there's a prefix query, but not a suffix - this can be done using a regex or a wildcard, but that's relatively expensive
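One option not raised in the thread is a script query, which tests the numeric value directly instead of matching on a string suffix. The sketch below only builds a request body that could be sent to `_delete_by_query`. It assumes `iter` is a `long` field with doc values; the helper name `downsample_delete_body` is hypothetical. Script queries run per document, so this may also be slow on a large index, and it is probably worth sending the same body to `_count` first to check what would be removed.

```python
import json


def downsample_delete_body(variant, keep_every=10):
    """Build a request body matching the documents to drop.

    Hypothetical helper: matches documents of the given variant whose
    `iter` is NOT a multiple of `keep_every`, so deleting the matches
    keeps every `keep_every`-th iteration.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"variant": variant}}],
                "filter": [
                    {
                        "script": {
                            "script": {
                                "lang": "painless",
                                # Assumes `iter` is a numeric field with doc values.
                                "source": "doc['iter'].value % params.n != 0",
                                "params": {"n": keep_every},
                            }
                        }
                    }
                ],
            }
        }
    }


body = downsample_delete_body("loss_model")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the `-d` payload of a curl call against the index, in the same way as the `_search` requests in this thread.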
I guess using a delete by query with a match on the field value suffix or something similar?
Hmm, that's something I don't know
Now I know which experiments have the most metrics. I want to downsample these metrics by 10, i.e. only keep iterations that are multiples of 10. How can I query (to delete) only the documents ending with 0?
Why not do: curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "bool": { "must": [ { "match": { "variant": "loss_model" } } ] } }, "aggs": { "tasks": { "terms": { "field": "task" } } } } '
For all tasks?
Why do you do aggs on the "iter" field?
Here I have to do it for each task, is there a way to do it for all tasks at once?
Something like that? curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "bool": { "must": [ { "match": { "variant": "loss_model" } }, { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } } ] } }, "aggs": { "series": { "terms": { "field": "iter" } } } } '
Just do a sub-aggregation for the metric field (and if you'd like more detail, a sub-sub-aggregation for the variant field)
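A minimal sketch of that nesting, written as a Python dict that serializes to the `_search` body. It assumes `task`, `metric` and `variant` are keyword fields in the events index; the helper name and the aggregation names (`tasks`, `metrics`, `variants`) are made up for illustration.

```python
import json


def nested_terms_body(size=50):
    """Build a terms aggregation per task, then per metric, then per variant.

    Hypothetical helper: `size` is the number of buckets requested at
    each level; "size": 0 at the top suppresses the hits themselves.
    """
    return {
        "size": 0,
        "aggs": {
            "tasks": {
                "terms": {"field": "task", "size": size},
                "aggs": {
                    "metrics": {
                        "terms": {"field": "metric", "size": size},
                        "aggs": {
                            "variants": {
                                "terms": {"field": "variant", "size": size}
                            }
                        },
                    }
                },
            }
        },
    }


print(json.dumps(nested_terms_body(), indent=2))
```

Each bucket in the response carries a `doc_count`, so the heaviest task/metric/variant combinations can be read straight off the result.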
SuccessfulKoala55 Thanks to that I was able to identify the most expensive experiments. How can I count the number of documents for a specific series? I.e., I suspect that the loss, which is logged every iteration, is responsible for most of the documents logged, and I want to confirm that
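For a plain document count, the `_count` endpoint accepts the same `query` object as `_search`. The sketch below builds such a body; the helper name `series_count_body` is hypothetical, and the `variant` and `task` field names are taken from the queries elsewhere in this thread.

```python
import json


def series_count_body(variant, task=None):
    """Build a _count body for one series, optionally limited to one task."""
    must = [{"match": {"variant": variant}}]
    if task is not None:
        must.append({"match": {"task": task}})
    return {"query": {"bool": {"must": must}}}


# Count every "loss_model" document in the index, across all tasks.
print(json.dumps(series_count_body("loss_model")))
```

Posting the output to `<index>/_count` instead of `<index>/_search` should return a single `count` value rather than hits.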
You might need to specify number of buckets if you don't get all of the experiments, but since it's a single shard, I think it'll be ordered by descending bucket size anyway
Yeah, should be this: GET /_search { "aggs": { "tasks": { "terms": { "field": "task" } } } }
See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
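To check whether the terms aggregation returned every experiment, the response can be inspected: a terms aggregation reports `sum_other_doc_count`, the number of documents that fell outside the returned buckets. A small sketch, assuming the aggregation is named `tasks` as in the query above; the helper name and the sample response values are made up for illustration.

```python
def summarize_terms(response, agg_name="tasks"):
    """Return (buckets sorted by doc_count descending, documents not covered)."""
    agg = response["aggregations"][agg_name]
    buckets = sorted(
        ((b["key"], b["doc_count"]) for b in agg["buckets"]),
        key=lambda kv: kv[1],
        reverse=True,
    )
    return buckets, agg.get("sum_other_doc_count", 0)


# Illustrative response fragment with made-up task ids and counts.
sample = {
    "aggregations": {
        "tasks": {
            "sum_other_doc_count": 1200,
            "buckets": [
                {"key": "task-a", "doc_count": 300},
                {"key": "task-b", "doc_count": 9000},
            ],
        }
    }
}

buckets, missing = summarize_terms(sample)
for key, count in buckets:
    print(key, count)
if missing:
    print(f"{missing} documents are in buckets that were not returned; raise 'size'")
```

A non-zero remainder suggests raising the `size` parameter of the terms aggregation until it drops to zero.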
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only removed ~20M documents out of 320M in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents so I can delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
but the issue was the shard not being active, it's not the number of documents
Maybe the agent could be adapted to have a max_batch_size parameter?
So it looks like it tries to register a batch of 500 documents
well I still see some ES errors in the logs: clearml-apiserver | [2021-07-07 14:02:17,009] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 65750ms, msg=General data error: err=('500 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'c2068648d2fe5da975665985f44c20b6', 'status':..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [500] requests and a refresh]
however 504 is very extreme, I'm not sure it's related to the timeout on the server side, you might want to increase the ELB timeout