Hi again, my ClearML api-server has a memory leak: each time I restart it, its RAM consumption grows until it hits OOM; the process is not killed, and the EC2 instance crashes.

  
  
Posted 3 years ago

Answers 30


However, a 504 is very extreme. I'm not sure it's related to the timeout on the server side; you might want to increase the ELB timeout.
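For an ALB/ELBv2, the idle timeout can be raised with the AWS CLI; a hedged sketch (the load balancer ARN is a placeholder, and 300s is an arbitrary value, not something from the thread):
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <your-load-balancer-arn> \
  --attributes Key=idle_timeout.timeout_seconds,Value=300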

  
  
Posted 3 years ago

more than 120s?

  
  
Posted 3 years ago

well I still see some ES errors in the logs
clearml-apiserver | [2021-07-07 14:02:17,009] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 65750ms, msg=General data error: err=('500 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'c2068648d2fe5da975665985f44c20b6', 'status':..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [500] requests and a refresh]
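The "primary shard is not active" part means Elasticsearch rejected the bulk insert because the target shard was unassigned. A quick way to check, using standard Elasticsearch APIs (localhost:9200 is an assumption based on the default docker-compose setup):
curl "localhost:9200/_cluster/health?pretty"
curl "localhost:9200/_cat/shards?v"
UNASSIGNED shards here usually point at disk watermarks or memory pressure on the ES node.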

  
  
Posted 3 years ago

but not as much as the ELB reports

  
  
Posted 3 years ago

So it looks like it tries to register a batch of 500 documents

  
  
Posted 3 years ago

Maybe the agent could be adapted to have a max_batch_size parameter?

  
  
Posted 3 years ago

500 is relatively low...

  
  
Posted 3 years ago

but the issue was the shard not being active; it's not the number of documents

  
  
Posted 3 years ago

SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only removed ~20M of the 320M documents in events-training_debug_image-xyz. I would now like to understand which experiments contain most of the documents, so that I can delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?

  
  
Posted 3 years ago

Yeah, should be this:
GET /_search
{
  "aggs": {
    "tasks": {
      "terms": { "field": "task" }
    }
  }
}
See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

  
  
Posted 3 years ago

You might need to specify number of buckets if you don't get all of the experiments, but since it's a single shard, I think it'll be ordered by descending bucket size anyway
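A sketch with an explicit bucket count (the size value of 1000 is an arbitrary upper bound, not something from the thread):
GET /_search
{
  "aggs": {
    "tasks": {
      "terms": { "field": "task", "size": 1000 }
    }
  }
}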

  
  
Posted 3 years ago

Thanks a lot, I will play with that!

  
  
Posted 3 years ago

SuccessfulKoala55 Thanks to that I was able to identify the most expensive experiments. How can I count the number of documents for a specific series? I.e., I suspect that the loss, which is logged every iteration, is responsible for most of the documents, and I want to make sure of that.

  
  
Posted 3 years ago

Just do a sub-aggregation for the metric field (and if you like more details, a sub-sub aggregation for the variant field)
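A sketch of those nested aggregations, reusing the scalar-events index and the metric/variant field names that appear elsewhere in this thread:
GET /events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search
{
  "size": 0,
  "aggs": {
    "tasks": {
      "terms": { "field": "task" },
      "aggs": {
        "metrics": {
          "terms": { "field": "metric" },
          "aggs": {
            "variants": {
              "terms": { "field": "variant" }
            }
          }
        }
      }
    }
  }
}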

  
  
Posted 3 years ago

Something like that?
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } },
        { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } }
      ]
    }
  },
  "aggs": {
    "series": {
      "terms": { "field": "iter" }
    }
  }
}'

  
  
Posted 3 years ago

Here I have to do it for each task, is there a way to do it for all tasks at once?

  
  
Posted 3 years ago

Why do you do aggs on the "iter" field?

  
  
Posted 3 years ago

Why not do:
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } }
      ]
    }
  },
  "aggs": {
    "tasks": {
      "terms": { "field": "task" }
    }
  }
}'
For all tasks?

  
  
Posted 3 years ago

Ha nice, good one! Thanks!

  
  
Posted 3 years ago

Now I know which experiments have the most metrics. I want to downsample these metrics by 10, i.e. only keep iterations that are multiples of 10. How can I query (to delete) only the documents whose iteration does not end with 0?

  
  
Posted 3 years ago

Hmm, that's something I don't know 🙂

  
  
Posted 3 years ago

I guess using a delete by query with a match on the field value suffix or something similar?

  
  
Posted 3 years ago

From what I can find there's a prefix query, but not a suffix - this can be done using a regex or a wildcard, but that's relatively expensive
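A hypothetical sketch of that wildcard form (delete_by_query and wildcard are standard Elasticsearch APIs; the must_not keeps every iteration ending in 0, i.e. the multiples of 10):
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must_not": [
        { "wildcard": { "iter": "*0" } }
      ]
    }
  }
}'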

  
  
Posted 3 years ago

"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"

  
  
Posted 3 years ago

Same for regexp, damn

  
  
Posted 3 years ago

There is no way to filter on long types? I can't believe it

  
  
Posted 3 years ago

This https://stackoverflow.com/questions/65109764/wildcard-search-issue-with-long-datatype-in-elasticsearch says long types can be converted to string to do the search
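A minimal sketch of what that could look like with the _reindex API, assuming a destination index (name made up here) whose mapping adds an iter_str keyword field:
curl "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b" },
  "dest": { "index": "events-training_stats_scalar-reindexed" },
  "script": { "source": "ctx._source.iter_str = String.valueOf(ctx._source.iter)" }
}'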

  
  
Posted 3 years ago

But I would need to reindex everything, right? Is that an expensive operation?

  
  
Posted 3 years ago

Reindex is very expensive 🙂

  
  
Posted 3 years ago

Ok, I guess I'll just delete the whole loss series. Thanks!

  
  
Posted 3 years ago