
Hi everyone,
I was working with model serving and monitoring, and wanted to know about monitoring aspects/usage in serving.
I actually wanted to know about exactly what all queries related to the serving can be done, like what all are important metric monitoring queries w.r.t. the serving tasks that can be visualized and shown in grafana?
I saw the asteroid example (that was used to detect dataset drift) in the clearml-blogs repository.
Can anyone please help and explain some other useful metric/queries regarding serving a general ML model?

  
  
Posted one year ago

Answers 18


Hi DashingAlligator35, did you run some of the serving examples?

  
  
Posted one year ago

yeah, I ran the example given in the docs as well as the one given in their asteroid blog repo.

  
  
Posted one year ago

But it's not clear which other queries and metrics can or should be considered for serving tasks.

  
  
Posted one year ago

You can add basically whatever you want using
clearml-serving metrics add ...

  
  
Posted one year ago

So this allows us to define buckets for the histogram distribution, as in the monitoring example in the docs. But apart from that, what exactly can we add? E.g. I want to view the feature value distribution over an interval, and the baseline distribution of the training and test sets. How can I do that with the CLI tool, or do I need to make changes to the original serving code?

  
  
Posted one year ago

like what all are important metric monitoring queries w.r.t. the serving tasks that can be visualized and shown in grafana?

Basically, latency and requests per minute are reported automatically. Additional reports are based on your REST API inputs and outputs.
Imagine the following REST API request JSON payload

{"x": 123, "y": 456}

and a return JSON of

{"z": 789}

The metrics you can add to the monitoring are the keys of both these JSONs, i.e. "x", "y", "z".
These metrics can be logged either as plain values (i.e. time series values, scalars) or as histograms over time, i.e. for each window (e.g. 30 sec), the number of times x fell into a specific value bucket.
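To make the histogram idea concrete, here is a minimal Python sketch of the bucketing described above. It is only an illustration of the concept, not clearml-serving's actual implementation; the function name, the sample values, and the bucket bounds are all made up for the example. Note that it keeps plain per-bucket counts, whereas Prometheus histogram buckets are cumulative.

```python
from bisect import bisect_left
from collections import defaultdict


def window_histograms(samples, bucket_bounds, window_sec=30):
    """Count, per time window, how many values fell into each bucket.

    samples: iterable of (timestamp_seconds, value) pairs (hypothetical data)
    bucket_bounds: sorted upper bounds; anything above the last bound
                   lands in an extra overflow bucket at the end
    """
    counts = defaultdict(lambda: [0] * (len(bucket_bounds) + 1))
    for ts, value in samples:
        # Start of the window this sample belongs to
        window_start = int(ts // window_sec) * window_sec
        # Index of the first bound that is >= value
        idx = bisect_left(bucket_bounds, value)
        counts[window_start][idx] += 1
    return dict(counts)


# Hypothetical values of "x" from five requests, as (seconds, value)
samples = [(0, 120), (5, 123), (12, 131), (31, 99), (45, 123)]
hist = window_histograms(samples, bucket_bounds=[100, 125, 150])
print(hist)
```

Each key of the result is the start of a 30-second window, and each list holds the count for the buckets `<=100`, `<=125`, `<=150`, and overflow.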
Make sense ?

  
  
Posted one year ago

I understood this, but I still have a few doubts. For example, what would be the exact query for requests per second, given an endpoint?
Also, for the example you gave, I got the query up and running. Let's say I want a query to get the feature value distribution (x and y in your example) over some duration of time. What should the query be? I tried endpoint:x_bucket{"+inf"}[$duration]/endpoint:x_sum{"+inf"}[$duration] and some other variations, but couldn't get it right. Can you help?

  
  
Posted one year ago

Like what would be the exact query given an endpoint, for requests per sec.

You mean in Grafana ?

  
  
Posted one year ago

Yes, Grafana or Prometheus (a PromQL query).

  
  
Posted one year ago

A few examples here:

Grafana model performance example:

    browse to 

    log in with: admin/admin
    create a new dashboard
    select Prometheus as data source
    Add a query: 100 * increase(test_model_sklearn:_latency_bucket[1m]) / increase(test_model_sklearn:_latency_sum[1m])
    Change the type to heatmap, and on the right-hand side, under "Data Format", select "Time series buckets"
    You now have the latency distribution over time.
    Repeat the same process for x0; the query would be 100 * increase(test_model_sklearn:x0_bucket[1m]) / increase(test_model_sklearn:x0_sum[1m])
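Since the two queries above differ only in the metric name, the pattern can be sketched as a small Python helper that builds the same PromQL string for any endpoint and metric, plus a URL for Prometheus's instant-query HTTP endpoint (`/api/v1/query`). The helper names are made up for this example, and `http://localhost:9090` is only a placeholder assumption for the Prometheus address; use whatever address your deployment exposes.

```python
from urllib.parse import urlencode


def bucket_share_query(endpoint, metric, window="1m"):
    """Build the per-bucket query from the steps above for one metric."""
    series = f"{endpoint}:{metric}"
    return (
        f"100 * increase({series}_bucket[{window}])"
        f" / increase({series}_sum[{window}])"
    )


def instant_query_url(base_url, promql):
    """URL for a Prometheus instant query; base_url is a placeholder."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


latency_q = bucket_share_query("test_model_sklearn", "_latency")
x0_q = bucket_share_query("test_model_sklearn", "x0")
# Assumed address, not taken from the thread
url = instant_query_url("http://localhost:9090", x0_q)
print(latency_q)
print(x0_q)
print(url)
```

The printed query strings can be pasted into the Grafana query box as-is; the URL form is only useful if you want to pull the same data from a script.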

And docs:
None
None

  
  
Posted one year ago

Ok, I'll look into this.
Thanks.

  
  
Posted one year ago

Well, I read this, but it is the same as what I had done before.
The query here gives the percentage of input data in each bucket over a period of time.
But my previous ques and other query are still not figured out.

  
  
Posted one year ago

But my previous ques and other query are still not figured out.

What do you mean by "previous ques and other query" ?

  
  
Posted one year ago

The one where I asked about the query for the feature value distribution over time, which can be shown in Prometheus and Grafana using the metrics that Prometheus currently scrapes from clearml-statistics.

  
  
Posted one year ago

feature value distribution over time

You mean how to create this chart? None

  
  
Posted one year ago

Yeah

  
  
Posted one year ago

These instructions should create the exact chart:
None
What am I missing ?

  
  
Posted one year ago

Agreed with your answer. I mistook the example query in the tutorial for something other than the feature distribution over time.
My next question: what other relevant queries can we visualize in Grafana to help monitor the served model and the end user? For example, can we build a query for K-L divergence from the available metrics (the ones Prometheus scrapes from clearml-serving-statistics), and if so, what is the exact query? Also, what query would give the baseline input data distribution (not the payload users send in their endpoint requests, but the original dataset the model was trained on)?
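For reference, the K-L divergence itself is simple to compute once you have two histograms over the same buckets. The sketch below does it offline in plain Python; it is not a PromQL query and not something the thread confirms clearml-serving provides. The bucket counts are hypothetical: the "served" counts would have to be exported from the scraped `_bucket` series (converted to per-bucket counts), and the "baseline" counts computed separately from the training data using the same bucket edges.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in nats for two histograms sharing the same bucket edges.

    p_counts: per-bucket counts of the distribution being checked
    q_counts: per-bucket counts of the reference (baseline) distribution
    eps: floor for Q's probabilities, so empty baseline buckets do not
         cause a division by zero (a simple smoothing assumption)
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must use the same buckets")
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    if p_total == 0 or q_total == 0:
        raise ValueError("histograms must not be empty")
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = p_c / p_total
        q = max(q_c / q_total, eps)
        if p > 0:  # buckets with p == 0 contribute nothing
            kl += p * math.log(p / q)
    return kl


# Hypothetical per-bucket counts for one feature
baseline_counts = [10, 40, 40, 10]  # from the training set
served_counts = [5, 20, 45, 30]     # from recent requests
print(kl_divergence(served_counts, baseline_counts))
```

A value of 0 means the two histograms match exactly; larger values mean the served inputs have drifted further from the baseline.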

  
  
Posted one year ago