Hi DashingAlligator35, did you run some of the serving examples?
yeah, I ran the example given in the docs as well as the one given in their asteroid blog repo.
But it's not clear what other queries and metrics can/should be considered for serving tasks.
You can add basically whatever you want using clearml-serving metrics add ...
So this lets us define buckets for the histogram distribution, as in the monitoring example in the docs, but beyond that what exactly can we add? E.g. I want to view the feature value distribution over an interval, and the baseline distribution of the training and test sets. Can I do that with the CLI tool, or do I need to make changes in the original serving code?
Like, what are the important metric-monitoring queries for serving tasks that can be visualized in Grafana?
Basically latency and requests per minute are automatically reported. Additional reports are based on your RestAPI in/out.
Imagine the following restapi request json payload
{"x": 123, "y": 456}
and a return json of
{"z": 789}
The metrics you can add to the monitoring are the keys on both these jsons, i.e. "x", "y", "z"
These metrics can be logged either as plain values (i.e. time-series values, scalars) or as histograms over time (i.e. per 30-sec window), counting the number of times x fell into a specific value bucket.
Make sense ?
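To make the bucketing idea concrete, here is a rough Python sketch of what a histogram metric does with the feature values seen in one time window (the bucket edges and values below are made up for illustration):

```python
# Illustrative sketch: count how many observations of a feature fall
# into each histogram bucket during one reporting window.
def bucket_counts(values, edges):
    """Counts per bucket; edges are upper bounds, last slot is "+Inf"."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        for i, edge in enumerate(edges):
            if v < edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # overflow bucket ("+Inf")
    return counts

# e.g. values of "x" observed in a 30-second window (hypothetical)
window = [0.2, 0.7, 1.5, 0.3, 2.4]
print(bucket_counts(window, [0.5, 1.0, 2.0]))  # [2, 1, 1, 1]
```

This is only a mental model of the histogram mechanism, not ClearML's actual implementation.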
I understood this, but I still have a few doubts. Like, what would be the exact query, given an endpoint, for requests per second?
Also, for the example you gave, I got the query up and running. Say I want a query to get the feature value (x and y in your example) distribution over some duration of time, then what should the query be? I tried endpoint:x_bucket{"+inf"}[$duration]/endpoint:x_sum{"+inf"}[$duration]
and some other variations, but couldn't get it right. Can you help?
Like what would be the exact query given an endpoint, for requests per sec.
You mean in Grafana ?
Ya grafana or Prometheus (promql query)
A few examples here:
Grafana model performance example:
browse to
login with: admin/admin
create a new dashboard
select Prometheus as data source
Add a query: 100 * increase(test_model_sklearn:_latency_bucket[1m]) / increase(test_model_sklearn:_latency_sum[1m])
Change the type to heatmap, and on the right-hand side under "Data Format" select "Time series buckets"
You now have the latency distribution, over time.
Repeat the same process for x0; the query would be 100 * increase(test_model_sklearn:x0_bucket[1m]) / increase(test_model_sklearn:x0_sum[1m])
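A rough Python sketch of the idea behind that bucket-ratio query: each bucket's growth over the window is divided by the window total, giving each bucket's share as a percentage (assuming the denominator approximates the total observations in the window; all numbers here are hypothetical):

```python
# Sketch of what the Grafana heatmap query computes per window:
# each bucket's increase as a percentage of the window total.
def bucket_percentages(bucket_increases, total):
    return [100 * b / total for b in bucket_increases]

# hypothetical increase() values for x0's buckets over 1 minute
print(bucket_percentages([5, 20, 15, 10], 50))  # [10.0, 40.0, 30.0, 20.0]
```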
Well, I read this, but it is the same as what I had done before.
The query here gives the percentage of input data in each bucket over a period of time.
But my previous question and the other query are still not figured out.
But my previous question and the other query are still not figured out.
What do you mean by "previous question and other query"?
The one where I asked about the query for the feature value distribution over time, i.e. a query that can be executed and shown in Prometheus and Grafana using the metrics currently being scraped by Prometheus from clearml-statistics.
feature value distribution over time
You mean how to create this chart?
These instructions should create the exact chart:
What am I missing ?
Agreed with your answer. I mistook the example query given in the tutorial for something other than the feature distribution over time.
My next question: what other relevant queries can we visualize (in Grafana) that would help in monitoring the served model and the end user? For example, can we have a query for K-L divergence from the available metrics (the ones Prometheus scraped from clearml-serving-statistics), and if yes, what is the exact query for that? Also, what query would get the baseline input data distribution (not the one given by the user as the payload in their endpoint request, but the original dataset on which the model was trained)?
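As far as I know PromQL has no built-in K-L divergence, so one option is to pull the per-bucket values from Prometheus (e.g. via its HTTP API) and compute the divergence against a baseline histogram externally. A hedged Python sketch, with entirely hypothetical bucket counts and simple smoothing to avoid empty-bucket division:

```python
import math

# Sketch: K-L divergence between a baseline (training-set) bucket
# distribution and a live one scraped from Prometheus.
# Bucket counts are hypothetical; eps smooths empty buckets.
def kl_divergence(p_counts, q_counts, eps=1e-9):
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = p_c / p_total + eps
        q = q_c / q_total + eps
        kl += p * math.log(p / q)
    return kl

baseline = [10, 40, 30, 20]   # e.g. training-set histogram of feature x
live = [12, 35, 33, 20]       # e.g. increase(x_bucket[...]) values
print(round(kl_divergence(baseline, live), 4))  # 0.0066
```

The baseline distribution itself would have to come from your training data (computed once, offline), since Prometheus only sees what the serving endpoint reports.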