Hi DashingAlligator35, did you run some of the serving examples?
yeah, I ran the example given in the docs as well as the one given in their asteroid blog repo.
But it's not clear what other queries and metrics can/should be considered for serving tasks.
You can add basically whatever you want using clearml-serving metrics add ...
So this lets us define buckets for the histogram distribution, as in the monitoring example in the docs, but beyond that what exactly can we add? E.g. I want to view the feature value distribution over an interval, and the baseline distribution of the training and test sets. Can I do that with the CLI tool, or do I need to make changes in the original serving code?
Like, what are the important metric-monitoring queries for serving tasks that can be visualized in Grafana?
Basically latency and requests per minute are automatically reported. Additional reports are based on your RestAPI in/out.
Imagine the following restapi request json payload
{"x": 123, "y": 456}
and a return json of
{"z": 789}
The metrics you can add to the monitoring are the keys on both these jsons, i.e. "x", "y", "z"
These metrics can be logged either as plain values (i.e. time-series values, scalars) or as histograms over time (i.e. per 30-sec window), counting the number of times x fell into a specific value bucket.
Make sense ?
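To make the bucketing idea concrete, here is a rough Python sketch of what a histogram metric does with the feature values seen in one time window (the bucket edges and values below are made up for illustration):

```python
# Illustrative sketch: count how many observations of a feature fall
# into each histogram bucket during one reporting window.
def bucket_counts(values, edges):
    """Counts per bucket; edges are upper bounds, last slot is "+Inf"."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        for i, edge in enumerate(edges):
            if v < edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # overflow bucket ("+Inf")
    return counts

# e.g. values of "x" observed in a 30-second window (hypothetical)
window = [0.2, 0.7, 1.5, 0.3, 2.4]
print(bucket_counts(window, [0.5, 1.0, 2.0]))  # [2, 1, 1, 1]
```

This is only a mental model of the histogram mechanism, not ClearML's actual implementation.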
I understood this, but I still have a few doubts. Like, what would be the exact query, given an endpoint, for requests per second?
Also, for the example you gave, I got the query up and running. Say I want a query to get the feature value (x and y in your example) distribution over some duration of time, then what should the query be? I tried endpoint:x_bucket{"+inf"}[$duration]/endpoint:x_sum{"+inf"}[$duration]
and some other variations, but couldn't get it right. Can you help?
Like what would be the exact query given an endpoint, for requests per sec.
You mean in Grafana ?
Ya grafana or Prometheus (promql query)
A few examples here:
Grafana model performance example:
browse to
login with: admin/admin
create a new dashboard
select Prometheus as data source
Add a query: 100 * increase(test_model_sklearn:_latency_bucket[1m]) / increase(test_model_sklearn:_latency_sum[1m])
Change the type to heatmap, and on the right-hand side under "Data Format" select "Time series buckets"
You now have the latency distribution, over time.
Repeat the same process for x0; the query would be 100 * increase(test_model_sklearn:x0_bucket[1m]) / increase(test_model_sklearn:x0_sum[1m])
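A rough Python sketch of the idea behind that bucket-ratio query: each bucket's growth over the window is divided by the window total, giving each bucket's share as a percentage (assuming the denominator approximates the total observations in the window; all numbers here are hypothetical):

```python
# Sketch of what the Grafana heatmap query computes per window:
# each bucket's increase as a percentage of the window total.
def bucket_percentages(bucket_increases, total):
    return [100 * b / total for b in bucket_increases]

# hypothetical increase() values for x0's buckets over 1 minute
print(bucket_percentages([5, 20, 15, 10], 50))  # [10.0, 40.0, 30.0, 20.0]
```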
Well, I read this, but it is the same as what I had done before.
The query here gives the percentage of input data in each bucket over a period of time.
But my previous question and the other query are still not figured out.
But my previous question and the other query are still not figured out.
What do you mean by "previous question and other query"?
The one where I asked about the query for the feature value distribution over time, i.e. a query that can be executed and shown in Prometheus and Grafana using the metrics currently being scraped by Prometheus from clearml-statistics.
feature value distribution over time
You mean how to create this chart?
These instructions should create the exact chart:
What am I missing ?
Agreed with your answer. I mistook the example query given in the tutorial for something other than the feature distribution over time.
My next question: what other relevant queries can we visualize (in Grafana) that would help in monitoring the served model and the end user? For example, can we have a query for K-L divergence from the available metrics (the ones Prometheus scraped from clearml-serving-statistics), and if yes, what is the exact query for that? Also, what query would get the baseline input data distribution (not the one given by the user as the payload in their endpoint request, but the original dataset on which the model was trained)?
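As far as I know PromQL has no built-in K-L divergence, so one option is to pull the per-bucket values from Prometheus (e.g. via its HTTP API) and compute the divergence against a baseline histogram externally. A hedged Python sketch, with entirely hypothetical bucket counts and simple smoothing to avoid empty-bucket division:

```python
import math

# Sketch: K-L divergence between a baseline (training-set) bucket
# distribution and a live one scraped from Prometheus.
# Bucket counts are hypothetical; eps smooths empty buckets.
def kl_divergence(p_counts, q_counts, eps=1e-9):
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = p_c / p_total + eps
        q = q_c / q_total + eps
        kl += p * math.log(p / q)
    return kl

baseline = [10, 40, 30, 20]   # e.g. training-set histogram of feature x
live = [12, 35, 33, 20]       # e.g. increase(x_bucket[...]) values
print(round(kl_divergence(baseline, live), 4))  # 0.0066
```

The baseline distribution itself would have to come from your training data (computed once, offline), since Prometheus only sees what the serving endpoint reports.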