Answered
Hey, My Name Is Ido, And I Am A New Clearml User. My Goal Is To Monitor The Accuracy Of My Llm Outputs In Production. I Understand That I Can Log Each Iteration With A Binary Output (0 For Incorrect And 1 For Correct), But This Approach Makes The Visual G

Hey, my name is Ido, and I am a new ClearML user.
My goal is to monitor the accuracy of my LLM outputs in production. I understand that I can log each iteration with a binary output (0 for incorrect and 1 for correct), but this approach makes the visual graph less readable.
Is there a way to aggregate the results, such as defining an iteration as the accuracy of 100 samples, to improve the readability of the visual graph?

In general, what are the best practices for monitoring LLMs using ClearML?
Thanks!

  
  
Posted 5 months ago

Answers 5


@<1523701205467926528:profile|AgitatedDove14> Thanks! The only thing is that I prefer serving my models in-house and only performing the monitoring via ClearML. By the way, I saw there is a project dashboard app which might support the visualization I am looking for. Is it suitable for such a use case?

  
  
Posted 5 months ago

I prefer serving my models in-house and only performing the monitoring via ClearML.

clearml-serving is infrastructure for you to run models 🙂
To clarify, clearml-serving runs on your end (meaning this is not SaaS where a third party is running the model).

By the way, I saw there is a project dashboard app which might support the visualization I am looking for. Is it suitable for such a use case?

Hmm, interesting. Actually it might: it does collect metrics over time and averages them.

  
  
Posted 5 months ago

Hi @<1523701205467926528:profile|AgitatedDove14> ,
I guess I can log the input-output pairs and report the average accuracy as a scalar. However, I'm not sure if this is the right way to monitor my data. Obviously, using iterations makes sense when training a model and tracking the loss, but when we are in production, I'm not sure if this dashboard is meant for that purpose.
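For reference, the manual approach mentioned above (one scalar per batch of samples instead of one per sample) could look roughly like the sketch below. This is only an illustration, not an official ClearML recipe: the `WindowedAccuracy` class, its `report` callback, and the window size are all hypothetical names chosen for this example.

```python
from typing import Callable, List, Optional, Tuple


class WindowedAccuracy:
    """Collect binary correctness flags and emit one accuracy value per full window.

    Hypothetical helper for illustration; not part of ClearML.
    """

    def __init__(self, window_size: int, report: Callable[[int, float], None]) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self.report = report  # called as report(iteration, accuracy)
        self._correct = 0
        self._seen = 0
        self._iteration = 0

    def add(self, is_correct: bool) -> Optional[float]:
        """Record one sample; return the window accuracy when a window completes."""
        self._correct += int(is_correct)
        self._seen += 1
        if self._seen < self.window_size:
            return None
        accuracy = self._correct / self._seen
        self.report(self._iteration, accuracy)
        # Start a fresh window; each completed window counts as one "iteration".
        self._iteration += 1
        self._correct = 0
        self._seen = 0
        return accuracy


# Demo with a tiny window so the behaviour is easy to follow.
reported: List[Tuple[int, float]] = []
tracker = WindowedAccuracy(window_size=4, report=lambda i, acc: reported.append((i, acc)))
for flag in [1, 1, 0, 1, 0, 0, 1, 0, 1]:
    tracker.add(bool(flag))
print(reported)  # two full windows are reported; the ninth sample stays pending
```

With ClearML, the `report` callback could presumably wrap `Logger.current_logger().report_scalar(title="accuracy", series="production", value=acc, iteration=i)`, with `window_size=100` to match the question; the title and series names there are placeholders.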

  
  
Posted 5 months ago

So first, yes, I totally agree. This is why clearml-serving has a dedicated statistics module that creates histograms over time; we then push them into Prometheus and connect Grafana to it for dashboards and alerts.
To be honest, I would just use it instead of reporting manually, wdyt?

  
  
Posted 5 months ago

Hi @<1724960475575226368:profile|GloriousKoala29>

Is there a way to aggregate the results, such as defining an iteration as the accuracy of 100 samples

Hmm, I'm assuming what you actually want is to store it with the actual input/output and a score, is that correct?

  
  
Posted 5 months ago
463 Views