Answered
Hi everyone,
I'm using ClearML-Serving with Triton and have a couple of questions regarding model management:

  • Once a model is loaded into GPU memory for the first time, does it stay loaded across subsequent requests, or does it get unloaded and reloaded with each new request?
  • Are there configuration options available that allow us to control this behavior?

Any insights or guidance on how to configure these settings would be greatly appreciated to help optimize our resource usage.
Thank you!
  
  
Posted 6 months ago

Answers 9


Hi Martin. Thanks for the answer. Ah, so the delay in unloading causes a timeout. That speed depends on model sizes, right?

As a workaround, how about a simpler approach: unloading the least-used models after X minutes of sitting unused, enough to free up memory for any model to load? Hope that makes sense. This would not work under heavy loads, but e.g. we have models used only once a week. They would just stay unloaded until use, and could be offloaded afterwards.
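
The idle-timeout idea above can be sketched in a few lines. This is only an illustration of the policy, not something ClearML-Serving or Triton provides: `IdleUnloader`, `touch`, `sweep` and the injected `unload_fn` are hypothetical names, and the actual unload call (for example a Triton model-control request) is left to the caller.

```python
import time
from typing import Callable, Dict, List


class IdleUnloader:
    """Tracks last-use times and unloads models that sat idle too long.

    Hypothetical helper: unload_fn is whatever actually unloads a model
    (an assumption here, e.g. a Triton model-control call).
    """

    def __init__(self, unload_fn: Callable[[str], None], max_idle_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._unload_fn = unload_fn
        self._max_idle_s = max_idle_s
        self._clock = clock
        self._last_used: Dict[str, float] = {}

    def touch(self, model: str) -> None:
        """Record that `model` is loaded and just served a request."""
        self._last_used[model] = self._clock()

    def sweep(self) -> List[str]:
        """Unload every model idle longer than max_idle_s; return their names."""
        now = self._clock()
        idle = [m for m, t in self._last_used.items()
                if now - t > self._max_idle_s]
        for model in idle:
            self._unload_fn(model)
            del self._last_used[model]
        return idle
```

A background thread or scheduled job would call `sweep()` periodically, outside the request path. The first request to an unloaded model still pays the full load time, so the request timeout has to allow for it.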

  
  
Posted 6 months ago

... Would not work for huge llm style models.

yes I agree... but then if the model is small enough then you can just keep it in memory ...

  
  
Posted 6 months ago

Hi @<1713001673095385088:profile|EmbarrassedWalrus44>
So Triton has load/unload model, but these are slowwww, meaning you cannot use them inside a request (you'll just hit the request timeout every time it tries to load the model)
As you can see, this is classified as "wish-list". It is not trivial to implement and requires large CPU RAM to store the entire model, so that "loading" becomes moving it from CPU to GPU memory (which is also not the fastest, but the best you can do). As far as I understand there is no "target date" for this feature 😞

  
  
Posted 6 months ago

That speed depends on model sizes, right?

in general yes

Hope that makes sense. This would not work under heavy loads, but eg we have models used once a week only. They would just stay unloaded until use - and could be offloaded afterwards.

but then you might still encounter a timeout the first time you access them, no?

  
  
Posted 6 months ago

Thanks for asking about this - I have the exact same issue. Could the Triton model management API be used to load/unload the models?
https://github.com/triton-inference-server/server/issues/5345
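
For reference, a minimal sketch of what calling Triton's model-control endpoints over HTTP could look like. It assumes the server runs with `--model-control-mode=explicit` and exposes the model repository extension at `v2/repository/models/<name>/load` and `/unload`; the helper names, the empty JSON body and the 120-second timeout are illustrative assumptions, and ClearML-Serving does not necessarily configure Triton this way.

```python
import json
import urllib.parse
import urllib.request


def repository_url(base_url: str, model: str, action: str) -> str:
    """Build the model-control URL for a 'load' or 'unload' request."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    name = urllib.parse.quote(model, safe="")
    return f"{base_url.rstrip('/')}/v2/repository/models/{name}/{action}"


def control_model(base_url: str, model: str, action: str,
                  timeout_s: float = 120.0) -> int:
    """POST a load/unload request and return the HTTP status code.

    The timeout is generous on purpose: loading can take tens of seconds.
    """
    request = urllib.request.Request(
        repository_url(base_url, model, action),
        data=json.dumps({}).encode("utf-8"),  # assumed: empty JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

For example, `control_model("http://localhost:8000", "my_model", "unload")` would ask a local server to unload `my_model` (both the address and the model name are placeholders).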

  
  
Posted 6 months ago

Unless you set a very long timeout. Usually all models load in less than 1 min, smaller ones much faster. Would not work for huge llm style models.

  
  
Posted 6 months ago

Models that fit into around 8-24 GB of memory are quite common, at least here. If they are used rarely and you have a lot of them, that is a lot of wasted GPU resources. They can take about 10-40 secs to load. Hot swapping would be ideal, but as a fallback, one could unload the least-used models to keep enough GPU memory free to load any model on request. Tricky issue!
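
The fallback described here is essentially a least-recently-used cache with a memory budget. Below is a rough sketch under stated assumptions: model sizes are caller-supplied estimates in GB (nothing is queried from Triton), and `GpuLruCache`, `load_fn` and `unload_fn` are hypothetical names standing in for whatever actually loads and unloads a model.

```python
from collections import OrderedDict
from typing import Callable, List


class GpuLruCache:
    """Evicts least-recently-used models until a new one fits the budget."""

    def __init__(self, capacity_gb: float,
                 load_fn: Callable[[str], None],
                 unload_fn: Callable[[str], None]) -> None:
        self._capacity_gb = capacity_gb
        self._load_fn = load_fn
        self._unload_fn = unload_fn
        # name -> estimated size in GB, ordered from least to most recent
        self._loaded: "OrderedDict[str, float]" = OrderedDict()

    def ensure_loaded(self, model: str, size_gb: float) -> List[str]:
        """Make sure `model` is loaded; return the models evicted for it."""
        if model in self._loaded:
            self._loaded.move_to_end(model)  # mark as most recently used
            return []
        if size_gb > self._capacity_gb:
            raise ValueError(f"{model} does not fit in {self._capacity_gb} GB")
        evicted: List[str] = []
        while sum(self._loaded.values()) + size_gb > self._capacity_gb:
            victim, _ = self._loaded.popitem(last=False)  # oldest entry
            self._unload_fn(victim)
            evicted.append(victim)
        self._load_fn(model)
        self._loaded[model] = size_gb
        return evicted
```

Eviction and loading happen inside `ensure_loaded`, so with load times of 10-40 seconds the caller needs a request timeout long enough to cover them.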

  
  
Posted 6 months ago

Hi @<1690896098534625280:profile|NarrowWoodpecker99>

Once a model is loaded into GPU memory for the first time, does it stay loaded across subsequent requests,

yes it does.

Are there configuration options available that allow us to control this behavior?

I'm assuming you're thinking of dynamically loading/unloading models from memory based on requests?
I wish Triton added that 🙂 this is not trivial, and in reality, to be fast enough the model has to live in RAM and then be moved to the GPU (which actually takes a while)

  
  
Posted 6 months ago

Maybe combining the two, with an unload gRPC api we could have that ability moved to the "preprocessing" logic, wdyt?

  
  
Posted 6 months ago