Answered
For Clearml Serving, If I Am Trying To Deploy 100 Models On A Gpu That Can Handle 5 Concurrently, But Each One Will Be Sporadically Used (Fine Tuned Models Trained For Different Customers), Can Clearml-Serving Automatically Load And Unload Models Based Up

For ClearML serving, if I am trying to deploy 100 models on a GPU that can handle 5 concurrently, but each one will be used sporadically (fine-tuned models trained for different customers), can ClearML-serving automatically load and unload models based upon usage, or will I have to manage the process manually? To my understanding, this would mean that when a customer runs inference on their model, there may be a 5-second latency for the first inference, but then it would be fast (unless more than 5 customers are trying to access their models at the same time 🙂 ).

  
  
Posted 11 months ago

Answers 7


Let's see if I understand:

  • Triton server deployments only have manual, static deployment of models for inferencing (without enterprise)
  • ClearML can load and unload models based upon usage, but has to do so from the hard drive
  • Triton server does not support saving models off to normal RAM for faster loading/unloading
  • Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few seconds because it is being read from the SSD, depending on the size.
    If this is the case, that should be acceptable for our application.
  
  
Posted 11 months ago

If ClearML does not implement this, we may have to ourselves - None .

  
  
Posted 11 months ago

Hi @<1523711619815706624:profile|StrangePelican34>

if I am trying to deploy 100 models on a GPU that can handle 5 concurrently,

The main limitation is Triton's ability to dynamically load/unload models. We know Nvidia is adding this capability, but I think it is still not out. Once they support it, it should be transparent.

  
  
Posted 11 months ago

  • Triton server does not support saving models off to normal RAM for faster loading/unloading

Correct, the enterprise version also does not support RAM caching

Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few seconds because it is being read from the SSD, depending on the size.

Correct, there is also deserializing CPU time (imagine unpickling a 20GB file, this takes time... and actually this is the main bottleneck, not just IO)
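To see how load time splits between disk IO and deserialization for your own checkpoints, a rough measurement like the one below can help. This is only a sketch: `time_load` and `make_dummy_checkpoint` are hypothetical helper names, the dummy payload is a stand-in for a real model file, and it assumes the model is stored as a plain pickle.

```python
import os
import pickle
import tempfile
import time


def time_load(path):
    """Return (read_seconds, deserialize_seconds, obj) for a pickled file."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()            # disk IO only
    t1 = time.perf_counter()
    obj = pickle.loads(raw)       # CPU-bound deserialization
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, obj


def make_dummy_checkpoint(n_items=200_000):
    """Write a stand-in 'model' (hypothetical data) to a temp file."""
    payload = {f"layer_{i}": [float(i)] * 4 for i in range(n_items)}
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(payload, f)
    return path, payload


if __name__ == "__main__":
    path, _ = make_dummy_checkpoint()
    read_s, deser_s, _ = time_load(path)
    print(f"read: {read_s:.3f}s  deserialize: {deser_s:.3f}s")
    os.remove(path)
```

Note that a file that was just written is usually still in the OS page cache, so the read time here will look better than a cold read from the SSD.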

  
  
Posted 11 months ago

I checked Triton and found these references:

  • None
  • None

It appears that "they sell that" as Triton Management Service, part of None . It is possible to do through their API, but would need to be explicit. Moreover, there are likely a few different algorithms that could be used to maximize usage and minimize downtime. It would be nice to have at least a simple algorithm baked into ClearML for serving models at a smallish scale, such as:

  • Assume:
    - All models are of the same size when loaded
    - The max number of instances of an individual model is 1
  • Config:
    - Number of seconds to assess usage over (rule of thumb -> 5x model loading time?)
    - Auto-unload a model if it has not been used for x minutes (default 5?)
    - Number of models that need to be unloaded before x minutes to require adding a new auto-scaled instance (default 5?)
  • Load in the model with the largest number of elements in its queue, and only pull in one at a time
    - If there is not enough space, unload the model with the oldest "last inference" time if it is over n (60?) seconds ago
    - Else, unload the model that has an empty queue and also has the least number of incoming requests over the past n (60?) seconds
  • If the frequency of unloading models is greater than the threshold, add another auto-scaled instance
  • If the loaded models can fit on fewer instances than are currently scaled, gracefully consolidate
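The single-instance part of the policy above could be sketched roughly as follows. This is not ClearML or Triton code: `Scheduler`, `ModelState`, and every parameter name are hypothetical, the auto-scaling and consolidation rules are left out, and `last_used` is also stamped at load time (an added assumption) so a freshly loaded model is not evicted before it serves anything.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ModelState:
    pending: int = 0                         # requests waiting in the queue
    last_used: float = float("-inf")         # last inference (or load) time
    recent: deque = field(default_factory=deque)  # recent arrival times


class Scheduler:
    def __init__(self, capacity=5, idle_unload_s=300.0, evict_after_s=60.0):
        self.capacity = capacity             # models that fit on the GPU
        self.idle_unload_s = idle_unload_s   # "auto-unload after x minutes"
        self.evict_after_s = evict_after_s   # the "n (60?) seconds" window
        self.models = {}
        self.loaded = set()

    def request(self, name, now):
        state = self.models.setdefault(name, ModelState())
        state.pending += 1
        state.recent.append(now)

    def complete(self, name, now):
        state = self.models[name]
        state.pending = max(0, state.pending - 1)
        state.last_used = now

    def _recent_count(self, name, now):
        arrivals = self.models[name].recent
        while arrivals and now - arrivals[0] > self.evict_after_s:
            arrivals.popleft()               # drop arrivals outside the window
        return len(arrivals)

    def _pick_victim(self, now):
        candidates = sorted(self.loaded)     # sorted for deterministic ties
        if not candidates:
            return None
        oldest = min(candidates, key=lambda n: self.models[n].last_used)
        if now - self.models[oldest].last_used > self.evict_after_s:
            return oldest
        idle = [n for n in candidates if self.models[n].pending == 0]
        if idle:
            return min(idle, key=lambda n: self._recent_count(n, now))
        return None                          # every loaded model is busy

    def tick(self, now):
        """Run one scheduling step and return the (action, model) list."""
        actions = []
        for name in sorted(self.loaded):
            state = self.models[name]
            if state.pending == 0 and now - state.last_used > self.idle_unload_s:
                self.loaded.discard(name)
                actions.append(("unload", name))
        waiting = [n for n, s in self.models.items()
                   if n not in self.loaded and s.pending > 0]
        if not waiting:
            return actions
        target = max(waiting, key=lambda n: self.models[n].pending)
        if len(self.loaded) >= self.capacity:
            victim = self._pick_victim(now)
            if victim is None:
                return actions               # nothing can be evicted yet
            self.loaded.discard(victim)
            actions.append(("unload", victim))
        self.loaded.add(target)              # only one load per tick
        self.models[target].last_used = now
        actions.append(("load", target))
        return actions
```

The returned actions are just tuples; a real deployment would have to translate them into whatever load/unload calls the serving backend exposes.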
  
  
Posted 11 months ago

That is great to hear! Is there any documentation on how it works, and if it can be configured?

  
  
Posted 11 months ago

It appears that "they sell that" as Triton Management Service, part of

. It is possible to do through their API, but would need to be explicit.

We support that, but this is not dynamic loading. It is just removing and adding models; it does not unload them from the GRAM.
That's the main issue. When we unload the model, it is unloaded. To do this dynamically, they need to be able to save it in RAM and unload it from GRAM, and that's the feature that is missing on all Triton deployments.
Does that make sense?
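For reference, the "removing and adding models" path might look something like the sketch below when driven by hand. It assumes a client object with `load_model(name)` and `unload_model(name)` methods; as far as I know, `tritonclient.http.InferenceServerClient` exposes methods with these names when the server runs with `--model-control-mode=explicit`, but treat that as an assumption to check against your Triton version. `swap_model` and `RecordingClient` are hypothetical names used only for this dry run.

```python
from typing import Optional, Protocol


class ModelControlClient(Protocol):
    """Minimal interface assumed here; not tied to a specific library."""

    def load_model(self, model_name: str) -> None: ...

    def unload_model(self, model_name: str) -> None: ...


def swap_model(client: ModelControlClient, evict: Optional[str], load: str) -> None:
    """Unload `evict` (if given), then load `load`.

    The load is a full reload from the model repository, so it pays
    the disk read and deserialization cost every time.
    """
    if evict is not None:
        client.unload_model(evict)
    client.load_model(load)


class RecordingClient:
    """Stand-in client that records calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def load_model(self, model_name):
        self.calls.append(("load", model_name))

    def unload_model(self, model_name):
        self.calls.append(("unload", model_name))


if __name__ == "__main__":
    client = RecordingClient()
    swap_model(client, evict="customer_a", load="customer_b")
    print(client.calls)
```

Nothing here keeps the evicted model in host RAM, which is exactly the missing piece described above.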

  
  
Posted 11 months ago