For ClearML Serving, If I Am Trying To Deploy 100 Models On A GPU That Can Handle 5 Concurrently, But Each One Will Be Sporadically Used (Fine-Tuned Models Trained For Different Customers), Can ClearML-Serving Automatically Load And Unload Models Based Upon Usage?
I checked Triton for prior art. It appears that "they sell that" as Triton Management Service, part of NVIDIA AI Enterprise. It is possible to do through Triton's own API, but loading and unloading would need to be explicit (see the sketches after the list below). Moreover, there are likely a few different algorithms that could be used to maximize usage and minimize downtime. It would be nice to have at least a simple algorithm baked into ClearML for serving models at a smallish scale, such as:
- Assume:
  - All models are the same size when loaded
  - The max number of instances of an individual model is 1
- Config:
  - Number of seconds to assess usage over (rule of thumb -> 5x model loading time?)
  - Auto-unload a model if it has not been used for x minutes (default 5?)
  - Number of models that need to be unloaded within x minutes before a new auto-scaled instance is added (default 5?)
- Algorithm (sketched in code below):
  - Load the model with the largest number of elements in its queue, and only pull in one at a time
  - If there is not enough space, unload the model with the oldest "last inference" time, provided that was more than n (60?) seconds ago
  - Else, unload the model that has an empty queue and the fewest incoming requests over the past n (60?) seconds
  - If the frequency of unloading models exceeds the threshold, add another auto-scaled instance
  - If the loaded models can fit on fewer instances than are currently scaled, gracefully consolidate
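For reference, explicit loading and unloading through Triton's own API looks roughly like this. A minimal sketch, assuming a Triton server started with `--model-control-mode=explicit`; the model name is hypothetical:

```python
# Minimal sketch of Triton's explicit model-control API.
# Assumes tritonserver was started with --model-control-mode=explicit;
# "customer_42_model" is a hypothetical per-customer fine-tuned model.
from tritonclient.http import InferenceServerClient

client = InferenceServerClient(url="localhost:8000")

client.load_model("customer_42_model")       # pull the model onto the GPU
assert client.is_model_ready("customer_42_model")
# ... run inference while it is resident ...
client.unload_model("customer_42_model")     # free the GPU memory again
```

These calls are the primitive that any auto-load/unload policy would have to drive.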
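And a rough sketch of the policy itself, just to make the proposal concrete. All class names, fields, and defaults here are hypothetical, and the actual load/unload operations are placeholders for whatever the serving backend exposes (e.g. the Triton calls above):

```python
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ModelState:
    name: str
    loaded: bool = False
    queue_depth: int = 0                  # pending inference requests
    last_inference: float = 0.0           # timestamp of most recent request served
    request_times: deque = field(default_factory=deque)  # recent request arrivals

class UnloadPolicy:
    def __init__(self, capacity=5, idle_unload_s=300, stale_s=60,
                 window_s=60, unloads_before_scale_up=5):
        self.capacity = capacity                      # models that fit concurrently
        self.idle_unload_s = idle_unload_s            # auto-unload after x idle minutes
        self.stale_s = stale_s                        # "oldest last-inference" cutoff (n)
        self.window_s = window_s                      # usage-assessment window
        self.unloads_before_scale_up = unloads_before_scale_up
        self.models = {}                              # name -> ModelState
        self.recent_unloads = deque()                 # timestamps of recent unloads

    def _loaded(self):
        return [m for m in self.models.values() if m.loaded]

    def step(self, now=None):
        """One scheduling pass; returns a scaling hint for the autoscaler."""
        if now is None:
            now = time.time()

        # Auto-unload anything idle past the cutoff.
        for m in self._loaded():
            if m.queue_depth == 0 and now - m.last_inference > self.idle_unload_s:
                self._unload(m, now)

        # Load the unloaded model with the deepest queue -- only one per pass.
        waiting = [m for m in self.models.values()
                   if not m.loaded and m.queue_depth > 0]
        if waiting:
            hungriest = max(waiting, key=lambda m: m.queue_depth)
            if len(self._loaded()) >= self.capacity:
                self._evict_one(now)
            if len(self._loaded()) < self.capacity:
                hungriest.loaded = True   # placeholder for the backend's load call

        # Scaling hints: too much churn -> scale up; lots of headroom -> consolidate.
        while self.recent_unloads and now - self.recent_unloads[0] > self.window_s:
            self.recent_unloads.popleft()
        if len(self.recent_unloads) >= self.unloads_before_scale_up:
            return "scale_up"
        if len(self._loaded()) <= self.capacity // 2:
            return "consolidate"
        return "steady"

    def _evict_one(self, now):
        loaded = self._loaded()
        # Prefer the model with the oldest "last inference" time, if stale enough...
        oldest = min(loaded, key=lambda m: m.last_inference)
        if now - oldest.last_inference > self.stale_s:
            self._unload(oldest, now)
            return
        # ...otherwise the empty-queue model with the fewest recent requests.
        idle = [m for m in loaded if m.queue_depth == 0]
        if idle:
            fewest = min(idle, key=lambda m: sum(
                1 for t in m.request_times if now - t <= self.window_s))
            self._unload(fewest, now)

    def _unload(self, m, now):
        m.loaded = False                  # placeholder for the backend's unload call
        self.recent_unloads.append(now)
```

The point is just that the whole policy fits in well under a hundred lines, so even a simple version baked into clearml-serving would go a long way for this kind of many-models, sporadic-traffic setup.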