Let's see if I understand:
- Triton server deployments only have manual, static deployment of models for inferencing (without enterprise)
- ClearML can load and unload models based upon usage, but has to do so from the hard drive
- Triton server does not support saving models off to normal RAM for faster loading/unloading
- Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few seconds, depending on the size, because the model is read from the SSD.
If this is the case, that should be acceptable for our application.
If ClearML does not implement this, we may have to implement it ourselves - None .
Hi @<1523711619815706624:profile|StrangePelican34>
if I am trying to deploy 100 models on a GPU that can handle 5 concurrently,
The main limitation is Triton's ability to dynamically load / unload models. We know Nvidia is adding this capability, but I think it is still not out. Once they support it, it should be transparent.
- Triton server does not support saving models off to normal RAM for faster loading/unloading
Correct, the enterprise version also does not support RAM caching.
Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few seconds because it is being read from the SSD, depending on the size.
Correct, there is also the CPU time spent deserializing (imagine unpickling a 20GB file, this takes time... and this is actually the main bottleneck, not just the IO).
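To check whether disk IO or deserialization dominates for a given model, a rough measurement like the one below may help. It is only a sketch: the toy `weights` dictionary and the helper name `time_model_load` are made up for illustration, a real model file is far larger, and a file that was just written is usually served from the OS page cache, so the read time here will look better than a cold read from the SSD.

```python
import os
import pickle
import tempfile
import time


def time_model_load(path):
    """Time the raw file read and the unpickling step separately."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()              # disk IO only
    t1 = time.perf_counter()
    obj = pickle.loads(raw)         # CPU-bound deserialization
    t2 = time.perf_counter()
    return {"read_s": t1 - t0, "deserialize_s": t2 - t1, "object": obj}


# Toy stand-in for model weights (hypothetical data, not a real model).
weights = {f"layer_{i}": [float(j) for j in range(10_000)] for i in range(50)}

with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump(weights, tmp)

result = time_model_load(tmp.name)
os.remove(tmp.name)

print(f"read: {result['read_s']:.4f}s, deserialize: {result['deserialize_s']:.4f}s")
```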
I checked Triton and found these references:
- None
- None
It appears that "they sell that" as Triton Management Service, part of None . It is possible to do through their API, but would need to be explicit. Moreover, there are likely a few different algorithms that could be used to maximize usage and minimize downtime. It would be nice to have at least a simple algorithm baked into ClearML for serving models at a smallish scale, such as:
- Assume:
  - All models are of the same size when loaded
  - The max number of instances of an individual model is 1
- Config:
  - Number of seconds to assess usage over (rule of thumb -> 5x model loading time?)
  - Auto-unload a model if it has not been used for x minutes (default 5?)
  - Number of models that must be unloaded within x minutes before a new auto-scaled instance is added (default 5?)
- Load in the model with the largest number of elements in its queue - and only pull in one at a time
  - If not enough space, unload the model with the oldest "last inference" time if it is over n (60?) seconds ago
  - Else, unload the model that has an empty queue and also has the least number of incoming requests over the past n (60?) seconds
- If the frequency of unloading models is greater than the threshold, add another auto-scaled instance
- If the loaded models can fit on fewer instances than are currently scaled, gracefully consolidate
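The load/unload part of the rules above could be sketched roughly as follows. This is not ClearML or Triton code: `ModelState`, `pick_victim` and `schedule_step` are hypothetical names, all models are assumed to be the same size (so `capacity` is just a count), and the auto-scaling and consolidation rules are left out. One extra assumption not in the list: a freshly loaded model has its "last inference" time set to the load time, so it is not evicted on the very next step.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    name: str
    queue_len: int = 0            # requests currently waiting for this model
    last_inference: float = 0.0   # time (seconds) of the last served request
    recent_requests: int = 0      # requests seen over the past n seconds
    loaded: bool = False


def pick_victim(models, now, idle_after=60.0):
    """Pick a loaded model to unload, or None if nothing qualifies."""
    loaded = [m for m in models if m.loaded]
    # Rule 1: oldest "last inference" time, if it is more than idle_after ago.
    idle = [m for m in loaded if now - m.last_inference > idle_after]
    if idle:
        return min(idle, key=lambda m: m.last_inference)
    # Rule 2: empty queue and the fewest recent incoming requests.
    quiet = [m for m in loaded if m.queue_len == 0]
    if quiet:
        return min(quiet, key=lambda m: m.recent_requests)
    return None


def schedule_step(models, now, capacity, idle_after=60.0):
    """Load at most one model per step; return (loaded_name, unloaded_name)."""
    waiting = [m for m in models if not m.loaded and m.queue_len > 0]
    if not waiting:
        return None, None
    candidate = max(waiting, key=lambda m: m.queue_len)  # longest queue first
    unloaded = None
    if sum(m.loaded for m in models) >= capacity:
        victim = pick_victim(models, now, idle_after)
        if victim is None:
            return None, None     # nothing safe to evict, try again later
        victim.loaded = False
        unloaded = victim.name
    candidate.loaded = True
    candidate.last_inference = now  # assumption: loading counts as activity
    return candidate.name, unloaded


models = [
    ModelState("a", queue_len=0, last_inference=10.0, recent_requests=3, loaded=True),
    ModelState("b", queue_len=0, last_inference=90.0, recent_requests=1, loaded=True),
    ModelState("c", queue_len=4),
    ModelState("d", queue_len=2),
]
print(schedule_step(models, now=100.0, capacity=2))  # ('c', 'a')
```

Counting how often `schedule_step` returns a non-empty `unloaded_name` over a time window would give the "frequency of unloading" signal that the scaling rule needs.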
That is great to hear! Is there any documentation on how it works, and if it can be configured?
It appears that "they sell that" as Triton Management Service, part of None . It is possible to do through their API, but would need to be explicit.
We support that, but this is not dynamic loading. It is just removing and adding models; it does not swap them out of the GRAM.
That's the main issue: when we unload the model, it is fully unloaded. To make this dynamic, Triton needs to be able to keep the model in RAM while unloading it from GRAM, and that is the feature that is missing on all Triton deployments.
Does that make sense?