For Clearml Serving, If I Am Trying To Deploy 100 Models On A Gpu That Can Handle 5 Concurrently, But Each One Will Be Sporadically Used (Fine Tuned Models Trained For Different Customers), Can Clearml-Serving Automatically Load And Unload Models Based Upon Usage?
Let's see if I understand:
- Triton server deployments only support manual, static loading of models for inference (without the enterprise version)
- ClearML can load and unload models based upon usage, presumably by driving Triton's model-control API (see the sketch after this list), but each load has to read the model from disk
- Triton server does not support caching models in system RAM for faster loading/unloading
- Therefore, we can currently deploy 100 models when only 5 can be loaded concurrently, but each automatic load/unload (handled by ClearML) will take a few seconds because the weights are read from the SSD, with the exact time depending on model size.
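As I understand it, the load/unload mechanics come down to Triton's model-repository HTTP API, which is available when the server runs with `--model-control-mode=explicit`. Below is a minimal sketch of those calls with a hypothetical server address; this is how I imagine the swaps are driven under the hood, not something I've confirmed in the ClearML source:

```python
# Sketch of Triton's explicit model-control HTTP API.
# Assumes a Triton server started with --model-control-mode=explicit,
# reachable at the (hypothetical) address below.
import requests

TRITON_URL = "http://localhost:8000"  # hypothetical address

def load_model(name: str) -> None:
    # Ask Triton to load `name` from its model repository into GPU memory.
    r = requests.post(f"{TRITON_URL}/v2/repository/models/{name}/load")
    r.raise_for_status()

def unload_model(name: str) -> None:
    # Evict `name` from GPU memory; the files stay in the repository on disk.
    r = requests.post(f"{TRITON_URL}/v2/repository/models/{name}/unload")
    r.raise_for_status()

def is_ready(name: str) -> bool:
    # HTTP 200 means the model is loaded and ready to serve.
    return requests.get(f"{TRITON_URL}/v2/models/{name}/ready").status_code == 200

# e.g. swap one customer's fine-tuned model in for another:
unload_model("customer_a_model")
load_model("customer_b_model")
```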
If this is the case, that should be acceptable for our application.
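For the "few seconds" figure, here's the rough arithmetic I'm assuming; every number below is a placeholder to be replaced with our actual model sizes and hardware:

```python
# Rough back-of-envelope for per-swap latency; all numbers below are
# assumptions, not measurements - substitute your own model sizes and hardware.
model_size_gb = 2.0     # assumed size of one fine-tuned model on disk
ssd_read_gbps = 3.0     # assumed NVMe SSD sequential read throughput (GB/s)
pcie_gbps = 16.0        # assumed host-to-GPU transfer rate (GB/s, ~PCIe 4.0 x16)
init_overhead_s = 1.0   # assumed framework/Triton model initialization time

swap_latency_s = (model_size_gb / ssd_read_gbps   # read weights from the SSD
                  + model_size_gb / pcie_gbps     # copy weights to the GPU
                  + init_overhead_s)              # build the runtime/session

print(f"~{swap_latency_s:.1f} s per model swap")  # ~1.8 s with these numbers
```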