For Clearml Serving, If I Am Trying To Deploy 100 Models On A Gpu That Can Handle 5 Concurrently, But Each One Will Be Sporadically Used (Fine Tuned Models Trained For Different Customers), Can Clearml-Serving Automatically Load And Unload Models Based Up
Let's see if I understand:
- Triton server deployments only have manual, static deployment of models for inferencing (without enterprise)
- ClearML can load and unload models based upon usage, but has to do so from the hard drive
- Triton server does not support saving models off to normal RAM for faster loading/unloading
- Therefore, currently, we can deploy 100 models even though only 5 can be loaded concurrently, but each automatic unload/load by ClearML will take a few seconds, depending on the model size, because the model is read from the SSD.
If this is the case, that should be acceptable for our application.
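To make sure I have the behavior right, here is a toy sketch of the load/unload pattern I described: 100 possible models, room for 5 at a time, and the least recently used one evicted when a new one is requested. This is not ClearML Serving's or Triton's actual code or API; `LruModelCache`, `fake_load`, and the eviction policy are all hypothetical names and assumptions for illustration only.

```python
from collections import OrderedDict


class LruModelCache:
    """Hypothetical illustration: keeps at most `capacity` models loaded."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # stands in for reading the model from disk
        self._loaded = OrderedDict()    # model_id -> model, oldest use first
        self.loads = 0
        self.evictions = 0

    def get(self, model_id):
        # Already loaded: mark as most recently used and return it.
        if model_id in self._loaded:
            self._loaded.move_to_end(model_id)
            return self._loaded[model_id]
        # At capacity: unload the least recently used model first.
        if len(self._loaded) >= self.capacity:
            self._loaded.popitem(last=False)
            self.evictions += 1
        # Cold load (this is the step that would cost a few seconds from SSD).
        model = self.load_fn(model_id)
        self.loads += 1
        self._loaded[model_id] = model
        return model

    def loaded_ids(self):
        return list(self._loaded)


def fake_load(model_id):
    # Assumption: a real loader would read weights from disk onto the GPU.
    return f"model-{model_id}"


cache = LruModelCache(capacity=5, load_fn=fake_load)
for customer in [0, 1, 2, 3, 4, 0, 5, 1]:
    cache.get(customer)

print(cache.loaded_ids(), cache.loads, cache.evictions)
```

In this sketch only the requests that miss the cache pay the load cost, which matches my expectation that sporadically used customer models would see a delay of a few seconds on their first request after being unloaded.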
one year ago