For Clearml Serving, If I Am Trying To Deploy 100 Models On A Gpu That Can Handle 5 Concurrently, But Each One Will Be Sporadically Used (Fine Tuned Models Trained For Different Customers), Can Clearml-Serving Automatically Load And Unload Models Based Up
- Triton server does not support saving models off to normal RAM for faster loading/unloading
Correct, the enterprise version also does not support RAM caching.
Therefore, currently, we can deploy 100 models when only 5 can be loaded concurrently, but each time a model is unloaded/loaded (automatically by ClearML), it will take a few seconds, depending on its size, because it is read from the SSD.
Correct, there is also CPU time spent deserializing (imagine unpickling a 20 GB file; this takes time, and it is actually the main bottleneck, not just I/O).
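To make the load/unload behaviour discussed above concrete, here is a minimal Python sketch of a least-recently-used cache that keeps at most 5 models loaded out of many deployed. This is only an illustration and not how ClearML Serving or Triton implement it: `LRUModelCache`, `load_fn` and `timed_unpickle` are hypothetical names, and LRU is an assumed eviction policy. The second helper times only the deserialization step, which is the cost described above as the main bottleneck.

```python
import pickle
import time
from collections import OrderedDict


class LRUModelCache:
    """Keep at most `capacity` models loaded; evict the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader: model_id -> model object
        self._loaded = OrderedDict()    # model_id -> model, least recently used first

    def get(self, model_id):
        if model_id in self._loaded:
            self._loaded.move_to_end(model_id)   # mark as most recently used
            return self._loaded[model_id]
        if len(self._loaded) >= self.capacity:
            self._loaded.popitem(last=False)     # unload the least recently used model
        model = self.load_fn(model_id)           # cold load: disk read + deserialization
        self._loaded[model_id] = model
        return model

    def loaded_ids(self):
        return list(self._loaded)


def timed_unpickle(raw_bytes):
    """Return (object, seconds spent deserializing) for bytes already read from disk."""
    start = time.perf_counter()
    obj = pickle.loads(raw_bytes)
    return obj, time.perf_counter() - start


# Example: 7 models requested in turn, only 5 may stay loaded at once.
cache = LRUModelCache(capacity=5, load_fn=lambda model_id: {"id": model_id})
for model_id in range(7):
    cache.get(model_id)
print(cache.loaded_ids())  # [2, 3, 4, 5, 6]: models 0 and 1 were evicted
```

With sporadic traffic across 100 customer models, every request for a model that is not among the 5 currently loaded pays the full cold-load cost (read plus deserialize), so the latency depends heavily on how often requests hit an already loaded model.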