Unanswered
Hi Everyone,
I'M Using Clearml-Serving With Triton And Have A Couple Of Questions Regarding Model Management:
Hi @<1690896098534625280:profile|NarrowWoodpecker99>
Once a model is loaded into GPU memory for the first time, does it stay loaded across subsequent requests,
yes it does.
Are there configuration options available that allow us to control this behavior?
I'm assuming your're thinking dynamic loading/unloading models from memory based on requests?
I wish Triton added that 🙂 this is not trivial and in reality to be fast enough the model has to leave in RAM then moved to GPU (which actually takes a while)
57 Views
0
Answers
4 months ago
4 months ago