EmbarrassedWalrus44

0 Questions, 4 Answers

Active since 16 June 2024

Last activity one year ago

Reputation

Answers 4

0 Hi Everyone, I'M Using Clearml-Serving With Triton And Have A Couple Of Questions Regarding Model Management:

Hi Martin . Thanks for the answer . Ah so the delay in unloading cause a timeout . That speed depends on model sizes, right?

As a workaround, how about more
simple approach of unloading of the least used models after X minutes of sitting unused - enough to free up memory for any model to load? Hope that makes sense . This would not work under heavy loads, but eg we have models used once a week only . They would just stay unloaded until use - and could be offloaded afterwards .

one year ago

0 Hi Everyone, I'M Using Clearml-Serving With Triton And Have A Couple Of Questions Regarding Model Management:

Thanks for asking about this - I have the exact same issue. Could the Triton model management API be used to load/unload the models?
https://github.com/triton-inference-server/server/issues/5345

one year ago

0 Hi Everyone, I'M Using Clearml-Serving With Triton And Have A Couple Of Questions Regarding Model Management:

Unless you set a very long time out . Usually all models load in less than 1 min, smaller ones much faster . Would not work for huge llm style models .

one year ago

0 Hi Everyone, I'M Using Clearml-Serving With Triton And Have A Couple Of Questions Regarding Model Management:

The models that fit into around 8-24Gb mem are quite common, at least here . If they are used rarely, and you have a lot, that is a lot of wasted gpu ressources . They can take about 10-40 secs to load . Hot swapping would be ideal, but as a fallback, unloading least used models to keep enough VMEM free to load any model on request . Tricky issue!

one year ago