The models that fit into around 8-24Gb mem are quite common, at least here . If they are used rarely, and you have a lot, that is a lot of wasted gpu ressources . They can take about 10-40 secs to load . Hot swapping would be ideal, but as a fallback, unloading least used models to keep enough VMEM free to load any model on request . Tricky issue!
Unless you set a very long time out . Usually all models load in less than 1 min, smaller ones much faster . Would not work for huge llm style models .
Hi Martin . Thanks for the answer . Ah so the delay in unloading cause a timeout . That speed depends on model sizes, right?
As a workaround, how about more
simple approach of unloading of the least used models after X minutes of sitting unused - enough to free up memory for any model to load? Hope that makes sense . This would not work under heavy loads, but eg we have models used once a week only . They would just stay unloaded until use - and could be offloaded afterwards .
Thanks for asking about this - I have the exact same issue. Could the Triton model management API be used to load/unload the models?
https://github.com/triton-inference-server/server/issues/5345