Thanks, @<1523701205467926528:profile|AgitatedDove14>, for your feedback. Actually, I've been working with TRT-LLM since day zero of its launch. It is very good for LLMs. However, I haven't had the chance to check the trtllm-backend, as I'm waiting for some features there. I'm planning to use it and examine it, and I will try to provide any feedback I have on that. But before doing so, I guess I need to become more familiar with the internals of ClearML.
By the way, thanks for the feedback, and I will try to get back to you soon.
Hi @<1523701205467926528:profile|AgitatedDove14>, thanks for answering, but that's not what I meant. Suppose that I have three models that can't be loaded simultaneously into GPU memory (since there is not enough GPU RAM for all of them at the same time). What I have in mind is this: is there an automatic way to unload a model (for example, if a model hasn't been run in the last 10 minutes, or something similar)? Or, if there is no such automatic method, can we manually unload the model from GPU memory to free up space for other models? (I know there is an endpoint for doing so in Triton, but I don't know whether it is possible to access that endpoint via ClearML.)
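For reference, the Triton endpoint mentioned above belongs to Triton's model repository HTTP API (`POST /v2/repository/models/<name>/load` and `.../unload`), which only works when the server runs with `--model-control-mode=explicit`. Below is a minimal sketch of calling it directly. The base URL and model name are hypothetical, and whether clearml-serving exposes the Triton HTTP port at all is an assumption this thread does not confirm.

```python
import urllib.request


def model_control_url(base_url: str, model_name: str, action: str) -> str:
    """Build the Triton model-repository URL for a 'load' or 'unload' action."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    return f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/{action}"


def control_model(base_url: str, model_name: str, action: str,
                  timeout: float = 30.0) -> int:
    """POST the load/unload request and return the HTTP status code.

    Assumes the Triton HTTP port is reachable at base_url (hypothetical)
    and that Triton was started with --model-control-mode=explicit.
    """
    request = urllib.request.Request(
        model_control_url(base_url, model_name, action),
        data=b"{}",  # empty JSON body; no extra parameters are sent
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Usage would look like `control_model("http://localhost:8000", "model_a", "unload")`, where both the address and the model name are placeholders.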
I don't want it to be completely removed from my endpoints. Please suppose we have endpoint A; then the A model will be unloaded from memory. If we receive a request for A again, it will be loaded back into memory if there is enough space. If there isn't enough room, we can then assess which model to unload (suppose it is model B and we will unload it) to make room for model A.
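The behavior described above amounts to a least-recently-used (LRU) policy with an optional idle timeout. Here is a minimal sketch of that bookkeeping, not how ClearML or Triton implement it: `ModelLRU`, `load_fn`, `unload_fn`, and the per-model sizes in MB are all hypothetical names and inputs chosen for illustration.

```python
import time
from collections import OrderedDict
from typing import Callable, List


class ModelLRU:
    """Tracks which models are resident and evicts the least recently used."""

    def __init__(self, capacity_mb: float,
                 load_fn: Callable[[str], None],
                 unload_fn: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity_mb = capacity_mb
        self.load_fn = load_fn        # hypothetical hook, e.g. a Triton "load" call
        self.unload_fn = unload_fn    # hypothetical hook, e.g. a Triton "unload" call
        self.clock = clock
        # name -> (size_mb, last_used); order runs from least to most recently used
        self._resident: "OrderedDict[str, tuple]" = OrderedDict()

    def used_mb(self) -> float:
        return sum(size for size, _ in self._resident.values())

    def request(self, name: str, size_mb: float) -> List[str]:
        """Ensure `name` is resident; return the models evicted to make room."""
        if name in self._resident:
            size, _ = self._resident[name]
            self._resident[name] = (size, self.clock())
            self._resident.move_to_end(name)  # mark as most recently used
            return []
        if size_mb > self.capacity_mb:
            raise ValueError(f"{name} does not fit in GPU memory at all")
        evicted = []
        while self.used_mb() + size_mb > self.capacity_mb:
            victim, _ = self._resident.popitem(last=False)  # least recently used
            self.unload_fn(victim)
            evicted.append(victim)
        self.load_fn(name)
        self._resident[name] = (size_mb, self.clock())
        return evicted

    def evict_idle(self, max_idle_s: float) -> List[str]:
        """Unload every model not requested within the last `max_idle_s` seconds."""
        now = self.clock()
        idle = [n for n, (_, t) in self._resident.items() if now - t >= max_idle_s]
        for name in idle:
            del self._resident[name]
            self.unload_fn(name)
        return idle
```

With models A and B resident and A used most recently, a request for C that does not fit evicts B first. Calling `evict_idle(600)` periodically would cover the "not run in the last 10 minutes" case.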
For now, this is the behavior I observe: Suppose I have two models, A and B.
- When ClearML is started, the GPU memory usage is almost 0.
- Then, upon the first request to endpoint A, Model A is loaded into the GPU memory and remains there. At this point, Model B is not loaded.
- If we then send a request to Model B, it will be loaded into the memory too.
However, there is no way for me to unload Model A. Consequently, if there is another model, say Model C, it can't be loaded since we run out of memory.
@<1523701205467926528:profile|AgitatedDove14> That is awesome. Could you please point me to the branch you are working on, or a specific commit that would help me understand how you are implementing it? Honestly, I want to get familiar with it and, if possible, contribute to the project.
@<1657918706052763648:profile|SillyRobin38> out of curiosity, did you compare the performance of tensorrt-llm vs vllm?
(the jury is still out on that, just wondered if you had a chance)
@<1523701205467926528:profile|AgitatedDove14> No, I didn't do that, but if I'm not mistaken, about a month ago I saw some users on Reddit comparing it. They observed that TRT-LLM outperforms all kinds of leading backends, including VLLM. I will try to find it and paste it here.
Hi @<1657918706052763648:profile|SillyRobin38>
Hi everyone, I wanted to inquire if it's possible to have some type of model unloading.
What do you mean by "unloading"? Do you mean removing it from the clearml-serving endpoint?
If you mean removing it from clearml-serving, then yes, you can do that online:
None
Thanks @<1657918706052763648:profile|SillyRobin38> this is still in the internal git repo (we usually do not develop directly on github)
I want to get familiar with it and, if possible, contribute to the project.
This is a good place to start: None
we are still debating whether to use it directly or as part of Triton ( None ), would love to get your feedback
Suppose that I have three models and these models can't be loaded simultaneously on GPU memory(
Oh!!!
For now, this is the behavior I observe: Suppose I have two models, A and B. ....
Correct
Yes, this is a current limitation of the Triton backend, BUT!
we are working on a new version that does exactly what you mentioned (it is a very common case that some models are not used that frequently)
The main caveat is the loading time: re-loading models from disk takes way too much time at the moment (meaning you might get a timeout on the request), and we are trying to accelerate the process (for example, by caching the model in RAM instead of GPU memory). But we have made good progress, and I'm sure the next version will be able to address that