Answered

Hi everyone, I wanted to inquire if it's possible to have some type of model unloading. I know there was a discussion here about it, but after reviewing it, I didn't find an answer. So, I am curious: Is it possible to explicitly unload a model (by calling the endpoint) or, preferably, to do it automatically? My problem is that I have several models, and I don't want all of them to be in the GPU memory at the same time.

  
  
Posted 4 months ago

Answers 8


Hi @<1657918706052763648:profile|SillyRobin38>

Hi everyone, I wanted to inquire if it's possible to have some type of model unloading.

What do you mean by "unloading"? Do you mean removing it from the clearml-serving endpoint?
If this is from clearml-serving, then yes, you can do it online:
None

  
  
Posted 4 months ago

Hi @<1523701205467926528:profile|AgitatedDove14> , thanks for answering, but that's not what I meant. Suppose I have three models that can't be loaded into GPU memory simultaneously (there isn't enough GPU RAM for all of them at once). What I have in mind is this: is there an automatic way to unload a model (for example, if a model hasn't been run in the last 10 minutes, or something similar)? Or, if there is no such automatic method, can we manually unload a model from GPU memory to free up space for other models? (I know there is an endpoint for doing so in Triton, but I don't know if it's possible to access that endpoint via ClearML.)
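For reference, the Triton endpoint mentioned above is part of Triton's model repository (model control) HTTP API, available when the server runs with explicit model control. This is a small sketch of how one might build and call it; the host, port, and model name here are illustrative assumptions, not values from this thread:

```python
# Sketch, assuming a Triton server started with --model-control-mode=explicit.
# Triton's repository API exposes POST v2/repository/models/<name>/load and /unload.

def triton_model_control_url(host: str, port: int, model: str, action: str) -> str:
    """Build the URL for Triton's explicit model load/unload endpoint."""
    assert action in ("load", "unload")
    return f"http://{host}:{port}/v2/repository/models/{model}/{action}"

# Actually sending the request needs a running Triton server, e.g.:
#   import requests
#   requests.post(triton_model_control_url("localhost", 8000, "model_a", "unload"))
print(triton_model_control_url("localhost", 8000, "model_a", "unload"))
```

Whether clearml-serving forwards or wraps this endpoint is exactly the open question in this thread.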

I don't want it to be completely removed from my endpoints. Suppose we have endpoint A; model A gets unloaded from memory. If we receive a request for A again, it is loaded back into memory if there is enough space. If there isn't enough room, we then assess which model to unload (say model B, which we unload) to make room for model A.

For now, this is the behavior I observe: Suppose I have two models, A and B.

  • When ClearML is started, the GPU memory usage is almost 0.
  • Then, upon the first request to endpoint A, Model A is loaded into the GPU memory and remains there. At this point, Model B is not loaded.
  • If we then send a request to Model B, it will be loaded into memory too.
    However, there is no way for me to unload Model A. Consequently, if there is another model, say Model C, it can't be loaded since we run out of memory.
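The behavior being asked for above is essentially least-recently-used eviction over resident models. A minimal pure-Python sketch of that policy (the `load_fn` callable stands in for the expensive, hypothetical GPU load; it is not a ClearML or Triton API):

```python
from collections import OrderedDict

class LRUModelCache:
    """Sketch of the desired behavior: keep at most `capacity` models resident;
    loading one more evicts the least-recently-used model first."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # stand-in for the expensive GPU load
        self._resident = OrderedDict()  # name -> loaded model object

    def get(self, name):
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # "unload" the LRU model
        self._resident[name] = self.load_fn(name)
        return self._resident[name]

cache = LRUModelCache(capacity=2, load_fn=lambda n: f"<model {n}>")
cache.get("A"); cache.get("B"); cache.get("A"); cache.get("C")
print(list(cache._resident))  # B was least recently used, so it was evicted
```

Running this prints `['A', 'C']`: requesting C evicts B, because A was touched more recently.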
  
  
Posted 4 months ago

Thanks @<1657918706052763648:profile|SillyRobin38> , this is still in the internal git repo (we usually do not develop directly on GitHub).

I want to get familiar with it and, if possible, contribute to the project.

This is a good place to start: None
we are still debating whether to use it directly or as part of Triton ( None ), would love to get your feedback

  
  
Posted 4 months ago

@<1523701205467926528:profile|AgitatedDove14> That is awesome. Could you please point me to the branch you are working on, or a specific commit, so I can see how you are implementing it? Honestly, I want to get familiar with it and, if possible, contribute to the project.

  
  
Posted 4 months ago

Thanks, @<1523701205467926528:profile|AgitatedDove14> , for your feedback. Actually, I've been working with TRT-LLM since day zero of its launch. It is very good for LLMs; however, I haven't had the chance to check the trtllm-backend, as I'm waiting for some features there. I'm planning to use and examine it, though, and I will try to provide any feedback I have on that. But before doing so, I need to become more familiar with the internals of ClearML, I guess.

By the way, thanks for the feedback, and I will try to get back to you soon.

  
  
Posted 4 months ago

@<1657918706052763648:profile|SillyRobin38> out of curiosity, did you compare the performance of tensorrt-llm vs vllm?
(the jury is still out on that, just wondered if you had a chance)

  
  
Posted 4 months ago

@<1523701205467926528:profile|AgitatedDove14> No, I didn't do that, but if I'm not mistaken, about a month ago I saw some users on Reddit comparing them. They observed that TRT-LLM outperforms all kinds of leading backends, including vLLM. I will try to find it and paste it here.

  
  
Posted 4 months ago

Suppose that I have three models and these models can't be loaded simultaneously on GPU memory

Oh!!!

For now, this is the behavior I observe: Suppose I have two models, A and B. ....

Correct

Yes, this is a current limitation of the Triton backend, BUT!
We are working on a new version that does exactly what you mentioned (because it is such a common case where models are not being used that frequently).
The main caveat is loading time: re-loading models from disk currently takes way too long (meaning you might get a timeout on the request), and we are trying to accelerate the process (for example, by caching the model in RAM instead of GPU memory). But we have made good progress, and I'm sure the next version will be able to address that.
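The RAM-caching idea described here can be sketched as a two-tier cache: models evicted from the GPU tier are parked in CPU RAM instead of being discarded, so bringing them back skips the slow disk reload. This is a toy illustration of the trade-off, not ClearML's actual implementation; all names and the single-slot capacity are assumptions:

```python
class TieredModelCache:
    """Toy sketch: GPU tier for inference, RAM tier for cheap re-loading."""

    def __init__(self, gpu_slots, load_from_disk):
        self.gpu_slots = gpu_slots
        self.load_from_disk = load_from_disk  # slow path (stand-in callable)
        self.gpu = {}  # name -> model resident in GPU memory
        self.ram = {}  # name -> model weights parked in CPU RAM

    def get(self, name):
        if name in self.gpu:
            return self.gpu[name], "gpu-hit"
        if len(self.gpu) >= self.gpu_slots:
            victim, model = self.gpu.popitem()  # demote a resident model to RAM
            self.ram[victim] = model
        if name in self.ram:
            self.gpu[name] = self.ram.pop(name)  # cheap promotion, no disk read
            return self.gpu[name], "ram-hit"
        self.gpu[name] = self.load_from_disk(name)  # slow cold load
        return self.gpu[name], "disk-load"

cache = TieredModelCache(gpu_slots=1, load_from_disk=lambda n: f"<{n}>")
print(cache.get("A")[1])  # disk-load
print(cache.get("B")[1])  # disk-load (A demoted to RAM)
print(cache.get("A")[1])  # ram-hit: restored from RAM, no disk reload
```

The third request shows the payoff: model A comes back from RAM rather than disk, which is the acceleration being described above.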

  
  
Posted 4 months ago