Hi @<1523701205467926528:profile|AgitatedDove14> , Thanks for answering, but that's not what I meant. Suppose I have three models that can't all be loaded into GPU memory simultaneously (there isn't enough GPU RAM for all of them at the same time). What I have in mind is this: is there an automatic way to unload a model (for example, if a model hasn't been run in the last 10 minutes, or something similar)? Or, if there is no such automatic method, can we manually unload a model from GPU memory to free up space for other models? (I know Triton has an endpoint for doing this, but I don't know whether it's possible to access that endpoint via ClearML.)
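For reference, this is the Triton model-repository control API I mean. Just a minimal sketch, assuming direct access to the Triton container's HTTP port (default 8000), that Triton was started with `--model-control-mode=explicit`, and that `"model_A"` is a placeholder model name:

```python
import requests

# Triton's model-repository control API (requires --model-control-mode=explicit).
# Assumes the Triton HTTP endpoint is reachable at localhost:8000; adjust as needed.
TRITON_URL = "http://localhost:8000"

def unload_model(name: str) -> None:
    # Ask Triton to unload the model and free its GPU memory.
    resp = requests.post(f"{TRITON_URL}/v2/repository/models/{name}/unload")
    resp.raise_for_status()

def load_model(name: str) -> None:
    # Ask Triton to (re)load the model into GPU memory.
    resp = requests.post(f"{TRITON_URL}/v2/repository/models/{name}/load")
    resp.raise_for_status()

if __name__ == "__main__":
    unload_model("model_A")  # "model_A" is a placeholder model name
```

My question is basically whether something like this can be triggered through ClearML, since I don't control the Triton instance directly.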
I don't want the model to be completely removed from my endpoints. Say we have endpoint A and model A has been unloaded from memory. If we receive a request for A again, it should be loaded back into memory if there is enough space. If there isn't enough room, we could then decide which model to unload (say model B) to make room for model A.
For now, this is the behavior I observe: Suppose I have two models, A and B.
- When ClearML is started, the GPU memory usage is almost 0.
- Then, upon the first request to endpoint A, Model A is loaded into GPU memory and remains there. At this point, Model B is not loaded.
- If we then send a request to Model B, it is loaded into memory too.
However, there is no way for me to unload Model A. Consequently, if there is another model, say Model C, it can't be loaded because we run out of memory.