Hmm, is this part of Triton's gRPC interface? If it is, we should be able to add that quite easily.
Thank you for your answer. I added hundreds of models to the serving session, and when I send a POST request it loads the requested model to perform inference. I would like to be able to send a request to unload a model (because I cannot fit all the models on the GPU, only 7-8), or, as @<1690896098534625280:profile|NarrowWoodpecker99> suggests, add a timeout? Or unload all the models when GPU memory reaches a limit? Do you have a suggestion on how I could achieve that? Thanks!
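For what it's worth, here is a rough sketch of the "keep only 7-8 models loaded" idea as a least-recently-used pool. This is not something ClearML Serving provides as far as this thread shows: `LruModelPool`, `acquire`, and the `load`/`unload` callbacks are hypothetical names made up for illustration, and the pool only tracks names and decides what to evict.

```python
from collections import OrderedDict
from typing import Callable, List


class LruModelPool:
    """Hypothetical helper: keep at most `capacity` models loaded (LRU eviction)."""

    def __init__(
        self,
        capacity: int,
        load: Callable[[str], None],
        unload: Callable[[str], None],
    ) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._load = load      # assumed callback that loads a model by name
        self._unload = unload  # assumed callback that unloads a model by name
        self._loaded: "OrderedDict[str, None]" = OrderedDict()

    def acquire(self, model_name: str) -> List[str]:
        """Make sure `model_name` is loaded; return the names evicted to make room."""
        if model_name in self._loaded:
            # Already loaded: just mark it as most recently used.
            self._loaded.move_to_end(model_name)
            return []
        evicted = []
        # Evict least recently used models until there is a free slot.
        while len(self._loaded) >= self._capacity:
            oldest, _ = self._loaded.popitem(last=False)
            self._unload(oldest)
            evicted.append(oldest)
        self._load(model_name)
        self._loaded[model_name] = None
        return evicted

    def loaded(self) -> List[str]:
        """Names of loaded models, least recently used first."""
        return list(self._loaded)
```

If Triton is running with explicit model control, the callbacks could be the `load_model` and `unload_model` methods of `tritonclient.grpc.InferenceServerClient`. Whether the Triton instance started by the serving session is configured that way is an assumption I have not checked.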
Hi @<1683648242530652160:profile|ApprehensiveSeaturtle9>
I send a request to the endpoint, but the models are never unloaded (GPU memory keeps increasing each time I run inference with a new model).
They are not unloaded after the request is done. see discussion here: None
You can, however, remove the model from the serving session (but I do not think this is what you meant).
I'm assuming you want to run multiple models on a single GPU that does not have enough memory for all of them?