Answered
Hi Everyone, I'm Using ClearML-Serving With Triton And Have A Couple Of Questions Regarding Model Management:

Hi everyone,
I'm using ClearML-Serving with Triton and have a couple of questions regarding model management:

  • Once a model is loaded into GPU memory for the first time, does it stay loaded across subsequent requests, or does it get unloaded and reloaded with each new request?
  • Are there configuration options available that allow us to control this behavior?

Any insights or guidance on how to configure these settings would be greatly appreciated to help optimize our resource usage.
Thank you!
  
  
Posted 2 months ago

Answers 9


Hi @<1690896098534625280:profile|NarrowWoodpecker99>

Once a model is loaded into GPU memory for the first time, does it stay loaded across subsequent requests,

Yes, it does.

Are there configuration options available that allow us to control this behavior?

I'm assuming you're thinking of dynamically loading/unloading models from memory based on requests?
I wish Triton added that 🙂 It is not trivial, and in reality, to be fast enough, the model has to live in CPU RAM and then be moved to the GPU (which actually takes a while).

  
  
Posted 2 months ago

Maybe by combining the two: with an unload gRPC API we could move that ability into the "preprocessing" logic, wdyt?
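
Very roughly, something like this sketch (purely illustrative - the Preprocess class shape follows the clearml-serving preprocessing examples, but wiring a Triton gRPC client into it this way is an assumption, not an existing feature; the URL and model name are placeholders):

import tritonclient.grpc as grpcclient


class Preprocess(object):
    def __init__(self):
        # hypothetical: point a gRPC client at the Triton engine container
        self._triton = grpcclient.InferenceServerClient(url="clearml-serving-triton:8001")
        self._model_name = "my_model"  # placeholder

    def preprocess(self, body, state, collect_custom_statistics_fn=None):
        # make sure the model is in GPU memory before the request hits Triton
        # (load_model() can be slow, so the request may still hit its timeout)
        if not self._triton.is_model_ready(self._model_name):
            self._triton.load_model(self._model_name)
        return body

    def postprocess(self, data, state, collect_custom_statistics_fn=None):
        # optionally unload right after the request to free GPU memory again
        self._triton.unload_model(self._model_name)
        return data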

  
  
Posted 2 months ago

Thanks for asking about this - I have the exact same issue. Could the Triton model management API be used to load/unload the models?
https://github.com/triton-inference-server/server/issues/5345

  
  
Posted 2 months ago

Hi @<1713001673095385088:profile|EmbarrassedWalrus44>
So Triton does have load/unload model APIs, but these are slowwww, meaning you cannot use them inside a request (you'll just hit the request timeout every time it tries to load the model).
As you can see in that issue, this is classified as a "wish-list" item. It is not trivial to implement and requires a lot of CPU RAM to store the entire model, so "loading" becomes moving the model from CPU to GPU memory (which is also not the fastest, but the best you can do). As far as I understand there is no "target date" for this feature 😞
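
For reference, a minimal sketch of what driving Triton's explicit model-control API from Python looks like, assuming tritonserver was started with --model-control-mode=explicit and is reachable on localhost:8001 (the model name is a placeholder):

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Load the model from the repository into memory - this is the slow part:
# weights are read from disk / CPU RAM and copied to the GPU.
client.load_model("my_model")
print("loaded:", client.is_model_ready("my_model"))

# Unload it again to free GPU memory.
client.unload_model("my_model")

The load_model() call is the part that can take tens of seconds for a large model, which is why it cannot sit inside the request path.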

  
  
Posted 2 months ago

Hi Martin. Thanks for the answer. Ah, so the delay in loading/unloading causes a timeout. That speed depends on model sizes, right?

As a workaround, how about a simpler approach of unloading the least-used models after X minutes of sitting unused - enough to free up memory for any model to load? Hope that makes sense. This would not work under heavy loads, but e.g. we have models that are used only once a week. They would just stay unloaded until used - and could be unloaded again afterwards.
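
For example, a rough sketch of that idea as a background eviction loop (assuming Triton runs with --model-control-mode=explicit; the timeout, URL and model names below are made-up placeholders, not ClearML-Serving configuration):

import time
import tritonclient.grpc as grpcclient

IDLE_TIMEOUT_SEC = 15 * 60                 # unload anything unused for 15 minutes
last_used = {"rare_model_a": 0.0,          # model name -> last request timestamp
             "rare_model_b": 0.0}

client = grpcclient.InferenceServerClient(url="localhost:8001")


def mark_used(model_name):
    # call this from the request path whenever a model serves a request
    last_used[model_name] = time.time()


def evict_idle_models():
    # unload models that have sat idle longer than the timeout
    now = time.time()
    for name, ts in last_used.items():
        if ts and now - ts > IDLE_TIMEOUT_SEC and client.is_model_ready(name):
            client.unload_model(name)


while True:                                # background eviction loop
    evict_idle_models()
    time.sleep(60)

The hand-wavy part is mark_used(): it would have to be called from wherever requests are actually routed.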

  
  
Posted 2 months ago

That speed depends on model sizes, right?

In general, yes.

Hope that makes sense. This would not work under heavy loads, but e.g. we have models that are used only once a week. They would just stay unloaded until used - and could be unloaded again afterwards.

But then you might still encounter a timeout the first time you access them, no?

  
  
Posted 2 months ago

Models that fit into around 8-24 GB of memory are quite common, at least here. If they are used rarely, and you have a lot of them, that is a lot of wasted GPU resources. They can take about 10-40 seconds to load. Hot-swapping would be ideal, but as a fallback, unloading the least-used models would keep enough VRAM free to load any model on request. Tricky issue!

  
  
Posted 2 months ago

Unless you set a very long timeout. Usually all models load in less than 1 minute, smaller ones much faster. It would not work for huge LLM-style models.

  
  
Posted 2 months ago

... It would not work for huge LLM-style models.

Yes, I agree... but then if the model is small enough, you can just keep it in memory...

  
  
Posted 2 months ago