For Clearml Serving, If I Am Trying To Deploy 100 Models On A Gpu That Can Handle 5 Concurrently, But Each One Will Be Sporadically Used (Fine Tuned Models Trained For Different Customers), Can Clearml-Serving Automatically Load And Unload Models Based Up

Answered

For ClearML Serving, if I am trying to deploy 100 models on a GPU that can handle 5 concurrently, but each one will be used only sporadically (fine-tuned models trained for different customers), can clearml-serving automatically load and unload models based on usage, or will I have to manage the process manually? To my understanding, this would mean that when a customer runs inference on their model, there may be about 5 seconds of latency for the first inference, but subsequent requests would be fast (unless more than 5 customers are trying to access their models at the same time 🙂).

  				
Posted 2 years ago by StrangePelican34

Answers 7

I checked Triton and found these references:

[link]
[link]

It appears that "they sell that" as Triton Management Service, part of [link]. It is possible to do through their API, but it would need to be explicit. Moreover, there are likely a few different algorithms that could be used to maximize usage and minimize downtime. It would be nice to have at least a simple algorithm baked into ClearML for serving models at a smallish scale, such as:

- Assume:
  - All models are the same size when loaded
  - The max number of instances of an individual model is 1
- Config:
  - Number of seconds to assess usage over (rule of thumb -> 5x model loading time?)
  - Auto-unload a model if it has not been used for x minutes (default 5?)
  - Number of models that must be unloaded before the x-minute idle timeout to trigger adding a new auto-scaled instance (default 5?)
- Load in the model with the largest number of elements in its queue, and only pull in one at a time
  - If there is not enough space, unload the model with the oldest "last inference" time if it is over n (60?) seconds ago
  - Else, unload the model that has an empty queue and also has the fewest incoming requests over the past n (60?) seconds
- If the frequency of unloading models is greater than the threshold, add another auto-scaled instance
- If the loaded models can fit on fewer instances than are currently scaled, gracefully consolidate
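The load/unload part of the steps above can be sketched roughly as follows. This is only an illustration of the proposed policy, not anything that exists in ClearML or Triton: `ModelState`, `pick_model_to_load`, `pick_model_to_unload`, and `step` are hypothetical names, and resetting `last_inference` when a model is loaded is an added assumption so that a freshly loaded model is not evicted right away.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ModelState:
    """Bookkeeping for one model (all field names are hypothetical)."""
    name: str
    loaded: bool = False
    queue_len: int = 0            # requests currently waiting for this model
    last_inference: float = 0.0   # timestamp of the most recent inference
    recent_requests: int = 0      # requests seen during the assessment window


def pick_model_to_load(models: Dict[str, ModelState]) -> Optional[str]:
    """Return the unloaded model with the longest non-empty queue, if any."""
    waiting = [m for m in models.values() if not m.loaded and m.queue_len > 0]
    if not waiting:
        return None
    return max(waiting, key=lambda m: m.queue_len).name


def pick_model_to_unload(models: Dict[str, ModelState], now: float,
                         min_idle_s: float = 60.0) -> Optional[str]:
    """Choose a loaded model to evict, following the two rules in the list."""
    loaded = [m for m in models.values() if m.loaded]
    if not loaded:
        return None
    # Rule 1: the oldest "last inference", if it is more than min_idle_s ago.
    oldest = min(loaded, key=lambda m: m.last_inference)
    if now - oldest.last_inference > min_idle_s:
        return oldest.name
    # Rule 2: otherwise, an empty-queue model with the fewest recent requests.
    idle = [m for m in loaded if m.queue_len == 0]
    if idle:
        return min(idle, key=lambda m: m.recent_requests).name
    return None


def step(models: Dict[str, ModelState], capacity: int, now: float,
         min_idle_s: float = 60.0) -> List[str]:
    """Run one scheduling step (at most one load) and return the actions taken."""
    actions: List[str] = []
    target = pick_model_to_load(models)
    if target is None:
        return actions
    if sum(m.loaded for m in models.values()) >= capacity:
        victim = pick_model_to_unload(models, now, min_idle_s)
        if victim is None:
            return actions  # every loaded model is busy; the caller may scale out
        models[victim].loaded = False
        actions.append(f"unload {victim}")
    models[target].loaded = True
    # Assumption: count the load itself as activity to avoid immediate eviction.
    models[target].last_inference = now
    actions.append(f"load {target}")
    return actions
```

The auto-scaling rules (adding an instance when unloads are too frequent, consolidating when models fit on fewer instances) are left out; an empty result from `step` while requests are still queued is the point where that logic would hook in.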

  				
Posted 2 years ago by StrangePelican34

That is great to hear! Is there any documentation on how it works, and if it can be configured?

  				
Posted 2 years ago by StrangePelican34

> It appears that "they sell that" as Triton Management Service, part of [link]. It is possible to do through their API, but would need to be explicit.

We support that, but this is not dynamic loading. This is just removing and adding models; this does not unload them from the GPU memory (GRAM).
That's the main issue. When we unload the model, it is unloaded. To do this dynamically, they need to be able to save it in RAM and unload it from GRAM, and that's the feature that is missing from all Triton deployments.
Does that make sense?
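For reference, the explicit add/remove calls discussed here go through Triton's model repository HTTP extension (`POST v2/repository/models/<name>/load` and `.../unload`), which requires the server to be started with `--model-control-mode=explicit`. Below is a minimal sketch that talks to Triton directly rather than through ClearML Serving; the helper names, base URL, and model name are assumptions for illustration only.

```python
import json
import urllib.request
from urllib.parse import quote


def repository_url(base_url: str, model_name: str, action: str) -> str:
    """Build the model-repository endpoint for a 'load' or 'unload' request."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action}")
    return f"{base_url.rstrip('/')}/v2/repository/models/{quote(model_name)}/{action}"


def control_model(base_url: str, model_name: str, action: str,
                  timeout: float = 60.0) -> int:
    """POST a load/unload request to Triton and return the HTTP status code."""
    req = urllib.request.Request(
        repository_url(base_url, model_name, action),
        data=json.dumps({}).encode("utf-8"),   # empty JSON body, no parameters
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

With a server listening on the default HTTP port, `control_model("http://localhost:8000", "customer_42", "load")` would ask Triton to load a (hypothetical) model named `customer_42` from its model repository. As noted above, this is a full load from storage, not a swap between RAM and GPU memory.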

  				
Posted 2 years ago by AgitatedDove14

Let's see if I understand:

- Triton server deployments only support manual, static deployment of models for inference (without enterprise)
- ClearML can load and unload models based on usage, but has to do so from the hard drive
- Triton server does not support saving models off to normal RAM for faster loading/unloading

Therefore, currently, we can deploy 100 models when only 5 can be loaded concurrently, but when they are unloaded/loaded (automatically by ClearML), it will take a few seconds, depending on the model size, because the model is read from the SSD.
If this is the case, that should be acceptable for our application.

  				
Posted 2 years ago by StrangePelican34

If ClearML does not implement this, we may have to do it ourselves: [link].

  				
Posted 2 years ago by StrangePelican34

Hi @StrangePelican34

> if I am trying to deploy 100 models on a GPU that can handle 5 concurrently,

The main limitation is Triton's ability to dynamically load/unload models. We know Nvidia is adding this capability, but I think it is still not out. Once they support it, it should be transparent.

  				
Posted 2 years ago by AgitatedDove14

> Triton server does not support saving models off to normal RAM for faster loading/unloading

Correct, the enterprise version also does not support RAM caching.

> Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few seconds because it is being read from the SSD, depending on the size.

Correct, and there is also the CPU time for deserializing (imagine unpickling a 20GB file, this takes time... and actually this is the main bottleneck, not just IO).
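To see how load time splits between disk reads and deserialization for a given model file, the two phases can be timed separately. This is a rough sketch using a plain pickle file as a stand-in; `time_load` is a hypothetical helper, real model formats will behave differently, and a file already in the OS page cache will make the read phase look faster than a cold read.

```python
import os
import pickle
import tempfile
import time


def time_load(path: str):
    """Return (object, read_seconds, deserialize_seconds) for a pickle file."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()               # disk I/O only
    t1 = time.perf_counter()
    obj = pickle.loads(raw)          # CPU-bound deserialization only
    t2 = time.perf_counter()
    return obj, t1 - t0, t2 - t1


if __name__ == "__main__":
    # Stand-in "model": nested lists of floats, written to a temporary file.
    weights = [[float(i + j) for j in range(300)] for i in range(300)]
    with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
        pickle.dump(weights, tmp)
        path = tmp.name
    try:
        _, read_s, deser_s = time_load(path)
        print(f"read: {read_s:.4f}s, deserialize: {deser_s:.4f}s")
    finally:
        os.remove(path)
```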

  				
Posted 2 years ago by AgitatedDove14
