For ClearML Serving, If I Am Trying To Deploy 100 Models On A GPU That Can Handle 5 Concurrently, But Each One Will Be Sporadically Used (Fine-Tuned Models Trained For Different Customers), Can ClearML-Serving Automatically Load And Unload Models Based Upon Usage?
I checked Triton for prior art. It appears that "they sell that" as Triton Management Service, part of NVIDIA AI Enterprise. It is possible to do through Triton's own API, but loading and unloading would need to be explicit (see the sketches after the list below). Moreover, there are likely a few different algorithms that could be used to maximize usage and minimize downtime. It would be nice to have at least a simple algorithm baked into ClearML for serving models at a smallish scale, such as:
- Assume:
  - All models are the same size when loaded
  - The max number of instances of an individual model is 1
- Config:
  - Number of seconds to assess usage over (rule of thumb -> 5x model loading time?)
  - Auto-unload a model if it has not been used for x minutes (default 5?)
  - Number of models that need to be unloaded within x minutes before a new auto-scaled instance is added (default 5?)
- Algorithm (sketched in code below):
  - Load the model with the largest number of elements in its queue, and only pull in one at a time
  - If there is not enough space, unload the model with the oldest "last inference" time, provided that was more than n (60?) seconds ago
  - Else, unload the model that has an empty queue and the fewest incoming requests over the past n (60?) seconds
  - If the frequency of unloading models exceeds the threshold, add another auto-scaled instance
  - If the loaded models can fit on fewer instances than are currently scaled, gracefully consolidate
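For reference, explicit loading and unloading through Triton's own API looks roughly like this. A minimal sketch, assuming a Triton server started with `--model-control-mode=explicit`; the model name is hypothetical:

```python
# Minimal sketch of Triton's explicit model-control API.
# Assumes tritonserver was started with --model-control-mode=explicit;
# "customer_42_model" is a hypothetical per-customer fine-tuned model.
from tritonclient.http import InferenceServerClient

client = InferenceServerClient(url="localhost:8000")

client.load_model("customer_42_model")       # pull the model onto the GPU
assert client.is_model_ready("customer_42_model")
# ... run inference while it is resident ...
client.unload_model("customer_42_model")     # free the GPU memory again
```

These calls are the primitive that any auto-load/unload policy would have to drive.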
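And a rough sketch of the policy itself, just to make the proposal concrete. All class names, fields, and defaults here are hypothetical, and the actual load/unload operations are placeholders for whatever the serving backend exposes (e.g. the Triton calls above):

```python
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ModelState:
    name: str
    loaded: bool = False
    queue_depth: int = 0                  # pending inference requests
    last_inference: float = 0.0           # timestamp of most recent request served
    request_times: deque = field(default_factory=deque)  # recent request arrivals

class UnloadPolicy:
    def __init__(self, capacity=5, idle_unload_s=300, stale_s=60,
                 window_s=60, unloads_before_scale_up=5):
        self.capacity = capacity                      # models that fit concurrently
        self.idle_unload_s = idle_unload_s            # auto-unload after x idle minutes
        self.stale_s = stale_s                        # "oldest last-inference" cutoff (n)
        self.window_s = window_s                      # usage-assessment window
        self.unloads_before_scale_up = unloads_before_scale_up
        self.models = {}                              # name -> ModelState
        self.recent_unloads = deque()                 # timestamps of recent unloads

    def _loaded(self):
        return [m for m in self.models.values() if m.loaded]

    def step(self, now=None):
        """One scheduling pass; returns a scaling hint for the autoscaler."""
        if now is None:
            now = time.time()

        # Auto-unload anything idle past the cutoff.
        for m in self._loaded():
            if m.queue_depth == 0 and now - m.last_inference > self.idle_unload_s:
                self._unload(m, now)

        # Load the unloaded model with the deepest queue -- only one per pass.
        waiting = [m for m in self.models.values()
                   if not m.loaded and m.queue_depth > 0]
        if waiting:
            hungriest = max(waiting, key=lambda m: m.queue_depth)
            if len(self._loaded()) >= self.capacity:
                self._evict_one(now)
            if len(self._loaded()) < self.capacity:
                hungriest.loaded = True   # placeholder for the backend's load call

        # Scaling hints: too much churn -> scale up; lots of headroom -> consolidate.
        while self.recent_unloads and now - self.recent_unloads[0] > self.window_s:
            self.recent_unloads.popleft()
        if len(self.recent_unloads) >= self.unloads_before_scale_up:
            return "scale_up"
        if len(self._loaded()) <= self.capacity // 2:
            return "consolidate"
        return "steady"

    def _evict_one(self, now):
        loaded = self._loaded()
        # Prefer the model with the oldest "last inference" time, if stale enough...
        oldest = min(loaded, key=lambda m: m.last_inference)
        if now - oldest.last_inference > self.stale_s:
            self._unload(oldest, now)
            return
        # ...otherwise the empty-queue model with the fewest recent requests.
        idle = [m for m in loaded if m.queue_depth == 0]
        if idle:
            fewest = min(idle, key=lambda m: sum(
                1 for t in m.request_times if now - t <= self.window_s))
            self._unload(fewest, now)

    def _unload(self, m, now):
        m.loaded = False                  # placeholder for the backend's unload call
        self.recent_unloads.append(now)
```

The point is just that the whole policy fits in well under a hundred lines, so even a simple version baked into clearml-serving would go a long way for this kind of many-models, sporadic-traffic setup.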