Unanswered
			
			
 
			
	
		
			
		
		
		
		
	
			
		
		For Clearml Serving, If I Am Trying To Deploy 100 Models On A Gpu That Can Handle 5 Concurrently, But Each One Will Be Sporadically Used (Fine Tuned Models Trained For Different Customers), Can Clearml-Serving Automatically Load And Unload Models Based Up
- Triton server does not support saving models off to normal RAM for faster loading/unloadingCorrect, the enterprise version also does not support RAM caching
Therefore, currently, we can deploy 100 models when only 5 can be concurrently loaded, but when they are unloaded/loaded (automatically by ClearML), it will take a few seconds because it is being read from the the SSD, depending on the size.
Correct, there is also deserializing CPU time (imaging unpickling 20GB file, this takes time... and actually this is the main bottle neck not just IO)
318 Views
				0
Answers
				
					 
	one year ago
				
					
						 
	one year ago
					
			