As I understand it, vertical scaling means giving each container more resources to work with. This should always be possible in a k8s context, because you decide which types of machines go in your pool and you define the requirements for each container yourself 🙂 So if you want to set the container to use 10,000 CPUs, feel free! Unless you mean something else with this, in which case please counter!
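For illustration, here is a minimal sketch of what that per-container knob looks like in plain Kubernetes: the `resources` block on a container spec sets CPU/memory requests and limits, and the VM SKU itself comes from whatever node pool you attach to the cluster. The names, image, and values below are placeholders, not anything taken from the clearml-serving helm chart.

```yaml
# Hypothetical sketch: vertical scaling via per-container resource requests/limits.
# Names and values are placeholders; adapt to your own deployment or chart values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clearml-serving-inference      # placeholder name
spec:
  template:
    spec:
      containers:
        - name: serving-inference      # placeholder name
          image: allegroai/clearml-serving-inference:latest   # image assumed for illustration
          resources:
            requests:
              cpu: "2"                 # guaranteed CPU per pod
              memory: 4Gi
            limits:
              cpu: "4"                 # hard ceiling; raise these to scale vertically
              memory: 8Gi
```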
Usually those models are PyTorch, right? So, yeah, you should be able to; feel free to follow the PyTorch example if you want to know how 🙂
Prerequisites: PyTorch models require Triton engine support; please use docker-compose-triton.yml / docker-compose-triton-gpu.yml, or, if running on Kubernetes, the matching helm chart.
Thanks, my question is dumb indeed 🙂 Thanks for the reply!
Sure! This is an example of running a custom model. It basically boils down to defining a preprocess, process and postprocess function. The process function can contain anything, including just a basic call to Hugging Face to run inference 🙂
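As a rough illustration, here is a minimal sketch of such a custom-model module. The class name, method signatures, model name, and request fields are assumptions based on the description above (a preprocess, process and postprocess step); check the custom-model example in the clearml-serving repo for the exact interface the engine expects.

```python
# Hypothetical sketch of a custom-model module following the preprocess -> process -> postprocess pattern.
# Signatures and field names are illustrative only.
from typing import Any

from transformers import pipeline  # plain Hugging Face inference, no Triton involved


class Preprocess:
    def __init__(self):
        # Load the Hugging Face model once, on CPU; the model name is just an example.
        self._pipe = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    def preprocess(self, body: dict) -> Any:
        # Pull the raw text out of the request payload ("text" field is a placeholder).
        return body["text"]

    def process(self, data: Any) -> Any:
        # Anything can run here, including a basic Hugging Face call.
        return self._pipe(data)

    def postprocess(self, data: Any) -> dict:
        # Shape the model output into the JSON response.
        return {"predictions": data}
```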
I have not tested this myself, mind you, but I see no reason why it wouldn't work!
In fact, I think even Triton itself supports running on CPU these days, so you still have the option :)
That wasn't my intention! Not a dumb question, just a logical one 😄
I would like to know if it is possible to run any PyTorch model with the basic docker compose file? Without Triton?
Sorry to come back to this! Regarding the Kubernetes Serving helm chart, I can see horizontal scaling of docker containers. What about vertical scaling? Is it implemented? More specifically, where is the SKU of the VMs in use defined?
Sorry, I jumped the gun before I fully understood your question 🙂 So by the simple docker compose file, do you mean that you don't want to use the docker-compose-triton.yaml file, and so want to run the Hugging Face model on CPU instead of on Triton?
Or do you want to know whether the general docker compose version is able to handle a Hugging Face model?
In production, we should use the clearml-helm-charts, right? Docker-compose in the clearml-serving repo is more for local testing?
I basically would like to know if we can serve the model without the TensorRT format, which is highly efficient but more complicated to obtain.