AgitatedDove14 - looks like the serving is doing the savemodel stuff?
The agent ip? Generally what’s the expected pattern to deploy and scale this for multiple models?
Yes the agent's IP, and with multiple agents, one would probably use k8s for the nodes, then configure ingest. This is the next step for the cleaml-serving, adding support for KFServing or manually configuring the ingest. wdyt?
AgitatedDove14 - it does have boto but the clearml-serving installation and code refers to older commit hash and hence the task was not using them - https://github.com/allegroai/clearml-serving/blob/main/clearml_serving/serving_service.py#L217
Also btw, is this supposed to be screenshot from community verison? https://github.com/manojlds/clearml-serving/blob/main/docs/webapp_screenshots.gif
I think you are correct, it seems like it is missing requirements to boto/azure/google (I will make sure this is added). In the meantime, you can stop the "triton serving engine" Task, reset it, add boto3 to the installed packages and relaunch.
That said your main issue might be packaging the python model. Basically you need to create a model from the entire folder (with whatever there is inside the folder), then Triton should be able to run it (if the config.pbtxt is correct).
m = OutputModel() m.update_weights_package(weights_path='path/goes/here/')
Do we launch multiple gorups of these in different projects?
Actually Triton can serve multiple models and the endpoints/models are controlled from the clearml-serving.
The only issue is adding a load-balancer in front of multiple nodes to balance the requests between them. wdyt?
Yes, I have no experience with triton does it do lazy loading? Was wondering how it can handle 10s, 100s of models. If we load balance across a set of these engine containers with say 100 models and all of these models get traffic but distribution is not even, each of those engine container will load all those 100 models?
Was wondering how it can handle 10s, 100s of models.
Yes, it supports dynamically loading/unloading models based on requests
(load balancing multiple nodes is disconnected from it, but assuming they are under diff endpoints, the load balancer can be configured to route accordingly)
Also btw, is this supposed to be screenshot from community verison
Hmm seems like screenshot from an enterprise version, I'll ask them to update 🙂
I am also not understanding how clearml-serving is doing the version for models in triton.
Basically you have two Tasks, one is the "controller" checking model changes and updating itself.
The other is the engine, checking on the "controller" Task, which models it needs to download/configure and replaces them.
This way you can have multiple engines controlled from the same "controller" Task