The agent ip? Generally what’s the expected pattern to deploy and scale this for multiple models?
Yes, the agent's IP, and with multiple agents one would probably use k8s for the nodes, then configure ingress. This is the next step for clearml-serving: adding support for KFServing or manually configuring the ingress. wdyt?
Forking and using the latest code fixes the boto issue, at least.
Think I will have to fork and play around with it 🙂
Yes, I have no experience with Triton - does it do lazy loading? Was wondering how it can handle 10s, 100s of models. If we load balance across a set of these engine containers with, say, 100 models, and all of these models get traffic but the distribution is not even, will each of those engine containers load all 100 models?
Do we launch multiple groups of these in different projects?
Actually Triton can serve multiple models, and the endpoints/models are controlled from clearml-serving.
The only issue is adding a load-balancer in front of multiple nodes to balance the requests between them. wdyt?
I am also not understanding how clearml-serving is doing the version for models in triton.
AgitatedDove14 - it does have boto, but the clearml-serving installation and code refer to an older commit hash, and hence the task was not using them - https://github.com/allegroai/clearml-serving/blob/main/clearml_serving/serving_service.py#L217
I think you are correct, it seems like it is missing the requirements for boto/azure/google (I will make sure this is added). In the meantime, you can stop the "triton serving engine" Task, reset it, add boto3 to the installed packages, and relaunch it.
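If you prefer doing it from the SDK rather than the UI, here is a minimal sketch, assuming a placeholder Task ID and queue name, and that your clearml version has Task.set_packages (otherwise just edit INSTALLED PACKAGES in the UI after the reset):

from clearml import Task

# abort the running "triton serving engine" Task first (e.g. from the UI), then:
engine_task = Task.get_task(task_id="TRITON_ENGINE_TASK_ID")  # placeholder ID
engine_task.reset()  # back to draft so the package list can be edited
engine_task.set_packages(["boto3"])  # available in recent clearml SDK versions
Task.enqueue(engine_task, queue_name="default")  # placeholder queue name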
That said, your main issue might be packaging the Python model. Basically you need to create a model from the entire folder (with whatever is inside it), then Triton should be able to run it (if the config.pbtxt is correct):
from clearml import OutputModel
m = OutputModel()
m.update_weights_package(weights_path='path/goes/here/')
Think I will have to fork and play around with it
NICE! (BTW: if you manage to get it working I'll be more than happy to help push the PR)
Maybe the quickest win is to store just the .py as the model?
Got the engine running.
curl <serving-engine-ip>:8000/v2/models/keras_mnist/versions/1
What’s the serving-engine-ip supposed to be?
That makes sense - one part I am confused about is - the Triton engine container hosts all the models, right? Do we launch multiple groups of these in different projects?
I used .update_weights(path), with path being the model dir containing the model.py and the config.pbtxt. Should I use update_weights_package?
Without some sort of automation on top, it feels a bit fragile.
Also btw, is this supposed to be a screenshot from the community version? https://github.com/manojlds/clearml-serving/blob/main/docs/webapp_screenshots.gif
Model says PACKAGE, that means it's fine, right?
Should I use update_weights_package?
Yes
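For reference, a minimal sketch of that, with placeholder project/model names and paths:

from clearml import Task, OutputModel

task = Task.init(project_name="serving examples", task_name="package model")  # placeholder names
m = OutputModel(task=task, name="keras_mnist", framework="custom")
# packages the whole directory (model.py, config.pbtxt, weights, ...) into a single artifact
m.update_weights_package(weights_path="path/to/model_dir/")

Unlike update_weights, which registers a single weights file, update_weights_package uploads the entire folder as one package.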
BTW, the config.pbtxt should be passed when "registering" the endpoint with the CLI.
Also btw, is this supposed to be a screenshot from the community version
Hmm, seems like a screenshot from the enterprise version, I'll ask them to update 🙂
I am also not understanding how clearml-serving is doing the version for models in triton.
Basically you have two Tasks: one is the "controller", which checks for model changes and updates itself.
The other is the engine, which checks the "controller" Task for the models it needs to download/configure, and replaces them.
This way you can have multiple engines controlled from the same "controller" Task.
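Roughly, the polling side could look like this - a sketch only, with a placeholder Task ID and an assumed configuration object name ("endpoints"), not the actual clearml-serving internals:

import time
from clearml import Task

CONTROLLER_TASK_ID = "CONTROLLER_TASK_ID"  # placeholder

while True:
    controller = Task.get_task(task_id=CONTROLLER_TASK_ID)
    # hypothetical configuration object holding the endpoint -> model mapping
    endpoints = controller.get_configuration_object("endpoints")
    if endpoints:
        # download/configure the listed models and point Triton at them
        pass
    time.sleep(60)  # check the controller periodically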
On my to-do list, but it will have to wait for later this week (feel free to ping on this thread to remind me).
Regarding the issue at hand, let me check the requirements it is using.
Initially it was complaining about it, but then when I did the connect_configuration it started working
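For context, connect_configuration attaches a local file (here the config.pbtxt) as a configuration object on the Task; a minimal sketch, with placeholder names:

from pathlib import Path
from clearml import Task

task = Task.init(project_name="serving examples", task_name="register model")  # placeholder names
# returns a local path to the (possibly remotely-overridden) configuration file
config_path = task.connect_configuration(Path("config.pbtxt"), name="config.pbtxt")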
AgitatedDove14 - looks like the serving is doing the savemodel stuff?
https://github.com/allegroai/clearml-serving/blob/main/clearml_serving/serving_service.py#L554
For now that's a quick thing, but for actual use I will need a proper model (pkl) and the .py
Was wondering how it can handle 10s, 100s of models.
Yes, it supports dynamically loading/unloading models based on requests
(load balancing multiple nodes is disconnected from it, but assuming they are under diff endpoints, the load balancer can be configured to route accordingly)
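One way to see that in action: if the Triton engine is started with explicit model control (tritonserver --model-control-mode=explicit), models can be loaded/unloaded on demand through Triton's model repository HTTP API. A small sketch using requests, with a placeholder host and model name:

import requests

TRITON = "http://serving-engine-ip:8000"  # placeholder host:port

# list the models in the repository and whether they are currently loaded
print(requests.post(f"{TRITON}/v2/repository/index").json())

# load / unload a specific model on demand
requests.post(f"{TRITON}/v2/repository/models/keras_mnist/load")
requests.post(f"{TRITON}/v2/repository/models/keras_mnist/unload")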
The agent ip? Generally what’s the expected pattern to deploy and scale this for multiple models?