Can you tell me what the serving example is in terms of the explanation above, and what the Triton serving engine is?
This line actually creates the control Task (2):
clearml-serving triton --project "serving" --name "serving example"
This line configures the control Task (the idea is that you can do that even when the control Task is already running, but in this case it is still in draft mode).
Notice that the actual model serving configuration is already stored on the creating Task/Model. Otherwise you have to provide the model serving configuration explicitly, i.e. input matrix size, type, etc. This is the config.pbtxt file; see the example at https://github.com/allegroai/clearml-serving/blob/7c1c02c9ea49c9ee6ffbdd5b59f5fd8a6f78b4e0/examples/keras/keras_mnist.py#L51
clearml-serving triton --endpoint "keras_mnist" --model-project "examples" --model-name "Keras MNIST serve example - serving_model"
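For context, a Triton config.pbtxt for an MNIST-style model looks roughly like the sketch below. The model name, platform, tensor names, and dims here are illustrative assumptions, not the actual values from the linked example; check the keras_mnist.py link above for the real configuration:

```
# Hypothetical config.pbtxt sketch (names/platform/dims are assumptions)
name: "keras_mnist"
platform: "tensorflow_savedmodel"
input [
  {
    name: "dense_input"        # assumed input tensor name
    data_type: TYPE_FP32
    dims: [ -1, 784 ]          # assumed flattened 28x28 input
  }
]
output [
  {
    name: "activation_2"       # assumed output tensor name
    data_type: TYPE_FP32
    dims: [ -1, 10 ]           # 10 digit classes
  }
]
```

This is the kind of information (input/output matrix sizes and types) you would otherwise have to pass explicitly if it is not already stored on the creating Task/Model.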
Then you launch the control Task (2) (this is the one we just configured; by default it will launch on the services queue, and you can also spin up an additional agent to listen to the services queue).
The control Task is actually the Task that creates the serving Task and enqueues it.
(The idea is that it will do auto load balancing, based on serving performance, right now it is still static).
To control the way the serving Task is created and enqueued, check the full help:
`clearml-serving --help`
`clearml-serving triton --help`
I'm assuming the Triton serving engine is running on the serving queue in my case. Is the serving example also running on the serving queue, or is it running on the services queue? And lastly, I don't have a ClearML agent listening to the services queue; does ClearML do this on its own?
Yes the serving is a bit complicated. Let me try to explain the underlying setup, before going into more details.
clearml-serving CLI -> a tool to launch / set up. It does the configuration and enqueuing, not the actual serving.
Control plane Task -> stores the state of the serving (i.e. which endpoints need to be served, which models are used, and it collects stats). This Task has no actual communication with the serving requests/replies. (Runs on the services queue.)
Serving Task -> the actual Task doing the serving (supports multiple instances). This is where the requests are routed to, and where the inference happens. It pulls the configuration from the control plane Task and configures itself based on it. It also reports stats on its performance back to the control plane. This is where the Triton Engine is running: inside the Triton container, with ClearML running in the same container, pulling the actual models and feeding them to the Triton server. (Runs on a GPU/CPU queue.)
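To make the queue split concrete, spinning up the two agents could look something like the commands below. These are standard `clearml-agent` invocations, but the queue names are just the defaults implied above (assumptions, adjust to your setup), and note that some ClearML server deployments already run a built-in agent for the services queue:

```
# Agent for the control plane Task (CPU is fine, listens to "services"):
clearml-agent daemon --queue services --detached

# Agent for the actual Triton serving Task, on a GPU machine
# ("default" is an assumed queue name; use whatever queue you enqueue the serving Task to):
clearml-agent daemon --queue default --gpus 0 --detached
```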
Does that make sense?