Hi SubstantialElk6
noted that clearml-serving does not support Spacy models out of the box and
So this is a good point.
To add any missing package to the preprocessing docker image, you can just add it to the following environment variable here: https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/docker/docker-compose.yml#L83
EXTRA_PYTHON_PACKAGES="spacy>1"
Regarding a custom engine, basically this is supported with --engine custom
You can see that the model pipeline example uses the same workflow:
https://github.com/allegroai/clearml-serving/blob/main/examples/pipeline/preprocess.py
Instructions on setup here:
https://github.com/allegroai/clearml-serving/tree/main/examples/pipeline
We should add a custom example (any chance you can add a GitHub issue, so we do not forget?)
Specifically for spaCy, how could we easily separate the pre/post-processing from the actual model inference? I would love to add it as a custom engine or example
2 and 3 - I want to manage access control over the RestAPI
Long story short, put a load-balancer in front of the entire thing (see the k8s setup), and have the load-balancer verify JWT token as authentication (this is usually the easiest)
1- Exactly, custom code
Yes, we need to add a custom example there (somehow forgotten)
Could you open an Issue for that?
in the meantime:
```python
from typing import Any, Callable, Optional


# Notice: the Preprocess class must be named "Preprocess"
# No need to inherit or to implement all methods
class Preprocess(object):
    """
    Preprocess class Must be named "Preprocess"
    Otherwise there are No limitations, No need to inherit or to implement all methods
    Notice! This is not thread safe! the same instance may be accessed from multiple threads simultaneously
    """

    def __init__(self):
        # set internal state, this will be called only once. (i.e. not per request)
        # it will also set the internal model_endpoint to reference the specific model endpoint object being served
        self.model_endpoint = None  # type: clearml_serving.serving.endpoints.ModelEndpoint
        self._model = None

    def load(self, local_file_name: str) -> Optional[Any]:  # noqa
        """
        Optional: provide loading method for the model
        useful if we need to load a model in a specific way for the prediction engine to work
        :param local_file_name: file name / path to read load the model from
        :return: Object is stored on self._model
        """
        pass

    def preprocess(
        self,
        body: dict,
        state: dict,
        collect_custom_statistics_fn: Optional[Callable[[dict], None]],
    ) -> Any:  # noqa
        """
        Optional: do something with the request data, return any type of object.
        The returned object will be passed as is to the inference engine

        :param body: dictionary as received from the RestAPI
        :param state: Use state dict to store data passed to the post-processing function call.
            This is a per-request state dict (meaning a new empty dict will be passed per request)
            Usage example:
            >>> def preprocess(..., state):
                    state['preprocess_aux_data'] = [1, 2, 3]
            >>> def postprocess(..., state):
                    print(state['preprocess_aux_data'])
        :param collect_custom_statistics_fn: Optional, if provided allows to send a custom set of key/values
            to the statistics collector service.
            None is passed if the statistics collector is not configured, or if the current request should not be collected

            Usage example:
            >>> print(body)
            {"x0": 1, "x1": 2}
            >>> if collect_custom_statistics_fn:
            >>>     collect_custom_statistics_fn({"x0": 1, "x1": 2})

        :return: Object to be passed directly to the model inference
        """
        return body

    def postprocess(
        self,
        data: Any,
        state: dict,
        collect_custom_statistics_fn: Optional[Callable[[dict], None]],
    ) -> dict:  # noqa
        """
        Optional: post process the data returned from the model inference engine
        returned dict will be passed back as the request result as is.

        :param data: object as received from the inference model function
        :param state: Use state dict to store data passed to the post-processing function call.
            This is a per-request state dict (meaning a dict instance per request)
            Usage example:
            >>> def preprocess(..., state):
                    state['preprocess_aux_data'] = [1, 2, 3]
            >>> def postprocess(..., state):
                    print(state['preprocess_aux_data'])
        :param collect_custom_statistics_fn: Optional, if provided allows to send a custom set of key/values
            to the statistics collector service.
            None is passed if the statistics collector is not configured, or if the current request should not be collected

            Usage example:
            >>> if collect_custom_statistics_fn:
            >>>     collect_custom_statistics_fn({"y": 1})

        :return: Dictionary passed directly as the returned result of the RestAPI
        """
        return data

    def process(
        self,
        data: Any,
        state: dict,
        collect_custom_statistics_fn: Optional[Callable[[dict], None]],
    ) -> Any:  # noqa
        """
        Optional: do something with the actual data, return any type of object.
        The returned object will be passed as is to the postprocess function engine

        :param data: object as received from the preprocessing function
        :param state: Use state dict to store data passed to the post-processing function call.
            This is a per-request state dict (meaning a dict instance per request)
            Usage example:
            >>> def preprocess(..., state):
                    state['preprocess_aux_data'] = [1, 2, 3]
            >>> def postprocess(..., state):
                    print(state['preprocess_aux_data'])
        :param collect_custom_statistics_fn: Optional, if provided allows to send a custom set of key/values
            to the statistics collector service.
            None is passed if the statistics collector is not configured, or if the current request should not be collected

            Usage example:
            >>> if collect_custom_statistics_fn:
            >>>     collect_custom_statistics_fn({"type": "classification"})

        :return: Object to be passed to the post-processing function
        """
        return data
```
Does that help?
1- Thanks, I need to experiment on that code. I guess I should use the AI model in the "process" function, correct?
2- Regarding the RestAPI authentication, I have a multi-tenant application, so I need different auth keys for each endpoint. Any ideas?
3- I admit I don't know the best practices for authorizing users with serving engines. If I am creating my own RestAPI, that is easy and flexible. But how do people solve this for TorchServe, Triton, etc.? Using load-balancer auth as you suggested?
- Correct. Basically the order is: RestAPI body dictionary -> preprocess -> process -> postprocess -> RestAPI dictionary returned
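As an illustration only, a minimal spaCy-flavored sketch of that flow (custom engine, model loaded in load() and run in process()) could look like this; the model name, request field, and response shape are assumptions, not part of the official examples:
```python
# preprocess.py -- hedged sketch of a custom-engine endpoint for spaCy NER.
# Assumptions (not from the official repo): spaCy and the "en_core_web_sm"
# model are installed (e.g. via EXTRA_PYTHON_PACKAGES), and requests arrive
# as {"text": "..."}.
from typing import Any, Callable, Optional

import spacy


class Preprocess(object):
    def __init__(self):
        # called once per serving instance, not per request
        self._model = None

    def load(self, local_file_name: str) -> Optional[Any]:  # noqa
        # load the spaCy pipeline; could also be spacy.load(local_file_name)
        # if the model artifact itself is stored in the model repository
        self._model = spacy.load("en_core_web_sm")
        return self._model

    def preprocess(
        self, body: dict, state: dict, collect_custom_statistics_fn: Optional[Callable[[dict], None]]
    ) -> Any:  # noqa
        # pull the raw text out of the RestAPI body
        return body["text"]

    def process(
        self, data: Any, state: dict, collect_custom_statistics_fn: Optional[Callable[[dict], None]]
    ) -> Any:  # noqa
        # the actual model inference lives here for a custom engine
        return self._model(data)

    def postprocess(
        self, data: Any, state: dict, collect_custom_statistics_fn: Optional[Callable[[dict], None]]
    ) -> dict:  # noqa
        # turn the spaCy Doc into a JSON-serializable response
        return {"entities": [{"text": ent.text, "label": ent.label_} for ent in data.ents]}
```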
Hi PerplexedCow66
I would like to know how to serve a model, even if I do not use any serving engine
What do you mean no serving engine, i.e. custom code?
Besides that, how can I manage authorization between multiple endpoints?
Are you referring to limiting access to all the endpoints?
How can I manage API keys to control who can access my endpoints?
Just to be clear, accessing the endpoints has nothing to do with the clearml-server credentials, so are you asking how to add access control over RestAPI ?
AgitatedDove14 The pipeline example is a bit unclear, I would like to know how to serve a model, even if I do not use any serving engine. Can you please formulate a more complete example on that? Besides that, how can I manage authorization between multiple endpoints? How can I manage API keys to control who can access my endpoints?
Many thanks!
Hello AgitatedDove14, I will summarize the responses:
1- Exactly, custom code
2 and 3 - I want to manage access control over the RestAPI
AgitatedDove14 is it possible that every endpoint controls its own JWT token in clearml-serving? In my multi-tenant application, the tenants can only access their own endpoints, that is why I need to authenticate them before using it.
Regarding the load-balancer authorizing users with JWT, do you have links/resources where I can take a deeper look? Thanks
2, 3) The question is whether the serving changes from one tenant to another, does it?
yes, every tenant has their own serving endpoint
Hi PerplexedCow66
I'm assuming an extension for this:
https://github.com/allegroai/clearml-serving/issues/32
Basically JWT can be used as a general allow/block for all endpoints, which is most efficiently handled by the k8s load-balancer (nginx/envoy),
but if you want a per-endpoint check (or maybe do something based on the JWT values)
See adding JWT to FastAPI here:
https://fastapi.tiangolo.com/tutorial/security/oauth2-jwt/?h=jwt#oauth2-with-password-and-hashing-bearer-with-jwt-tokens
Then you can basically quickly change this section to add the JWT (or use parts of the decoded JWT dict for routing)
https://github.com/allegroai/clearml-serving/blob/826f503cf4a9b069b89eb053696d218d1ce26f47/clearml_serving/serving/main.py#L90
wdyt?
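For illustration, a per-endpoint JWT check could be added as a FastAPI dependency along the lines of the sketch below (following the FastAPI tutorial linked above; the secret, algorithm, route, and token claims are placeholders, not the actual clearml-serving code):
```python
# Hedged sketch: verify a JWT per request before routing to a model endpoint.
# SECRET_KEY, ALGORITHM and the "tenant"/"endpoints" claims are illustrative only.
from fastapi import Depends, FastAPI, HTTPException, status
from fastapi.security import OAuth2PasswordBearer
from jose import JWTError, jwt

SECRET_KEY = "change-me"  # placeholder, keep it out of source control
ALGORITHM = "HS256"
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

app = FastAPI()


def verify_token(token: str = Depends(oauth2_scheme)) -> dict:
    try:
        # decode and validate the signature; raises on tampering/expiry
        return jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
    except JWTError:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid token")


@app.post("/serve/{model_endpoint}")
async def serve_model(model_endpoint: str, body: dict, claims: dict = Depends(verify_token)):
    # per-endpoint check: only let a tenant call the endpoints listed in its own token
    if model_endpoint not in claims.get("endpoints", []):
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Endpoint not allowed")
    # ... hand off to the actual model serving / preprocessing code here ...
    return {"endpoint": model_endpoint, "tenant": claims.get("tenant")}
```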
These are maybe good features to include in ClearML: https://docs.bentoml.org/en/latest/guides/securing_endpoints.html or https://docs.bentoml.org/en/latest/guides/server.html .
Sure, we should probably add a section into the doc explaining how to do that
Another approach is creating my own API on top of the clearml-serving endpoints, where I control each tenant's authentication.
I have to admit that to me this is a much better solution (than my/bento integrated JWT option). Generally speaking, I think this is the best approach: it separates the authentication layer from the execution layer (i.e. JWT parsing/auth in your code vs running models in clearml-serving), allows maximum flexibility, and does not require changing any 3rd party (i.e. clearml) code.
wdyt?
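As a rough illustration of that approach, a thin authentication layer in front of clearml-serving could look something like the sketch below (the API-key store, the route, and the serving base URL/path are assumptions made for the example, not documented values):
```python
# Hedged sketch: a separate per-tenant auth layer in front of clearml-serving.
# Each tenant gets its own API key and may only reach its own endpoints;
# the forwarding URL and the in-memory key store are placeholders.
import httpx
from fastapi import FastAPI, Header, HTTPException

SERVING_BASE_URL = "http://clearml-serving-inference:8080/serve"  # assumed address
TENANT_KEYS = {
    "key-tenant-a": {"tenant": "a", "endpoints": {"model_a"}},
    "key-tenant-b": {"tenant": "b", "endpoints": {"model_b"}},
}

app = FastAPI()


@app.post("/predict/{endpoint}")
async def predict(endpoint: str, body: dict, x_api_key: str = Header(...)):
    tenant = TENANT_KEYS.get(x_api_key)
    if tenant is None:
        raise HTTPException(status_code=401, detail="Unknown API key")
    if endpoint not in tenant["endpoints"]:
        raise HTTPException(status_code=403, detail="Endpoint not allowed for this tenant")
    # forward the original payload to the clearml-serving endpoint as-is
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{SERVING_BASE_URL}/{endpoint}", json=body)
    return resp.json()
```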
Thanks, these are good tips. I will think about that. Another approach is creating my own API on top of the clearml-serving endpoints, where I control each tenant's authentication.
These are maybe good features to include in ClearML: https://docs.bentoml.org/en/latest/guides/securing_endpoints.html or https://docs.bentoml.org/en/latest/guides/server.html .
Thank you for your attention and the short discussion!
Besides that, what are your impressions on these serving engines? Are they much better than just creating my own API + ONNX or even my own API + normal Pytorch inference?
I would separate ML frameworks from DL frameworks.
With ML frameworks, the main advantage is multi-model serving on a single container, which is more cost effective when it comes to serving multiple models, plus the ability to quickly update models from the clearml model repository (just tag + publish and the serving endpoints will auto-refresh themselves). There is no actual inference performance gain per model, but globally it is more efficient.
With DL, obviously all the ML advantages hold, but the main value is the fact that we separate the preprocessing onto a CPU instance and the DL model onto a GPU instance, which is a huge performance boost. On top of that, the GPU instance can serve multiple models at the same time (again, cost effective). The actual DL model inference boost comes from using Triton as the engine; Nvidia works hard to keep it highly optimized for inference, and they did a great job with it.
Agreed! I was trying to avoid this, because I wanted each tenant to access the serving endpoint directly, to maximize performance. But I guess I will lose just a few ms by separating the auth layer from the execution layer.
Besides that, what are your impressions on these serving engines? Are they much better than just creating my own API + ONNX or even my own API + normal Pytorch inference?
For example, if I decide to use clearml-serving --engine custom, what would be the drawbacks?