, but are you suggesting sending the requests to Triton frame-by-frame?
yes! The Triton backend will do the autobatching, and in an enterprise deployment the gRPC load balancer will split it across multiple GPU nodes 🙂
Hi @<1547028116780617728:profile|TimelyRabbit96>
Trying to do model inference on a video, so the first step in the Preprocess class is to extract frames.
Basically this depends on the REST API. Usually you would be sending a link to the data to be processed, and the result is returned synchronously.
What you should have is a custom endpoint doing the extraction, which sends the raw data into another endpoint doing the model inference. Basically think "pipeline" endpoints:
None
I see, actually what you should do is a fully custom endpoint,
- preprocessing -> download video
- processing -> extract frames and send them to Triton with gRPC (see below how)
- post processing, return a human readable answer
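The three stages above can be sketched as one custom endpoint class. This is only a minimal sketch: the method names and signatures follow the Preprocess convention from the clearml-serving examples as I understand it (treat them as an assumption), and download_video, extract_frames and infer_frame are hypothetical stand-ins for the real download, frame extraction, and Triton gRPC call.

```python
from typing import Any, Callable, Iterator, List, Optional


def download_video(url: str) -> str:
    # Hypothetical stand-in: a real version would fetch the file (e.g. from S3)
    # and return a local path.
    return "/tmp/" + url.rsplit("/", 1)[-1]


def extract_frames(path: str, count: int = 3) -> Iterator[List[int]]:
    # Hypothetical stand-in: yields dummy "frames" instead of decoding video.
    for i in range(count):
        yield [i, i, i]


def infer_frame(frame: List[int]) -> int:
    # Hypothetical stand-in for the per-frame Triton gRPC request.
    return sum(frame)


class Preprocess:
    def preprocess(self, body: dict, state: dict,
                   collect_custom_statistics_fn: Optional[Callable] = None) -> Any:
        # Stage 1: download the video referenced by the request.
        return download_video(body["url"])

    def process(self, data: Any, state: dict,
                collect_custom_statistics_fn: Optional[Callable] = None) -> Any:
        # Stage 2: extract frames and run inference on each one.
        return [infer_frame(frame) for frame in extract_frames(data)]

    def postprocess(self, data: Any, state: dict,
                    collect_custom_statistics_fn: Optional[Callable] = None) -> dict:
        # Stage 3: return a human-readable answer.
        return {"frames": len(data), "predictions": data}


def run_endpoint(body: dict) -> dict:
    # Chains the three stages the way a serving layer would call them.
    endpoint, state = Preprocess(), {}
    local_path = endpoint.preprocess(body, state)
    raw = endpoint.process(local_path, state)
    return endpoint.postprocess(raw, state)
```

Calling run_endpoint({"url": "s3://bucket/video.mp4"}) walks a request through all three stages, which makes it easy to test the flow locally before wiring in the real download and inference calls.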
Regarding the processing itself, what you need is to take this function (copy paste):
None
have it as internal_process(numpy_frame)
and then have something along the lines of this pseudo code
def process(...):
    results_batch = []
    for frame in my_video_frame_extractor(file_name_here):
        np_frame = np.array(frame)
        result = self.executor.submit(self._process, data=np_frame)
        results_batch += [result]
        if len(results_batch) == BATCH_SIZE:
            # collect all the results back
            # and clear the batch
            results_batch = []
This will scale the GPU pods horizontally, as well as autobatch the inference 🙂
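A runnable variant of the pseudo code might look like the sketch below. It assumes a plain ThreadPoolExecutor as the executor and a BATCH_SIZE of 8; fake_frame_extractor and process_frame are hypothetical stand-ins for the real frame extractor and the single-frame Triton request. It also collects the leftover futures at the end, which matters when the frame count is not a multiple of the batch size.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List

BATCH_SIZE = 8  # assumed value, tune to the deployment


def fake_frame_extractor(num_frames: int) -> Iterator[List[int]]:
    # Hypothetical stand-in for my_video_frame_extractor: yields dummy frames.
    for i in range(num_frames):
        yield [i]


def process_frame(frame: List[int]) -> int:
    # Hypothetical stand-in for the single-frame Triton request.
    return frame[0] * 2


def process_video(frames: Iterable[List[int]], max_workers: int = 4) -> List[int]:
    results: List[int] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = []
        for frame in frames:
            # One request per frame; the server side is free to batch them.
            pending.append(executor.submit(process_frame, frame))
            if len(pending) == BATCH_SIZE:
                # Collect the results in submission order, then clear the batch.
                results.extend(f.result() for f in pending)
                pending = []
        # Collect whatever is left over from the last partial batch.
        results.extend(f.result() for f in pending)
    return results
```

Waiting on each group of BATCH_SIZE futures caps the number of in-flight requests, so a long video does not queue thousands of frames at once.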
can we use a currently setup virtualenv by any chance?
You mean, whether the clearml-agent needs to set up a new venv each time? Are you running in docker mode?
(by default it is caching the venv so the second time it is using a precached full venv, installing nothing)
@<1560074028276781056:profile|HealthyDove84> if you want you can PR a fix, it should be very simple basically:
None
    elif np_dtype == str:
        return "STRING"
    elif np_dtype == np.object_ or np_dtype.type == np.bytes_:
        return "BYTES"
    return None
Perfect, thank you so much!! 🙏
@<1560074028276781056:profile|HealthyDove84> This is how we’d tackle the video-to-frame ratio issue
Gotcha, thanks a lot @<1523701205467926528:profile|AgitatedDove14> . One issue that I see is that the Dockerfile inside the agent container is what's being used and doesn't seem like it can be replaced by any of these:
CLEARML_AGENT_DEFAULT_BASE_DOCKER: "nvidia/cuda:11.6.1-runtime-ubuntu20.04"
TRAINS_AGENT_DEFAULT_BASE_DOCKER: "nvidia/cuda:11.6.1-runtime-ubuntu20.04"
TRAINS_DOCKER_IMAGE: "nvidia/cuda:11.6.1-runtime-ubuntu20.04"
Are we missing something?
Nevermind, figured it out, it was using a cached container for some reason 🙂
notice that even inside docker the venv is cached on the host machine 🙂
Thanks for your response @<1523701205467926528:profile|AgitatedDove14> , this would be from the model. Something like the TYPE_STRING that Triton accepts.
and of course if your docker has packages preinstalled they are automatically used (not reinstalled)
Hi @<1523701205467926528:profile|AgitatedDove14> , thanks for the always-fast response! 🙂
Yep, so I am sending a link to an S3 bucket, and have set up a Triton ensemble within clearml-serving.
This is the gist of what i’m doing:
so essentially i am sending raw data, but i can only send the first 8 frames (L45) since i can’t really send the data in a list or something?
One issue that I see is that the Dockerfile inside the agent container
Not sure I follow, these are settings for the default container to be used when the agent spins a Task for you.
How are you running the agent itself ?
@<1523701205467926528:profile|AgitatedDove14> Regarding the clearml-agent, can we use a currently setup virtualenv by any chance?
I see, trying to A/B test the virtualenv vs docker.
So actually while we’re at it, we also need to return back a string from the model, which would be where the results are uploaded to (S3).
Is this being returned from your Triton Model? or the pre/post processing code?
Great, will try that, thanks @<1523701205467926528:profile|AgitatedDove14> !
I see, very interesting. I know this is pseudo-code, but are you suggesting sending the requests to Triton frame-by-frame?
Or perhaps np_frame = np.array(frame) itself could be a slice of the total_frames?
Like:
Dataset: [700, x, y, 3]
Batch: [8, x, y, 3]
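For comparison, slicing the full set of frames into fixed-size batches could be sketched as below. This is an illustration only: iter_batches is a hypothetical helper, and a range of 700 indices stands in for the [700, x, y, 3] array, so the last batch comes out smaller than 8.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(frames: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    # Yield consecutive slices of at most batch_size frames.
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(frames), batch_size):
        yield list(frames[start:start + batch_size])


# 700 dummy frame indices stand in for a [700, x, y, 3] array.
batches = list(iter_batches(range(700), 8))
```

With 700 frames and a batch of 8 the final slice holds only 4 frames, so a model with a fixed batch dimension would need padding or a separate path for the tail.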
I think that makes sense, and in the end deploy this endpoint like the pipeline example.
Notice this is per frame (single) not per 8
So actually while we’re at it, we also need to return back a string from the model, which would be where the results are uploaded to (S3).
I was able to send back a URL with Triton directly, but the input/output shape mapping doesn’t seem to support strings in ClearML. I have opened an issue for it: None
Am i missing something?
Or we need to setup the dependencies every time the experiment is run?
Something like the TYPE_STRING that Triton accepts.
I saw the GitHub issue, this is so odd, look at the Triton Python package:
https://github.com/triton-inference-server/client/blob/4297c6f5131d540b032cb280f1e[…]1fe2a0744f8e1/src/python/library/tritonclient/utils/init.py