Hey ClearML community. A while back I was asking how one can perform inference on a video with clearml-serving, which includes an ensemble, preprocessing, and postprocessing.
Back then @<1523701205467926528:profile|AgitatedDove14> suggested that we override the process() function as well, and set it up so each frame is sent to the model asynchronously: copy-paste the original process() function (here), call it _process(), send each frame individually, and await the results once every batch_size frames.
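For anyone following along, here is a minimal sketch of the pattern described above, using plain asyncio only. It is not the actual clearml-serving code: `_process` below is a hypothetical stand-in for the copied single-frame function, and `process_video` and `run_demo` are names made up for illustration.

```python
import asyncio
from typing import Any, List


async def _process(frame: Any) -> Any:
    """Hypothetical stand-in for the copied single-frame process() call."""
    await asyncio.sleep(0.01)  # simulates the async request to the model
    return {"frame": frame, "ok": True}


async def process_video(frames: List[Any], batch_size: int = 8) -> List[Any]:
    """Schedule one request per frame and await them every batch_size frames."""
    results: List[Any] = []
    pending: List[asyncio.Task] = []
    for frame in frames:
        # create_task schedules the request without blocking on its result
        pending.append(asyncio.create_task(_process(frame)))
        if len(pending) == batch_size:
            # gather returns results in the same order the tasks were passed in
            results.extend(await asyncio.gather(*pending))
            pending = []
    if pending:
        # flush the last, possibly smaller, batch
        results.extend(await asyncio.gather(*pending))
    return results


def run_demo(num_frames: int = 20, batch_size: int = 8) -> List[Any]:
    """Run the sketch on dummy integer 'frames'."""
    return asyncio.run(process_video(list(range(num_frames)), batch_size))
```

With this structure, the frames within one batch are in flight concurrently, but the loop still waits for each batch to finish before scheduling the next one.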
However, we’ve come across some serious performance issues compared to setting this up on vanilla Triton.
I’m not entirely sure why, but the gRPC client setup that I’ve seen in the examples is different from the one used in ClearML-serving. For instance, each frame (image) takes ~2 seconds just to flatten() (link).
Overall, inference for a single batch (size=8) takes ~16 seconds on ClearML using the above approach, and only about 0.2 seconds on Triton. GPU usage is also substantially lower and less frequent on the ClearML side.
We’d like to continue and even improve this community. I just wanted to bring this up, brainstorm, and get any insights anyone might have. Thanks!