Hey ClearML community. A while back I was asking how one can perform inference on a video with clearml-serving, including an ensemble, preprocessing, and postprocessing.
Back then @<1523701205467926528:profile|AgitatedDove14> suggested that we also override the process() function: copy-paste the original process() implementation ( here ), rename it _process(), send each frame to the model individually and asynchronously, and await the results for every batch_size frames (there’s a sketch of this at the end of the post).
However, we’ve come across some serious performance issues compared to setting this up on vanilla Triton.
I’m not entirely sure why, but the gRPC client setup I’ve seen in the Triton examples is different from the one used in clearml-serving. For instance, each frame (image) takes ~2 seconds just to flatten() ( link ).
Overall, inference on ClearML takes ~16 seconds for a single batch (size=8) using the above approach, versus only ~0.2 s on vanilla Triton. GPU usage is also substantially lower and more sporadic on the ClearML side.
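One small thing we noticed while digging into it: numpy’s flatten() always allocates and copies, while ravel() returns a view when the array is already contiguous. A quick illustrative micro-benchmark (dummy frame, not our actual data; the ~2 s we see per frame suggests something beyond the copy itself is also at play):

```python
import time
import numpy as np

# Dummy full-HD RGB frame, float32, C-contiguous.
frame = np.random.rand(1080, 1920, 3).astype(np.float32)

t0 = time.perf_counter()
flat_copy = frame.flatten()   # always allocates a new copy
t1 = time.perf_counter()
flat_view = frame.ravel()     # view (no copy) for contiguous arrays
t2 = time.perf_counter()

print(f"flatten(): {(t1 - t0) * 1e3:.3f} ms")
print(f"ravel():   {(t2 - t1) * 1e3:.3f} ms")
```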
We’d like to keep contributing to this community and even help improve it, so I just wanted to bring this up, brainstorm, and hear any insights others might have. Thanks!
This is basically what I follow for setting up my own Triton server:
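(Condensed sketch; the model name, input/output tensor names, and shapes below are placeholders rather than our actual config:)

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# One batch of 8 frames, NCHW float32 -- placeholder shape.
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)

infer_input = grpcclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)
infer_output = grpcclient.InferRequestedOutput("output__0")

result = client.infer(
    model_name="my_ensemble",  # placeholder name
    inputs=[infer_input],
    outputs=[infer_output],
)
predictions = result.as_numpy("output__0")
print(predictions.shape)
```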
This is the gist of our current setup using the recommended approach
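(Again condensed; this assumes the async custom Preprocess interface from the clearml-serving examples, and _process() stands in for the copy of the original process() gRPC call mentioned above, so treat the exact signatures as approximate:)

```python
import asyncio
from typing import Any


class Preprocess(object):
    # Custom clearml-serving preprocess class (sketch).

    async def _process(self, frame: Any, state: dict, collect_custom_statistics_fn=None) -> Any:
        # Copied from the original clearml-serving process() implementation:
        # the single-frame Triton gRPC call. Body omitted here.
        ...

    async def process(self, data: Any, state: dict, collect_custom_statistics_fn=None) -> Any:
        # Send every frame individually, awaiting results per batch_size chunk.
        batch_size = 8  # matches the batch size mentioned above
        frames = data   # assumed: a list of decoded video frames
        results = []
        for i in range(0, len(frames), batch_size):
            chunk = frames[i:i + batch_size]
            results.extend(await asyncio.gather(
                *(self._process(f, state, collect_custom_statistics_fn) for f in chunk)
            ))
        return results
```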