Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running?
This happens even though all the pods are healthy and the endpoints are processing correctly.
The serving pods are supposed to ping "I'm alive", and that should verify that the serving control plane is alive.
Could it be that no requests are being served?
no requests are being served, as in there is no traffic indeed
I can't be sure of the version, I can't check at the moment. It's 1.3.0 off the top of my head, but I could be way off
what is actually setting the task status to Aborted?
my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
so I still can't figure out what sets the task status to Aborted
> no requests are being served, as in there is no traffic indeed
It might be that it only pings when requests are served
> what is actually setting the task status to Aborted?
The server watchdog, basically saying: no one is pinging "I'm alive" on this "Task", so I should abort it
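To illustrate the mechanism: a minimal sketch of what such a watchdog conceptually does, not the actual ClearML server code. The timeout value and the task object's interface are assumptions; only the status message is taken from this thread.

```python
import time

# Illustrative sketch only -- NOT the actual ClearML server code.
# Any task that has not pinged "I'm alive" within the timeout window
# gets force-aborted with the status message seen in the UI.

NON_RESPONSIVE_TIMEOUT_SEC = 2 * 60 * 60  # hypothetical threshold

def watchdog_sweep(tasks, now=None):
    """Abort any task whose last "I'm alive" ping is older than the timeout."""
    now = now if now is not None else time.time()
    for task in tasks:
        if now - task.last_ping > NON_RESPONSIVE_TIMEOUT_SEC:
            # status message matches what shows up in the UI
            task.abort(status_message="Forced stop (non-responsive)")
```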
> my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
Basically this sounds like a sort of a bug, but I will need to check the code to be certain
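For context, a hedged sketch of the kind of refresh loop being described, assuming the clearml Python SDK. `serving_task_id` is a placeholder, and this illustrates the intended keep-alive behaviour, not the actual clearml-serving code:

```python
import threading

from clearml import Task

REFRESH_INTERVAL_SEC = 300  # the 300-second default mentioned above

def keep_alive_loop(serving_task_id: str, stop_event: threading.Event) -> None:
    task = Task.get_task(task_id=serving_task_id)
    while not stop_event.wait(REFRESH_INTERVAL_SEC):
        task.reload()  # re-fetch the task state from the server
        # "Aborted" in the UI corresponds to the backend status "stopped"
        if task.get_status() == "stopped":
            task.mark_started(force=True)  # revive the control-plane task
```

If the loop only runs (or only registers activity) while requests are being served, the watchdog above would abort the task exactly as reported, after a period with no traffic.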
@<1523701205467926528:profile|AgitatedDove14> I experience the exact same behaviour with clearml-serving (version 1.3.0). The status of the serving task goes to Aborted, the status message is also "Forced stop (non-responsive)", and it also happens after a while with no incoming traffic
Okay we have located the issue, thanks guys! We will push a patch release hopefully later today
Hi Martin, thanks a lot for looking into this so quickly. Will you let me know the version number once it's pushed? Thanks!
Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
Hey Martin, I will, but it's a bit more tricky because we have modifications in the code that I have to merge on our side
Are you building your containers off these two, or are you building directly from code?
we are actually building from our fork of the code into our own images and helm charts
Sure, in that case, wait until tomorrow, when the GitHub repo is fully synced
hey Martin, real quick actually: on your update to the requirements.txt file, isn't that constraint on fastapi inconsistent?
how can you be >= 0.109.1 and lower than 0.96?
ah, is the >= 0.109.1 from snyk?
Yep Snyk
auto "patching" is great 🙂
as I mentioned, wait for the GH sync tomorrow, a few more things are still missing there
In the meantime you can just do ">= 0.109.1"
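For reference, the inconsistent constraint would look something like this in requirements.txt (the exact file contents are an assumption), with the interim fix keeping only the bumped lower bound:

```
# inconsistent: no fastapi version satisfies both bounds
fastapi>=0.109.1,<0.96

# interim fix suggested above
fastapi>=0.109.1
```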
ok so I haven't looked at the latest changes after the sync this morning, but the ones we put in yesterday seem to have fixed the issue, the service is still running this morning at least.
@<1569858449813016576:profile|JumpyRaven4> fyi clearml-serving was synced 🤞
ok great, I'll check what other changes we missed yesterday
We put back the additional changes and so far it seems that this has solved our issue. Thanks a lot for the quick turnaround on this.