Okay we have located the issue, thanks guys! We will push a patch release hopefully later today
Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running ?
This happens even though all the pods are healthy and the endpoints are processing correctly.
The serving pods are supposed to ping "I'm alive", and that should verify the serving control plane is alive.
Could it be no requests are being served ?
we are actually building from our fork of the code into our own images and helm charts
ok great, I'll check what other changes we missed yesterday
Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
no requests are being served as in there is no traffic indeed
It might be that it only pings when requests are served
what is actually setting the task status to Aborted?
The server watchdog, basically saying: no one is pinging "I'm alive" on this "Task", so I should abort it
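For anyone following along, here is a rough sketch of the watchdog behaviour described above. It is not the ClearML server code: the `TaskRecord` class, the `ping` and `watchdog_sweep` helpers, the status names, and the timeout value are all hypothetical, picked only to illustrate the idea. The status message is the one reported in this thread.

```python
from dataclasses import dataclass

# Hypothetical timeout; the real server-side value is not stated in this thread.
NON_RESPONSIVE_TIMEOUT_SEC = 7200.0


@dataclass
class TaskRecord:
    """Hypothetical stand-in for a task tracked by the server."""
    task_id: str
    status: str
    last_ping: float  # time (seconds) of the last "I'm alive" ping
    status_message: str = ""


def ping(task: TaskRecord, now: float) -> None:
    """Record an "I'm alive" ping from the process that owns the task."""
    task.last_ping = now


def watchdog_sweep(tasks, now, timeout=NON_RESPONSIVE_TIMEOUT_SEC):
    """Abort every running task whose last ping is older than `timeout`."""
    aborted = []
    for task in tasks:
        if task.status == "in_progress" and now - task.last_ping > timeout:
            task.status = "aborted"
            task.status_message = "Forced stop (non-responsive)"
            aborted.append(task.task_id)
    return aborted


if __name__ == "__main__":
    serving = TaskRecord("serving-control-plane", "in_progress", last_ping=0.0)
    # No pings arrive (for example, because pings only happen while serving
    # requests), so a later sweep marks the task as aborted.
    print(watchdog_sweep([serving], now=NON_RESPONSIVE_TIMEOUT_SEC + 1.0))
    print(serving.status, "-", serving.status_message)
```

Under this model the pods can be perfectly healthy and the task still gets aborted: the sweep only looks at the last ping time, not at pod health.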
my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
Basically this sounds like a sort of a bug, but I will need to check the code to be certain
so i still can't figure out what sets the task status to aborted
We put back the additional changes and so far it seems that this has solved our issue. Thanks a lot for the quick turnaround on this.
how can you be >= 0.109.1 and lower than 0.96
Sure, in that case, wait until tomorrow, when the github repo is fully synced
Hey Martin, I will, but it's a bit more tricky because we have modifications in the code that I have to merge on our side
@<1523701205467926528:profile|AgitatedDove14> I experience the exact same behaviour with clearml-serving (version 1.3.0). The status of the serving task goes to Aborted, the status message is also "Forced stop (non-responsive)", and it also happens after a while with no incoming traffic
how can you be snyk and lower than 0.96
Yep Snyk
auto "patching" is great 🙂
as I mentioned wait for the GH sync tomorrow, a few more things are missing there
In the meantime you can just do ">= 0.109.1"
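A quick way to see why the original pair of bounds cannot work, as a minimal sketch: the helper names `parse_version` and `is_satisfiable` are made up for illustration, and the parsing only handles plain dotted numeric versions, not full PEP 440 syntax.

```python
def parse_version(text: str) -> tuple:
    """Turn a plain dotted version such as "0.109.1" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_satisfiable(lower_inclusive: str, upper_exclusive: str) -> bool:
    """True if some version v can satisfy lower_inclusive <= v < upper_exclusive."""
    return parse_version(lower_inclusive) < parse_version(upper_exclusive)


if __name__ == "__main__":
    # The pair of bounds questioned in this thread: no version can match both,
    # because 109 is greater than 96 when compared as numbers.
    print(is_satisfiable("0.109.1", "0.96"))
    # Comparing the raw strings would mislead here, since "1" sorts before "9".
    print("0.109.1" < "0.96")
```

Dropping the upper bound and keeping only ">= 0.109.1" removes the contradiction.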
Hi Martin, thanks a lot for looking into this so quickly. Will you let me know the version number once it's pushed? Thanks!
ok so I haven't looked at the latest changes after the sync this morning, but the ones we put in yesterday seem to have fixed the issue; the service is still running this morning at least.
@<1569858449813016576:profile|JumpyRaven4> fyi clearml-serving was synced 🤞
build your containers off these two? or are you building directly from code ?
I can't be sure of the version as I can't check at the moment; off the top of my head it's 1.3.0, but I could be way off
hey Martin, real quick actually: on your update to the requirements.txt file, isn't that constraint on fastapi inconsistent?