Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
We put back the additional changes and so far it seems that this has solved our issue. Thanks a lot for the quick turnaround on this.
Hey Martin, I will, but it's a bit more tricky because we have modifications in the code that I have to merge on our side
Hi Martin, thanks a lot for looking into this so quickly. Will you let me know the version number once it's pushed? Thanks!
no requests are being served as in there is no traffic indeed
It might be that it only pings when requests are served
what is actually setting the task status to Aborted?
server watchdog, basically saying, no one is pinging "I'm alive" on this "Task" I should abort it
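A minimal sketch of the watchdog idea described above, not the actual server code: the class name `Watchdog`, the `timeout_sec` value and the status strings are all hypothetical, chosen only to illustrate "no ping for too long, so abort".

```python
import time
from dataclasses import dataclass, field


@dataclass
class Watchdog:
    """Toy watchdog: aborts tasks whose last ping is older than `timeout_sec`."""
    timeout_sec: float                              # hypothetical threshold
    last_ping: dict = field(default_factory=dict)   # task_id -> timestamp
    status: dict = field(default_factory=dict)      # task_id -> status string

    def ping(self, task_id, now=None):
        # A task reports "I'm alive"; record when we last heard from it
        self.last_ping[task_id] = time.time() if now is None else now
        self.status[task_id] = "running"

    def sweep(self, now=None):
        # Abort every running task that has been silent for too long
        now = time.time() if now is None else now
        aborted = []
        for task_id, seen in self.last_ping.items():
            if self.status[task_id] == "running" and now - seen > self.timeout_sec:
                self.status[task_id] = "aborted"
                aborted.append(task_id)
        return aborted


wd = Watchdog(timeout_sec=600)
wd.ping("serving-task", now=0)
print(wd.sweep(now=300))    # [] (still within the timeout)
print(wd.sweep(now=1000))   # ['serving-task'] (silent for too long)
```

Under this model a healthy service that simply stops pinging looks identical to a dead one, which would match the "Forced stop (non-responsive)" message reported in this thread.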
my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
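For illustration, a periodic background loop of the kind mentioned here could look roughly like the sketch below. It is an assumption about the general pattern, not the clearml-serving implementation: `start_sync_daemon` and `sync_fn` are hypothetical names, and the 300 second default is taken only from the message above.

```python
import threading


def start_sync_daemon(sync_fn, interval_sec=300.0):
    """Call `sync_fn` every `interval_sec` seconds in a daemon thread.

    Returns (thread, stop_event); set the event to stop the loop.
    """
    stop_event = threading.Event()

    def _loop():
        # Event.wait returns False on timeout and True once stop_event is set,
        # so the loop runs sync_fn on every timeout and exits on stop.
        while not stop_event.wait(interval_sec):
            sync_fn()

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread, stop_event
```

The point of the sketch is that a timer-driven loop runs whether or not requests arrive, so if the "I'm alive" ping lived inside `sync_fn`, idle periods would not stop it.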
Basically this sounds like a sort of a bug, but I will need to check the code to be certain
build your containers off these two? or are you building directly from code ?
ok great, I'll check what other changes we missed yesterday
how can you be >= 0.109.1 and lower than 0.96
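A quick way to sanity-check a pair of bounds like this is sketched below. The exact specifier line from requirements.txt is not quoted in this thread, so the two bounds are taken from the question above, and the helper names `parse_version` and `range_is_satisfiable` are hypothetical.

```python
def parse_version(text):
    """Turn '0.109.1' into (0, 109, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def range_is_satisfiable(lower_inclusive, upper_exclusive):
    """True if some version v satisfies lower <= v < upper."""
    return parse_version(lower_inclusive) < parse_version(upper_exclusive)


print(range_is_satisfiable("0.109.1", "0.96"))  # False: no version fits
print(range_is_satisfiable("0.95.0", "0.96"))   # True
```

Comparing the raw strings would give the opposite answer for the first case, since "0.109.1" sorts before "0.96" character by character, which is one way such a constraint can slip through unnoticed.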
@<1569858449813016576:profile|JumpyRaven4> fyi clearml-serving was synced 🤞
Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running ?
This happens even though all the pods are healthy and the endpoints are processing correctly.
The serving pods are supposed to ping "I'm alive" and that should verify the serving control plane is alive.
Could it be no requests are being served ?
I can't be sure of the version since I can't check at the moment. I have 1.3.0 off the top of my head, but I could be way off
so I still can't figure out what sets the task status to Aborted
@<1523701205467926528:profile|AgitatedDove14> I experience the exact same behaviour with clearml-serving (version 1.3.0). The status of the serving task goes to Aborted, the status message is also "Forced stop (non-responsive)", and it also happens after a while with no incoming traffic
hey Martin, real quick actually: in your update to the requirements.txt file, isn't that constraint on fastapi inconsistent?
Okay we have located the issue, thanks guys! We will push a patch release hopefully later today
we are actually building from our fork of the code into our own images and helm charts
how can you be snyk and lower than 0.96
Yep Snyk
auto "patching" is great 🙂
as I mentioned wait for the GH sync tomorrow, a few more things are missing there
In the meantime you can just do ">= 0.109.1"
Sure, in that case, wait until tomorrow, when the github repo is fully synced
ok so I haven't looked at the latest changes after the sync this morning, but the ones we put in yesterday seem to have fixed the issue; the service is still running this morning at least.