ok great, I'll check what other changes we missed yesterday
Hi Martin, thanks a lot for looking into this so quickly. Will you let me know the version number once it's pushed? Thanks!
we are actually building from our fork of the code into our own images and helm charts
how can you be >= 0.109.1 and lower than 0.96
Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running ?
This happens even though all the pods are healthy and the endpoints are processing correctly.
The serving pods are supposed to ping "I'm alive" and that should verify the serving control plane is alive.
Could it be no requests are being served ?
so I still can't figure out what sets the task status to Aborted
Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
@<1523701205467926528:profile|AgitatedDove14> I experience the exact same behaviour with clearml-serving (version 1.3.0). The status of the serving task goes to Aborted, the status message is also "Forced stop (non-responsive)", and it also happens after a while with no incoming traffic.
Okay we have located the issue, thanks guys! We will push a patch release hopefully later today
hey Martin, real quick actually: on your update to the requirements.txt file, isn't that constraint on fastapi inconsistent?
how can you be >= 0.109.1 and lower than 0.96
Yep Snyk
auto "patching" is great 🙂
as I mentioned wait for the GH sync tomorrow, a few more things are missing there
In the meantime you can just do ">= 0.109.1"
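A minimal sketch of why the questioned pair of bounds cannot be satisfied, assuming plain dotted numeric versions. The helper names `parse` and `satisfies` are hypothetical and are not part of pip or clearml-serving; real requirement files should be checked with the packaging tooling itself.

```python
def parse(version):
    """Turn a plain dotted version such as '0.109.1' into (0, 109, 1)."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, lower, upper):
    """True if lower <= version < upper, compared numerically part by part."""
    return parse(lower) <= parse(version) < parse(upper)


candidates = ["0.95.2", "0.96.0", "0.109.1", "0.110.0"]

# The pair of bounds questioned in the thread: ">= 0.109.1" together with "< 0.96".
# 109 is greater than 96, so the lower bound already sits above the upper bound.
matching = [v for v in candidates if satisfies(v, "0.109.1", "0.96")]
print(matching)  # [] : no version can satisfy both bounds

# Dropping the upper bound, as suggested above, leaves a satisfiable constraint.
relaxed = [v for v in candidates if parse(v) >= parse("0.109.1")]
print(relaxed)  # ['0.109.1', '0.110.0']
```

Note that a plain string comparison would get this wrong, since "0.109.1" sorts before "0.96" character by character, which is why the sketch compares integer tuples.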
I can't be sure of the version as I can't check at the moment. Off the top of my head it's 1.3.0, but I could be way off
@<1569858449813016576:profile|JumpyRaven4> fyi clearml-serving was synced 🤞
We put back the additional changes and so far it seems that this has solved our issue. Thanks a lot for the quick turnaround on this.
no requests are being served as in there is no traffic indeed
It might be that it only pings when requests are served
what is actually setting the task status to Aborted?
server watchdog, basically saying: no one is pinging "I'm alive" on this "Task", so I should abort it
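A minimal sketch of the watchdog idea described here, not the actual ClearML server code. The `Task` record, the `ping` and `watchdog_sweep` helpers, and the 60-second timeout are all hypothetical; only the "Forced stop (non-responsive)" message comes from this thread.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical stand-in for a server-side task record.
    name: str
    status: str = "in_progress"
    status_message: str = ""
    last_ping: float = field(default_factory=time.monotonic)


def ping(task, now=None):
    """Record an "I'm alive" signal for the task."""
    task.last_ping = time.monotonic() if now is None else now


def watchdog_sweep(tasks, timeout_sec, now=None):
    """Abort every running task whose last ping is older than timeout_sec."""
    now = time.monotonic() if now is None else now
    aborted = []
    for task in tasks:
        if task.status == "in_progress" and now - task.last_ping > timeout_sec:
            task.status = "aborted"
            task.status_message = "Forced stop (non-responsive)"
            aborted.append(task.name)
    return aborted


# Two tasks start at t=0; only one of them pings again at t=100.
serving = Task("serving-control-plane", last_ping=0.0)
other = Task("other-task", last_ping=0.0)
ping(other, now=100.0)

# At t=120 with an assumed 60 s timeout, the silent task is aborted.
aborted_names = watchdog_sweep([serving, other], timeout_sec=60.0, now=120.0)
print(aborted_names)  # ['serving-control-plane']
```

In this model a perfectly healthy service still gets aborted if it simply stops pinging, which matches the symptom reported in the thread.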
my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
Basically this sounds like a sort of a bug, but I will need to check the code to be certain
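A minimal sketch of a heartbeat that runs on a timer rather than only when requests arrive, which is the behaviour the exchange above expects from the daemon thread. `start_heartbeat` and `send_ping` are hypothetical names, not clearml-serving APIs; the 300-second default simply mirrors the interval mentioned in the thread.

```python
import threading
import time


def start_heartbeat(send_ping, interval_sec=300.0):
    """Call send_ping() immediately and then every interval_sec, until stopped."""
    stop_event = threading.Event()

    def loop():
        send_ping()
        # Event.wait returns False on timeout and True once stop_event is set,
        # so the loop pings on every timeout and exits promptly when stopped.
        while not stop_event.wait(interval_sec):
            send_ping()

    thread = threading.Thread(target=loop, daemon=True, name="heartbeat")
    thread.start()
    return stop_event, thread


# Demo with a short interval so it finishes quickly; no traffic is involved.
pings = []
stop, thread = start_heartbeat(lambda: pings.append(time.monotonic()), interval_sec=0.05)
time.sleep(0.3)
stop.set()
thread.join(timeout=1.0)
print(len(pings) >= 2)  # True
```

Because the loop is driven by the timer and not by incoming requests, an idle service keeps reporting that it is alive.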
ok so I haven't looked at the latest changes after the sync this morning, but the ones we put in yesterday seem to have fixed the issue; the service is still running this morning at least.
Hey Martin, I will, but it's a bit more tricky because we have modifications in the code that I have to merge on our side
Are you building your containers off these two, or are you building directly from code?
Sure, in that case, wait until tomorrow, when the github repo is fully synced