Answered
Hi Guys, I Have Been Running The Clearml-Serving For A While Now And I Realize That From Time To Time After A Couple Of Hours The Serving Task (Control Plane) That Is Configured Through The Cli Goes Into Status Abort. This Happens Even Though All The Pods

Hi Guys,
I have been running the clearml-serving for a while now and I realize that from time to time after a couple of hours the serving task (control plane) that is configured through the cli goes into status Abort. This happens even though all the pods are healthy and the endpoints are processing correctly.
Any idea what happens and how to avoid that state? (Obviously, from there on we can't access/control the service through the cli.) Note that if we spin up new serving pods (by increasing the number of replicas, for instance), the service comes back online as Running.

Note that the STATUS REASON under the INFO tab of the task says "Forced stop (non-responsive)", but I can't get more info than that.

Thanks!

  
  
Posted 10 months ago

Answers 26


@<1523701205467926528:profile|AgitatedDove14> I experience the exact same behaviour for the clearml-serving (version 1.3.0). Status of the serving-task goes to Aborted, status message is also "Forced stop (non-responsive)" and also after a while of no incoming traffic

  
  
Posted 10 months ago

so they ping the web server?

  
  
Posted 10 months ago

how can you be >= 0.109.1 and lower than 0.96

Yep Snyk auto "patching" is great 🙂
as I mentioned wait for the GH sync tomorrow, a few more things are missing there
In the meantime you can just do ">= 0.109.1"

  
  
Posted 10 months ago

hey Martin, real quick actually: on your update to the requirements.txt file, isn't that constraint on fastapi inconsistent?

  
  
Posted 10 months ago

my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default

  
  
Posted 10 months ago

Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running ?

This happens even though all the pods are healthy and the endpoints are processing correctly.

The serving pods are supposed to ping "I'm alive" and that should verify the serving control plane is alive.
Could it be no requests are being served ?

  
  
Posted 10 months ago

no requests are being served as in there is no traffic indeed

  
  
Posted 10 months ago

Hey Martin, I will, but it's a bit more tricky because we have modifications in the code that I have to merge on our side

  
  
Posted 10 months ago

build your containers off these two? or are you building directly from code ?

  
  
Posted 10 months ago

Woot woot, great to hear 🎊

  
  
Posted 9 months ago

we are actually building from our fork of the code into our own images and helm charts

  
  
Posted 10 months ago

ok great, I'll check what other changes we have missed yesterday

  
  
Posted 9 months ago

ok so I haven't looked at the latest changes after the sync this morning, but the ones we put in yesterday seem to have fixed the issue; the service is still running this morning at least.

  
  
Posted 9 months ago

Okay we have located the issue, thanks guys! We will push a patch release hopefully later today

  
  
Posted 10 months ago

so i still can't figure out what sets the task status to aborted

  
  
Posted 10 months ago

thanks for your reply!

  
  
Posted 10 months ago

@<1569858449813016576:profile|JumpyRaven4> fyi clearml-serving was synced 🤞

  
  
Posted 9 months ago

what is actually setting the task status to Aborted ?

  
  
Posted 10 months ago

Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? just pull & run

allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
  
  
Posted 10 months ago

how can you be >= 0.109.1 and lower than 0.96
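
To spell out why a constraint like that cannot be met, here is a minimal sketch. It assumes the requirement combined an inclusive lower bound of 0.109.1 with an exclusive upper bound of 0.96 (the exact line in requirements.txt is not quoted in this thread), and it compares plain dotted numeric versions only. The helper names are made up for illustration.

```python
def parse(version, width=3):
    """Turn '0.109.1' into (0, 109, 1), padding short versions with zeros."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def is_satisfiable(lower_inclusive, upper_exclusive):
    """True if some version v can satisfy lower <= v < upper."""
    return parse(lower_inclusive) < parse(upper_exclusive)


# Versions compare component by component, so 0.109.1 is newer than 0.96
# (109 > 96), and no release can be both >= 0.109.1 and < 0.96.
print(is_satisfiable("0.109.1", "0.96"))
print(is_satisfiable("0.95", "0.96"))
```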

  
  
Posted 10 months ago

Sure, in that case, wait until tomorrow, when the github repo is fully synced

  
  
Posted 10 months ago

I can't be sure of the version, I can't check at the moment. I have 1.3.0 off the top of my head but could be way off

  
  
Posted 10 months ago

Hi Martin, thanks a lot for looking into this so quickly. Will you let me know the version number once it's pushed? Thanks!

  
  
Posted 10 months ago

We put back the additional changes and so far it seems that this has solved our issue. Thanks a lot for the quick turnaround on this.

  
  
Posted 9 months ago

no requests are being served as in there is no traffic indeed

It might be that it only pings when requests are served

what is actually setting the task status to Aborted ?

server watchdog, basically saying, no one is pinging "I'm alive" on this "Task" I should abort it

my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default

Yeah.. let me check that
Basically this sounds like a sort of a bug, but I will need to check the code to be certain
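
To illustrate the mechanism described above, here is a rough sketch of a watchdog that aborts a task when nobody has pinged "I'm alive" recently, plus a background heartbeat that pings whether or not requests are being served. This is not the clearml or clearml-serving code: `Watchdog`, `start_heartbeat`, the timeout values, and the choice to flip the status back to Running on the next ping are all assumptions made for the example.

```python
import threading
import time


class Watchdog:
    """Hypothetical server-side watchdog: aborts tasks that stop pinging."""

    def __init__(self, timeout_sec):
        self.timeout_sec = timeout_sec
        self.last_ping = {}   # task_id -> time of the last "I'm alive"
        self.status = {}      # task_id -> "Running" or "Aborted"
        self._lock = threading.Lock()

    def ping(self, task_id, now=None):
        """Record an "I'm alive" ping and mark the task as Running."""
        now = time.monotonic() if now is None else now
        with self._lock:
            self.last_ping[task_id] = now
            self.status[task_id] = "Running"

    def sweep(self, now=None):
        """Abort every task whose last ping is older than the timeout."""
        now = time.monotonic() if now is None else now
        with self._lock:
            for task_id, seen in self.last_ping.items():
                if now - seen > self.timeout_sec:
                    self.status[task_id] = "Aborted"


def start_heartbeat(watchdog, task_id, interval_sec, stop_event):
    """Ping on a timer from a daemon thread, independent of traffic."""
    def loop():
        while not stop_event.is_set():
            watchdog.ping(task_id)
            stop_event.wait(interval_sec)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

If the ping only happens inside the request handler, an idle service looks identical to a dead one from the watchdog's point of view, which would match the symptom reported here. A timer-driven heartbeat like `start_heartbeat` keeps the task alive during quiet periods.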

  
  
Posted 10 months ago