Hi all, we have clearml-server running on a kube pod, and then a GPU server running the clearml-agent, which we use to queue jobs.
For some reason, our kube pod restarted (we're looking into why), but in the process of this happening all jobs on the worke
@DepravedBee82 the agent (and the SDK) will wait for quite some time and keep retrying even if the server is not available. The printout you attached is not something the agent or the SDK prints. Is it something your own code prints? In general, I can easily test this (and just did 🙂) by running an agent with a task and simply disconnecting the network cable: the agent keeps trying for a very long time before giving up (backoff times keep increasing, and the maximum number of retries for network connectivity is 254 by default).
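To illustrate the retry behaviour described above, here is a minimal sketch of a generic retry loop with increasing backoff. It is not the actual ClearML agent or SDK code. The function name call_with_backoff and the base_delay and max_delay values are made up for the example; only the default of 254 retries comes from the reply above.

```python
import time


def call_with_backoff(func, max_retries=254, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func() until it succeeds, waiting longer after each connection failure.

    Hypothetical helper for illustration only; the delay values are assumptions.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ConnectionError:
            # Out of retries: give up and surface the last error to the caller.
            if attempt == max_retries:
                raise
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
```

With numbers like these, a short server restart is absorbed by a few retries, and a job would only fail after the server had been unreachable for a long time.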
By the way, the sdk.network.iteration.max_retries_on_server_error setting is not actually used by the clearml python package, only by the ClearML enterprise python package.