Answered

Hi all, we have clearml-server running on a kube pod, and then a GPU server running the clearml-agent which we use to queue jobs.

For some reason, our kube pod restarted (we're looking into why), but in the process of this happening all jobs on the worker were aborted ("Process terminated by user"). Obviously the first thing we need to fix is the kube issue, but is there any way of making the worker resilient to a network outage/dropped connection so that any currently running jobs are not automatically terminated? Maybe a timeout setting somewhere?

Edit: on closer inspection, the pod did not restart this morning, so something else is causing the processes to terminate.

  
  
Posted one day ago

Answers 2


It seems like the worker lost network connectivity, and then aborted the jobs 😞

2024-11-21T06:56:01.958962+00:00 mrl-plswh100 systemd-networkd-wait-online[2279529]: Timeout occurred while waiting for network connectivity.
2024-11-21T06:56:01.976055+00:00 mrl-plswh100 apt-helper[2279520]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)
2024-11-21T06:57:15.810747+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.810789+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.810825+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.810876+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.874338+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.874350+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.874363+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.874375+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.903967+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.903979+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.903991+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.904002+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:16.101657+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:16.101692+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:16.101727+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:16.101761+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:23.867991+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:23.868026+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:23.868061+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:23.868097+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.117654+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.117688+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.117722+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.117756+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.313929+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.313964+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.313999+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.314043+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:25.319252+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:25.319286+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:25.319322+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:25.319364+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:26.322691+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:26.322725+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:26.322759+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:26.322793+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.retry_backoff_factor_sec = 10
  
  
Posted one day ago

@DepravedBee82 the agent (and SDK) will wait for quite some time and keep retrying even if the server is not available. The printout you've attached is not something the agent or SDK prints. Is this something your code prints? In general, I can easily test that (and just did 🙂 ) by running an agent with a task and simply disconnecting the network cable: the agent will keep trying for a very long time before giving up (backoff times keep increasing, and the max retries for network connectivity is 254 by default).
By the way, sdk.network.iteration.max_retries_on_server_error is not actually used by the clearml Python package, only by the ClearML Enterprise Python package.
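To illustrate what "backoff times keep increasing" means in practice, here is a minimal Python sketch of a retry loop with capped exponential backoff. It is not the actual ClearML agent or SDK code: the helper names (backoff_delays, call_with_retries), the doubling schedule, the 1-second factor and the 120-second cap are all assumptions made for the example. Only the default of 254 retries is taken from the answer above.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int = 254, factor: float = 1.0,
                   cap: float = 120.0) -> Iterator[float]:
    """Yield one sleep duration per retry: factor * 2**attempt, capped at `cap`.

    Hypothetical helper for illustration; the schedule is an assumption.
    """
    for attempt in range(max_retries):
        # Clamp the exponent so very large retry counts cannot overflow a float.
        yield min(cap, factor * (2 ** min(attempt, 32)))


def call_with_retries(request: Callable[[], T], max_retries: int = 254,
                      factor: float = 1.0, cap: float = 120.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `request`, retrying on ConnectionError with increasing delays."""
    for delay in backoff_delays(max_retries, factor, cap):
        try:
            return request()
        except ConnectionError:
            sleep(delay)  # wait longer after each failed attempt
    # Final attempt: if this one fails too, the error propagates to the caller.
    return request()
```

With a schedule like this, a short network outage costs only a few delayed retries, and the caller gives up only after the whole retry budget is exhausted.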

  
  
Posted 14 hours ago