@<1724960464275771392:profile|DepravedBee82> the agent (and SDK) will wait for quite some time and keep retrying even if the server is not available. The printout you've attached is not something the agent or SDK prints - is it something your code prints? In general, I can easily test that (and just did 🙂) by running an agent with a task and simply disconnecting the network cable - the agent keeps trying for a very long time before giving up (backoff times keep increasing, and the max retries for network connectivity is 254 by default).
By the way, the sdk.network.iteration.max_retries_on_server_error setting is not actually used by the clearml python package, only by the ClearML enterprise python package.
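For anyone following along, here is a minimal sketch of the kind of retry loop described above (increasing backoff, a capped number of retries). Only the default of 254 retries comes from the message. The function name `call_with_retries`, the doubling formula, the 120-second cap and the `sleep` parameter are all assumptions made for illustration; this is not the actual ClearML agent or SDK code.

```python
import time


def call_with_retries(request, max_retries=254, backoff_factor=1.0,
                      max_backoff=120.0, sleep=time.sleep):
    """Call request() until it succeeds or max_retries retries are used up.

    Hypothetical helper: the backoff formula and the cap are assumptions,
    not ClearML's real implementation.
    """
    attempt = 0
    while True:
        try:
            return request()
        except ConnectionError:
            if attempt >= max_retries:
                # Out of retries: let the last error propagate to the caller.
                raise
            # Wait longer after each failure, up to max_backoff seconds.
            # The exponent is clamped so the delay never overflows.
            delay = min(max_backoff, backoff_factor * (2 ** min(attempt, 16)))
            sleep(delay)
            attempt += 1
```

With these assumed defaults the delays grow 1, 2, 4, ... seconds until they hit the cap, so a process with a few hundred retries would keep trying for hours before giving up, which matches the "very long time" behaviour described above.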
It seems like the worker lost network connectivity, and then aborted the jobs 😞
2024-11-21T06:56:01.958962+00:00 mrl-plswh100 systemd-networkd-wait-online[2279529]: Timeout occurred while waiting for network connectivity.
2024-11-21T06:56:01.976055+00:00 mrl-plswh100 apt-helper[2279520]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)
2024-11-21T06:57:15.810747+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.810789+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.810825+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.810876+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.874338+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.874350+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.874363+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.874375+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.903967+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.903979+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.903991+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.904002+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:16.101657+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:16.101692+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:16.101727+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:16.101761+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:23.867991+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:23.868026+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:23.868061+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:23.868097+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.117654+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.117688+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.117722+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.117756+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.313929+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.313964+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.313999+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.314043+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:25.319252+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:25.319286+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:25.319322+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:25.319364+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:26.322691+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:26.322725+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:26.322759+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:26.322793+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.retry_backoff_factor_sec = 10
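To make a log like the one above easier to read, a small script can group the settings each clearml-agent process printed by its PID, so it is obvious how many separate agent processes started after the network timeout. This is only a sketch: the line layout is inferred from the lines pasted above, and `settings_by_pid` and both regular expressions are hypothetical helpers, not part of ClearML.

```python
import re

# Layout assumed from the pasted log: "<timestamp> <host> <process>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)
# Settings are printed as "sdk.some.key = value"
SETTING_RE = re.compile(r"^(?P<key>sdk\.[\w.]+) = (?P<value>.+)$")


def settings_by_pid(lines):
    """Return {pid: {setting_key: value}} for clearml-agent log lines."""
    result = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("proc") != "clearml-agent":
            continue  # skip lines from other processes or in another layout
        setting = SETTING_RE.match(match.group("msg"))
        if setting:
            per_pid = result.setdefault(match.group("pid"), {})
            per_pid[setting.group("key")] = setting.group("value")
    return result


if __name__ == "__main__":
    import sys

    # Usage (assumed): pipe the syslog/journal text in on stdin.
    for pid, settings in settings_by_pid(sys.stdin).items():
        print(pid, settings)
```

Values are kept as strings exactly as logged, and lines from other processes (such as systemd-networkd-wait-online) are ignored.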