Hi all, we have clearml-server running on a Kube pod, and a GPU server running the clearml-agent, which we use to queue jobs.
For some reason our Kube pod restarted (we're looking into why), but in the process of this happening all jobs on the worke
It seems like the worker lost network connectivity, and then aborted the jobs 😞
2024-11-21T06:56:01.958962+00:00 mrl-plswh100 systemd-networkd-wait-online[2279529]: Timeout occurred while waiting for network connectivity.
2024-11-21T06:56:01.976055+00:00 mrl-plswh100 apt-helper[2279520]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)
2024-11-21T06:57:15.810747+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.810789+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.810825+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.810876+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.874338+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.874350+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.874363+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.874375+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.903967+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.903979+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.903991+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.904002+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:16.101657+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:16.101692+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:16.101727+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:16.101761+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:23.867991+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:23.868026+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:23.868061+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:23.868097+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.117654+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.117688+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.117722+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.117756+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.313929+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.313964+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.313999+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.314043+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:25.319252+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:25.319286+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:25.319322+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:25.319364+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:26.322691+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:26.322725+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:26.322759+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:26.322793+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.retry_backoff_factor_sec = 10