Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Unanswered
Hi All, We Have Clearml-Server Running On A Kube Pod, And Then A Gpu Server Running The Clearml-Agent Which We Use To Queue Jobs. For Some Reason, Our Kube Pod Restarted (We'Re Looking Into Why), But In The Process Of This Happening All Jobs On The Worke


It seems like the worker lost network connectivity, and then aborted the jobs 😞

2024-11-21T06:56:01.958962+00:00 mrl-plswh100 systemd-networkd-wait-online[2279529]: Timeout occurred while waiting for network connectivity.
2024-11-21T06:56:01.976055+00:00 mrl-plswh100 apt-helper[2279520]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)
2024-11-21T06:57:15.810747+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.810789+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.810825+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.810876+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.874338+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.874350+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.874363+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.874375+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.903967+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.903979+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.903991+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.904002+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:16.101657+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:16.101692+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:16.101727+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:16.101761+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:23.867991+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:23.868026+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:23.868061+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:23.868097+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.117654+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.117688+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.117722+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.117756+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.313929+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.313964+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.313999+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.314043+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:25.319252+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:25.319286+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:25.319322+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:25.319364+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:26.322691+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:26.322725+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:26.322759+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:26.322793+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.retry_backoff_factor_sec = 10
  
  
Posted one day ago
1 Views
0 Answers
one day ago
12 hours ago