Answered
Hi all, we have clearml-server running on a kube pod, and then a GPU server running the clearml-agent which we use to queue jobs. For some reason, our kube pod restarted (we're looking into why), but in the process of this happening all jobs on the worker were aborted

Hi all, we have clearml-server running on a kube pod, and then a GPU server running the clearml-agent which we use to queue jobs.

For some reason, our kube pod restarted (we're looking into why), but in the process all jobs on the worker were aborted ("Process terminated by user"). Obviously the first thing we need to fix is the kube issue, but is there any way of making the worker resilient to a network outage or dropped connection, so that currently running jobs are not automatically terminated? Maybe a timeout setting somewhere?

Edit: on closer inspection, the pod did not restart this morning, so something else is causing the processes to terminate
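For context, these are the kinds of retry knobs involved. A sketch of a clearml.conf fragment, assuming these keys are honored by your agent/SDK version (the key names come from the agent's own config printout; values here are illustrative, and some keys may only take effect in certain ClearML editions):

```
# clearml.conf (HOCON) - illustrative fragment, verify against your version
sdk {
    network {
        iteration {
            # retries for an iteration report that hits a server error
            max_retries_on_server_error: 5
            # base backoff (seconds) between retries
            retry_backoff_factor_sec: 10
        }
    }
}
```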

  
  
Posted one month ago

Answers 6


Can you provide a screenshot of the ClearML Task's INFO panel?

  
  
Posted 5 days ago

It seems like the worker lost network connectivity, and then aborted the jobs 😞

2024-11-21T06:56:01.958962+00:00 mrl-plswh100 systemd-networkd-wait-online[2279529]: Timeout occurred while waiting for network connectivity.
2024-11-21T06:56:01.976055+00:00 mrl-plswh100 apt-helper[2279520]: E: Sub-process /lib/systemd/systemd-networkd-wait-online returned an error code (1)
2024-11-21T06:57:15.810747+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.810789+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.810825+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.810876+00:00 mrl-plswh100 clearml-agent[2304481]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.874338+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.874350+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.874363+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.874375+00:00 mrl-plswh100 clearml-agent[2304484]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:15.903967+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:15.903979+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:15.903991+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:15.904002+00:00 mrl-plswh100 clearml-agent[2304489]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:16.101657+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:16.101692+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:16.101727+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:16.101761+00:00 mrl-plswh100 clearml-agent[2304490]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:23.867991+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:23.868026+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:23.868061+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:23.868097+00:00 mrl-plswh100 clearml-agent[2305219]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.117654+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.117688+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.117722+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.117756+00:00 mrl-plswh100 clearml-agent[2305224]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:24.313929+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:24.313964+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:24.313999+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:24.314043+00:00 mrl-plswh100 clearml-agent[2305230]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:25.319252+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:25.319286+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:25.319322+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:25.319364+00:00 mrl-plswh100 clearml-agent[2305229]: sdk.network.iteration.retry_backoff_factor_sec = 10
2024-11-21T06:57:26.322691+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_threads = 4
2024-11-21T06:57:26.322725+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.metrics.file_upload_starvation_warning_sec = 120
2024-11-21T06:57:26.322759+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.max_retries_on_server_error = 5
2024-11-21T06:57:26.322793+00:00 mrl-plswh100 clearml-agent[2305231]: sdk.network.iteration.retry_backoff_factor_sec = 10
  
  
Posted one month ago

@<1724960464275771392:profile|DepravedBee82> the agent (and SDK) will wait quite some time and retry even if the server is not available. The printout you've attached is not something the agent or SDK prints out - is it something your code prints? In general, I can easily test that (and just did 🙂) by running an agent with a task and simply disconnecting the network cable - the agent keeps trying for a very long time before giving up (backoff times keep increasing, and the maximum number of retries for network connectivity is 254 by default).
By the way, sdk.network.iteration.max_retries_on_server_error is not actually used by the open-source clearml Python package, only by the ClearML Enterprise Python package.
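The increasing-backoff behavior described above can be illustrated with a minimal sketch (this is not ClearML's actual retry code; the base, cap, and retry-count values are hypothetical):

```python
def backoff_delays(base_sec=10, retries=8, cap_sec=120):
    """Illustrative exponential backoff schedule: base * 2^i seconds, capped.

    Each retry waits twice as long as the previous one, up to cap_sec,
    which is why a disconnected agent holds on for a long time before
    giving up.
    """
    return [min(base_sec * (2 ** i), cap_sec) for i in range(retries)]

# e.g. backoff_delays(retries=5) -> [10, 20, 40, 80, 120]
```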

  
  
Posted one month ago

Hi all, we're still suffering this issue where jobs are seemingly randomly aborted. The only clue is this in the ClearML logs:

2024-12-13 06:16:30  Process terminated by user

The only pattern we can see is that it typically happens around 6-7am.

Any suggestions on how to debug this would be greatly appreciated!
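Since the aborts cluster around 6-7am, one hypothesis worth checking is a scheduled job (apt-daily, unattended-upgrades, logrotate, and similar) on the agent machine. A rough way to look, using standard systemd tooling (the timestamp below is just the one from the log line above; adjust it to your incident):

```shell
# Show all systemd timers and when they last/next fired - look for anything
# scheduled in the 06:00-07:00 window on the agent machine:
systemctl list-timers --all

# Pull journal entries around the abort time and filter for likely culprits
# (network drops, OOM kills, process terminations):
journalctl --since "2024-12-13 06:00:00" --until "2024-12-13 06:30:00" \
  | grep -iE "network|oom|killed|terminat"
```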

  
  
Posted 9 days ago

For reference, the clearml-agent is running under its own user profile on Ubuntu 24.04 (so that it doesn't run as root, as per previous discussions).

  
  
Posted 9 days ago

Hi @<1523701087100473344:profile|SuccessfulKoala55> thanks for the reply! The output above is from grep -i network /var/log/syslog on the machine running the agent. That's good to hear that clearml is pretty resilient to network outages 🙂 . Do you have any suggestions on how we can start tracking down the cause of this?

This is the only clue that was logged to the console in clearml-server: 2024-11-21 06:57:13 Process terminated by user. The first errors in the agent logs appeared at 06:56:01.

I asked our HPC folks and they were not able to see any obvious network dropouts on other servers in the same location. Our DevOps eng also didn't see anything happen in Kube at that time. Looking at the uptime of the server and the agent pods/machines, neither has rebooted since this issue.

This might be a tricky one to track down since we've only seen it a handful of times...

  
  
Posted one month ago