Hi All, We Have Clearml-Server Running On A Kube Pod, And Then A Gpu Server Running The Clearml-Agent Which We Use To Queue Jobs. For Some Reason, Our Kube Pod Restarted (We'Re Looking Into Why), But In The Process Of This Happening All Jobs On The Worke

Unanswered

Hi @<1523701087100473344:profile|SuccessfulKoala55> thanks for the reply! The output above is from grep -i network /var/log/syslog on the machine running the agent. That's good to hear that clearml is pretty resilient to network outages 🙂 . Do you have any suggestions on how we can start tracking down the cause of this?

This is the only clue that was logged to the console in clearml server: 2024-11-21 06:57:13 Process terminated by user . The first errors on the agent logs appeared at 06:56:01.

I asked our HPC folks and they were not able to see any obvious network dropouts on other servers in the same location. Our DevOps eng also didn't see anything happen in Kube at that time. Looking at the uptime of the server & and the agent pods/machines, neither have rebooted in the time since this issue.

This might be a tricky one to track down since we've only seen it a handful of times...

  				
Posted 
	10 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

177 Views

0 Answers

10 months ago