Hi @<1523701087100473344:profile|SuccessfulKoala55> thanks for the reply! The output above is from grep -i network /var/log/syslog
on the machine running the agent. That's good to hear that clearml is pretty resilient to network outages 🙂 . Do you have any suggestions on how we can start tracking down the cause of this?
This is the only clue that was logged to the console in clearml server: 2024-11-21 06:57:13 Process terminated by user
. The first errors on the agent logs appeared at 06:56:01.
I asked our HPC folks and they were not able to see any obvious network dropouts on other servers in the same location. Our DevOps eng also didn't see anything happen in Kube at that time. Looking at the uptime of the server & and the agent pods/machines, neither have rebooted in the time since this issue.
This might be a tricky one to track down since we've only seen it a handful of times...