I assume you’re using a self-hosted server?
Yes
Hi JitteryCoyote63,
So the clearml-server asked the clearml-agent to stop the task because it didn't get anything from it for a long time?
Seems so - there's a "non-responsive tasks" watchdog on the server in charge of doing exactly that. I assume you're using a self-hosted server?
Well, if the task was indeed running, it's strange that it was stopped: tasks have a thread in charge of pinging the server so the server knows they're still running. Maybe there was some network issue?
In any case, the watchdog threshold can be controlled using the services.tasks.non_responsive_tasks_watchdog.threshold_sec server configuration setting (default is 7200 seconds, i.e. 2 hours).
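To make the threshold concrete, here is a rough sketch of the kind of check such a watchdog performs. This is not the actual clearml-server code: the function name find_non_responsive, the last_seen mapping and the task IDs are all made up for illustration, and only the 7200-second default comes from the setting above.

```python
from datetime import datetime, timedelta, timezone

# Default value mentioned above for
# services.tasks.non_responsive_tasks_watchdog.threshold_sec
THRESHOLD_SEC = 7200


def find_non_responsive(last_seen, now, threshold_sec=THRESHOLD_SEC):
    """Return the IDs of tasks whose last ping is older than the threshold.

    Hypothetical helper: `last_seen` maps a task ID to the time of its
    most recent ping, as the server would have recorded it.
    """
    cutoff = now - timedelta(seconds=threshold_sec)
    return sorted(task_id for task_id, seen in last_seen.items() if seen < cutoff)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "task_a": now - timedelta(minutes=5),  # pinged recently, left alone
    "task_b": now - timedelta(hours=3),    # silent for 3 hours, flagged
}
print(find_non_responsive(last_seen, now))  # ['task_b']
```

With the default, a task has to be silent for more than two hours before it is flagged, so a short network blip alone would not normally be enough to get it stopped.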
Thanks! I will investigate further. I suspect the AWS instance got stuck for an unknown reason (becoming unhealthy).