Hi JitteryCoyote63,
The clearml-server asked the clearml-agent to stop the task because it didn't receive anything from it for a long time?
Seems so - there's a "non-responsive tasks" watchdog on the server in charge of doing exactly that. I assume you're using a self-hosted server?
In any case, the watchdog can be controlled using the services.tasks.non_responsive_tasks_watchdog.threshold_sec server configuration setting (the default is 7200 seconds, i.e. two hours).
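For example, if you're running the standard docker-compose deployment, you could raise the threshold with an environment override on the apiserver container - something along these lines (the exact override mechanism may depend on your server version, so treat this as a sketch):
```
# Hypothetical apiserver environment override - verify against your server version.
# Double underscores map to the configuration path; the value is in seconds (10 hours here).
CLEARML__services__tasks__non_responsive_tasks_watchdog__threshold_sec=36000
```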
Thanks! I will investigate further; I am thinking the AWS instance might have been stuck for some unknown reason (becoming unhealthy)
I assume you’re using a self-hosted server?
Yes
Well, if the task was indeed running, it's strange that it was stopped: each task runs a thread in charge of pinging the server, to make sure the server knows it's still alive. So maybe there was some network issue?
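If it happens again, you can check when the server last heard from the task. A minimal sketch using the clearml Python SDK (the last_update field access is my assumption here, so verify it against your SDK version):
```python
from clearml import Task

# Fetch the task by ID (replace with the ID of the aborted task)
task = Task.get_task(task_id="<your-task-id>")

# Current status as recorded on the server (e.g. "stopped", "failed")
print("status:", task.get_status())

# Last time the server received an update/ping from the task;
# a large gap here before the abort would point at a network or instance issue
print("last update:", task.data.last_update)
```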