Hi @<1639799308809146368:profile|TritePigeon86> , if a task (and its agent) is terminated mid-run, the system has no direct way of knowing that; it can only enforce a timeout on tasks that have not reported for a given period of time. The ClearML server does have this functionality: tasks that have not reported for a predefined period (default is 2 hours) will be marked as aborted (with the non-responsive status in the task status message).
@<1523701087100473344:profile|SuccessfulKoala55> great! So that means it is possible to catch tasks with status aborted and reason non-responsive, and retry them so they go back to the queue? Also, how do I change the timeout in the ClearML server?
Hi @<1639799308809146368:profile|TritePigeon86> , apologies for missing this!
See configuration section here: None
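On the retry question, here is a rough sketch of the selection logic only, not an official ClearML recipe. `TaskRecord`, `find_non_responsive` and `retry_non_responsive` are hypothetical names made up for illustration. The sketch assumes aborted tasks are reported with the status string "stopped" and that the watchdog's status message contains "non-responsive"; both should be checked against your server before relying on them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TaskRecord:
    """Hypothetical, simplified view of a task (not a ClearML class)."""
    task_id: str
    status: str          # assumption: aborted tasks show up as "stopped"
    status_message: str  # assumption: watchdog note mentions "non-responsive"


def find_non_responsive(tasks: Iterable[TaskRecord],
                        aborted_status: str = "stopped",
                        marker: str = "non-responsive") -> List[TaskRecord]:
    """Return tasks that were aborted and whose message contains the marker."""
    needle = marker.lower()
    return [
        t for t in tasks
        if t.status == aborted_status
        and needle in (t.status_message or "").lower()
    ]


def retry_non_responsive(tasks: Iterable[TaskRecord],
                         enqueue: Callable[[TaskRecord], None]) -> List[str]:
    """Call `enqueue` for every non-responsive task; return the retried IDs."""
    retried = []
    for task in find_non_responsive(tasks):
        enqueue(task)  # caller decides how a task is put back in a queue
        retried.append(task.task_id)
    return retried
```

In a real setup the records would come from the server, and `enqueue` would wrap whatever call puts a task back in a queue (the SDK's `Task.enqueue` looks like the natural candidate, but verify it against the SDK reference). Running this periodically, for example from a small scheduled script, would then push timed-out tasks back to the queue.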