The thing is, the agent does not fail; it's the task setup that fails. One approach is to monitor all tasks handled by that agent (although I'm not sure what rule you would use to decide that something is wrong). Another is to periodically send very short "test" tasks that check a specific setup prerequisite (or all of them), and monitor their status.
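To make the first approach a bit more concrete, here is a minimal sketch of one possible rule: treat a worker as suspect once its most recent tasks have all failed. The function name, the threshold of 3, and the idea of passing in a newest-first list of status strings are assumptions for illustration; how you collect those statuses for a given worker is up to your own monitoring.

```python
from typing import Iterable

# Hypothetical threshold: flag a worker after this many failures in a row.
FAILURE_STREAK_LIMIT = 3


def agent_looks_broken(recent_statuses: Iterable[str],
                       limit: int = FAILURE_STREAK_LIMIT) -> bool:
    """Return True if the newest `limit` tasks on a worker all failed.

    `recent_statuses` must be ordered newest first, e.g.
    ["failed", "failed", "completed"].
    """
    streak = 0
    for status in recent_statuses:
        if status != "failed":
            # A non-failed task ends the streak; the worker did real work.
            return False
        streak += 1
        if streak >= limit:
            return True
    # Fewer than `limit` tasks seen so far: not enough evidence.
    return False
```

A streak-based rule like this would catch the "every task fails in seconds" pattern without flagging a worker for a single bad task, but the right threshold depends on how often your tasks fail for legitimate reasons.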
Hi, for example, there is a machine called "yotam-machine" that does not have the NVIDIA driver installed,
and "yotam-machine" is serving queue "a".
There are 200 tasks in this queue.
So "yotam-machine" will start a task, and it will fail.
Then it will get the next task, and that one will also fail.
In the end it will kill all the tasks in the queue.
I think you should monitor your tasks and see what's going on. Also, an agent should be set up in a way that you know it will work and has all the required drivers, etc.
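As a rough illustration of checking the setup before the agent starts pulling tasks, the sketch below tests whether `nvidia-smi` is on the PATH and exits cleanly. The function name and the 30-second timeout are assumptions, and a successful `nvidia-smi` run is only a proxy for a working driver, not a full validation of the machine.

```python
import shutil
import subprocess


def nvidia_driver_available(which=shutil.which, run=subprocess.run) -> bool:
    """Return True if `nvidia-smi` is found on PATH and exits with code 0.

    `which` and `run` are injectable so the check can be tested without a GPU.
    """
    path = which("nvidia-smi")
    if path is None:
        # The tool is not installed, so the driver is most likely missing.
        return False
    try:
        result = run([path], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # The binary could not be executed or hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Exit non-zero so a startup script can refuse to launch the agent.
    raise SystemExit(0 if nvidia_driver_available() else 1)
```

Running something like this in the machine's startup script, and only starting the agent when it succeeds, would keep a machine without drivers from draining the queue.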
Hi @<1539780272512307200:profile|GaudyPig83> , I'm not sure I understand - what do you mean by a failed ClearML agent?