Hi All, What Is The Best Way To Monitor A Failing ClearML Agent That Kills All Tasks In A Queue?

Answered

Hi all,
what is the best way to monitor a failing ClearML agent that kills all the tasks in a queue?

  				
Posted one year ago by GaudyPig83


Answers 4

The thing is, the agent does not fail; it's the task setup that fails. One approach is to monitor all tasks handled by that agent (although I'm not sure what rule you would use to decide that something is wrong). Another is to periodically send short "test" tasks that check a specific setup prerequisite (or all of them), and monitor their status.
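A minimal sketch of one possible rule for the first approach: flag a worker once its most recent tasks have all failed in a row. It assumes you have already collected one record per task (task id, worker name, status) from the server, ordered oldest to newest; how you fetch them is left out. `TaskRecord`, `consecutive_failures`, `suspicious_workers` and the threshold of 3 are hypothetical names and values for illustration, not part of ClearML. Only the "failed" status string is meant to match the ClearML task status.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRecord:
    # Hypothetical record type; fill it from whatever task listing you use.
    task_id: str
    worker: str
    status: str  # e.g. "completed" or "failed"


def consecutive_failures(records: List[TaskRecord], worker: str) -> int:
    """Count how many of the worker's most recent tasks failed in a row.

    `records` must be ordered oldest to newest.
    """
    count = 0
    for rec in reversed([r for r in records if r.worker == worker]):
        if rec.status != "failed":
            break
        count += 1
    return count


def suspicious_workers(records: List[TaskRecord], threshold: int = 3) -> List[str]:
    """Return workers whose latest `threshold` (or more) tasks all failed."""
    workers = sorted({r.worker for r in records})
    return [w for w in workers if consecutive_failures(records, w) >= threshold]
```

A worker returned by `suspicious_workers` is a candidate for being stopped or removed from the queue before it goes through the remaining tasks. The threshold is a judgment call: too low and a single bad task flags a healthy machine, too high and many tasks fail before anyone notices.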

  				
Posted one year ago by SuccessfulKoala55

Hi @GaudyPig83, I'm not sure I understand. What do you mean by a failed ClearML agent?

  				
Posted one year ago by SuccessfulKoala55

I think you should monitor your tasks and see what's going on. Also, an agent should be set up in a way that you know it will work and has all the required drivers, etc.
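One way to check that a machine has the required drivers before it serves a queue is a short pre-flight script, which could also be run as a very short test task. The sketch below is only an illustration: it assumes the NVIDIA driver can be detected by `nvidia-smi` being on the PATH and exiting with code 0, and the function names `missing_prerequisites` and `run_preflight` are hypothetical.

```python
import shutil
import subprocess
from typing import List


def missing_prerequisites(required_commands: List[str]) -> List[str]:
    """Return the commands from `required_commands` that are not on PATH."""
    return [cmd for cmd in required_commands if shutil.which(cmd) is None]


def run_preflight() -> int:
    """Return 0 if the GPU driver check passes, a non-zero code otherwise."""
    missing = missing_prerequisites(["nvidia-smi"])
    if missing:
        print("missing prerequisites:", ", ".join(missing))
        return 1
    # Assumption: nvidia-smi exits with a non-zero code when it cannot
    # communicate with the driver.
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    if result.returncode != 0:
        print("nvidia-smi failed:", result.stderr.strip())
    return result.returncode
```

Calling `sys.exit(run_preflight())` at the end of the script makes the process fail when the check fails, so the result is visible both when run by hand and when run as a test task.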

  				
Posted one year ago by CostlyOstrich36

Hi, for example, there is a machine, "yotam-machine", without the NVIDIA driver,
and "yotam-machine" serves queue "a".
There are 200 tasks in this queue.
So "yotam-machine" will start a task, and it will fail.
Then it will get the next task, which will also fail.
In this way it will kill all the tasks in the queue.

  				
Posted one year ago by GaudyPig83
