Hi All, What Is The Best Way To Monitor A Failed ClearML Agent That Kills All Tasks In The Queue?

Answered

Hi all,
what is the best way to monitor a failed ClearML agent that kills all the tasks in a queue?

  				
Posted 2 years ago by GaudyPig83


Answers 4

The thing is, the agent does not fail - it's the task setup that fails... One approach is to monitor all tasks handled by that agent (although I'm not sure what rule you would use to decide that something is wrong). Another is to periodically send very short "test" tasks that check a specific setup prerequisite (or all of them), and monitor their status.
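As a rough illustration of the second approach, a "test" task could run a short prerequisite check like the one below. This is only a sketch: it assumes the prerequisite in question is the NVIDIA driver and that a working `nvidia-smi` command is a good enough proxy for it. The function names are made up for this example, and how the script gets enqueued on the agent's queue is left out.

```python
import shutil
import subprocess


def nvidia_driver_available(command="nvidia-smi", timeout=30):
    """Return True if `command` is on PATH and exits with status 0.

    Assumption: a successful `nvidia-smi` run means the driver is usable.
    """
    path = shutil.which(command)
    if path is None:
        # The tool is not installed, so the driver is most likely missing.
        return False
    try:
        result = subprocess.run(
            [path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def main():
    """Return a process exit code: 0 if the check passes, 1 otherwise."""
    return 0 if nvidia_driver_available() else 1


# In the test task's script you would end with: sys.exit(main())
# (assumption: a non-zero exit code gets the task reported as failed,
# which is the status you would then monitor).
```

Keeping the check in a tiny script of its own means a broken machine fails a throwaway task rather than a real one.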

  				
Posted 2 years ago by SuccessfulKoala55

Hi, for example, there is a machine, "yotam-mechine", without the NVIDIA driver installed,
and "yotam-mechine" serves queue "a".
There are 200 tasks in this queue.
So "yotam-mechine" will start a task, and the task will fail.
Then it will pull the next task, which will also fail.
In this way it will kill all the tasks in the queue.

  				
Posted 2 years ago by GaudyPig83

I think you should monitor your tasks and see what's going on. Also, an agent should be set up in a way that you know it will work and that it has all the required drivers, etc.
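One possible rule for that kind of monitoring, as a sketch only: flag any worker whose most recent tasks all failed. The code below works on plain `(worker, status)` pairs, so how you collect them (ClearML SDK, REST API, logs) is up to you. The function name, the threshold of 3, the `"failed"` status string, and the `"other-machine"` worker are assumptions made for illustration.

```python
from collections import defaultdict, deque


def workers_with_failure_streak(records, threshold=3):
    """Return the workers whose last `threshold` tasks all failed.

    `records` is an iterable of (worker, status) pairs, ordered oldest first.
    """
    # Keep only the most recent `threshold` statuses for each worker.
    recent = defaultdict(lambda: deque(maxlen=threshold))
    for worker, status in records:
        recent[worker].append(status)
    return sorted(
        worker
        for worker, statuses in recent.items()
        if len(statuses) == threshold and all(s == "failed" for s in statuses)
    )


history = [
    ("yotam-mechine", "failed"),
    ("other-machine", "completed"),
    ("yotam-mechine", "failed"),
    ("other-machine", "failed"),
    ("yotam-mechine", "failed"),
]
print(workers_with_failure_streak(history))  # ['yotam-mechine']
```

A flagged worker is a candidate for being stopped or removed from the queue before it drains the remaining tasks.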

  				
Posted 2 years ago by CostlyOstrich36

Hi @GaudyPig83, I'm not sure I understand - what do you mean by a failed ClearML agent?

  				
Posted 2 years ago by SuccessfulKoala55
