Hi, I Have A Pipeline With Steps Currently Running On-Prem. I Want To Use Autoscaler With Spot Instances To Replace The On-Prem Machine. My Question Regards Identifying A Task Failure Due To Instance Being Terminated Mid-Task. Is There A Way To Differenti

Answered

Hi, I have a pipeline whose steps currently run on-prem. I want to use the AutoScaler with spot instances to replace the on-prem machine. My question is about identifying a task failure caused by the instance being terminated mid-task. Is there a way to differentiate between a regular task failure and the loss of the agent due to instance shutdown? If so, how do I catch it, and where (in the step retry on failure, post execution, status change execution, etc.)? What is the best practice?

  				
Posted one year ago by TritePigeon86


Answers 3

Hi TritePigeon86, apologies for missing this!
See configuration section here: None

  				
Posted 11 months ago by SuccessfulKoala55

SuccessfulKoala55 great! So that means it is possible to catch tasks with status aborted and reason non-responsive, and retry them so they go back to the queue? Also, how do I change the timeout in the ClearML server?

  				
Posted one year ago by TritePigeon86

Hi TritePigeon86, if a task (and its agent) is terminated mid-run, the system has no direct way to know that. It can only infer it by enforcing a timeout on tasks that have not reported for a given period of time. The ClearML server has this functionality: tasks that have not reported for a predefined period (default is 2 hours) are marked as aborted, with "non-responsive" noted in the task status message.
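
To make the distinction concrete, here is a rough sketch of how a caller might tell a lost agent apart from a regular failure once it has a task's status and status message in hand. It is plain Python and does not call the ClearML SDK. The status strings, the "non-responsive" marker, the `TaskState` class, and both helper functions are assumptions made for illustration, so check them against what your server actually reports before relying on them.

```python
from dataclasses import dataclass

# Assumed values, not confirmed in this thread: verify against your server.
ABORTED_STATUSES = {"stopped", "aborted"}
NON_RESPONSIVE_MARKER = "non-responsive"


@dataclass
class TaskState:
    """Hypothetical snapshot of the two fields read from a task."""
    status: str
    status_message: str = ""


def lost_agent(state: TaskState) -> bool:
    """Return True if the task looks like it was aborted by the server timeout."""
    status = state.status.strip().lower()
    message = (state.status_message or "").lower()
    # Treat "non responsive" and "non_responsive" the same as "non-responsive".
    normalized = message.replace("_", "-").replace(" ", "-")
    return status in ABORTED_STATUSES and NON_RESPONSIVE_MARKER in normalized


def should_requeue(state: TaskState, retries: int, max_retries: int = 3) -> bool:
    """Requeue only lost-agent aborts, and only up to max_retries times."""
    return retries < max_retries and lost_agent(state)
```

A regular failure (for example, status "failed" with a traceback in the message) returns False from `lost_agent`, so only tasks that timed out as non-responsive would be sent back to the queue. Because detection depends on the server timeout, a task lost to a spot termination would only show up this way after that period has elapsed.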

  				
Posted one year ago by SuccessfulKoala55


866 Views
