Answered
Hi, I Have A Pipeline With Steps Currently Running On-Prem. I Want To Use Autoscaler With Spot Instances To Replace The On-Prem Machine. My Question Regards Identifying A Task Failure Due To Instance Being Terminated Mid-Task. Is There A Way To Differenti

Hi, I have a pipeline with steps currently running on-prem. I want to use the AutoScaler with spot instances to replace the on-prem machine. My question is about identifying a task failure caused by the instance being terminated mid-task. Is there a way to differentiate between a regular task failure and loss of the agent due to instance shutdown? If so, how do I catch it, and where (in the step's retry-on-failure callback, post-execution callback, status-change callback, etc.)? What is the best practice?

  
  
Posted 9 months ago

Answers 3


Hi @<1639799308809146368:profile|TritePigeon86> , if a task and its agent are terminated mid-run, the system has no direct way to know that. The only way to detect it is to enforce a timeout on tasks that have not reported for a given period of time. The ClearML server does have this functionality: tasks that have not reported for a predefined period of time (default is 2 hours) will be marked as aborted (with the non-responsive status in the task status message).
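
To illustrate the distinction described above, here is a minimal sketch of how a monitoring script might tell a lost agent apart from a regular failure. It rests on several assumptions that are not confirmed in this thread: that an aborted task is reported with the status string "stopped", that the watchdog's note contains the text "non-responsive", and that the note is readable from `task.data.status_reason` or `task.data.status_message`. The `TaskState` class and the `classify` and `fetch_state` helpers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Assumption: the server's watchdog note contains this text (case-insensitive).
NON_RESPONSIVE_MARKER = "non-responsive"


@dataclass
class TaskState:
    """Hypothetical container for the fields we inspect on a task."""
    status: str
    status_reason: str = ""
    status_message: str = ""


def classify(state: TaskState) -> str:
    """Return 'agent_lost', 'task_failed', 'aborted', or 'other'."""
    note = f"{state.status_reason} {state.status_message}".lower()
    # Assumption: aborted tasks carry the status string "stopped".
    if state.status == "stopped" and NON_RESPONSIVE_MARKER in note:
        return "agent_lost"   # marked aborted by the server timeout
    if state.status == "failed":
        return "task_failed"  # the task itself reported a failure
    if state.status == "stopped":
        return "aborted"      # aborted for some other reason, e.g. by a user
    return "other"


def fetch_state(task_id: str) -> TaskState:
    """Read the relevant fields from a ClearML task (needs the clearml package)."""
    from clearml import Task  # imported lazily so classify() works without it

    task = Task.get_task(task_id=task_id)
    # Assumption: the reason and message fields are exposed on task.data.
    return TaskState(
        status=str(task.get_status()),
        status_reason=getattr(task.data, "status_reason", "") or "",
        status_message=getattr(task.data, "status_message", "") or "",
    )
```

A script could call `classify(fetch_state(task_id))` on each pipeline step after the timeout has elapsed and treat only `agent_lost` as a candidate for re-running; a regular `task_failed` result would be handled separately.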

  
  
Posted 9 months ago

Hi @<1639799308809146368:profile|TritePigeon86> , apologies for missing this!
See configuration section here: None

  
  
Posted 8 months ago

@<1523701087100473344:profile|SuccessfulKoala55> great! So that means it is possible to catch tasks with status aborted and reason non-responsive, and retry them so they go back to the queue? Also, how do I change the timeout in the ClearML server?

  
  
Posted 9 months ago
668 Views