Answered

Hey, I want to use the AWS autoscaler with spot instances, and I was wondering how (or if) you handle interruptions. What we currently implemented is a mechanism that on spot failure reruns the training with a flag, and our code knows to search for the latest checkpoint and resume from it. But this, of course, is not on ClearML. Do you have any handling, or any way we can connect the two systems?

  
  
Posted 3 years ago

Answers 3


Are there any services OOB like this?

On the open-source side, I can't recall any, but it would probably be easy to write one. The paid tier might have an offering, though I'm not sure 🙂

  
  
Posted 3 years ago

Yeah, totally. Are there any out-of-the-box (OOB) services like this?

  
  
Posted 3 years ago

Hi CleanPigeon16

I was wondering how (or if) you handle interruptions.

Good question. Basically (I might be missing a few details, but this is the general gist):
A new instance will be spun up (spot or regular, based on your "compute budget") as long as there is a job in the "monitored" queue. That means that if a worker is kicked by Amazon (i.e. it is a spot instance), another one will be spun up in its place, as long as there is a job in the queue. So what is probably missing in your case is a service that checks whether a Task was aborted and then re-enqueues it to the same queue (which will trigger the autoscaler to spin up a new instance if needed).
Does that make sense?
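
To make the "re-enqueue service" idea concrete, here is a rough sketch, not an official ClearML service. The function names, the "auto-requeue" tag, the retry limit, and the polling interval are all made up for illustration. It assumes aborted tasks show up with status "stopped" and that the server accepts re-enqueueing them as-is, so please check the `Task.get_tasks` and `Task.enqueue` calls against your installed clearml version. Resuming from the latest checkpoint is still up to your training code.

```python
import time
from typing import Callable, Dict, Iterable, List


def requeue_aborted(
    aborted_ids: Iterable[str],
    enqueue: Callable[[str], None],
    retries: Dict[str, int],
    max_retries: int = 3,
) -> List[str]:
    """Re-enqueue each aborted task at most max_retries times.

    Returns the IDs that were re-enqueued on this pass.
    """
    requeued = []
    for task_id in aborted_ids:
        if retries.get(task_id, 0) >= max_retries:
            continue  # give up on tasks that keep getting aborted
        enqueue(task_id)
        retries[task_id] = retries.get(task_id, 0) + 1
        requeued.append(task_id)
    return requeued


def run_with_clearml(
    project: str,
    queue: str,
    tag: str = "auto-requeue",  # hypothetical opt-in tag, not a ClearML built-in
    poll_sec: float = 60.0,
) -> None:
    """Poll ClearML forever and push aborted, tagged tasks back to the queue."""
    # Imported lazily so requeue_aborted() can be used without clearml installed.
    from clearml import Task

    retries: Dict[str, int] = {}

    def list_aborted() -> List[str]:
        # Assumption: an aborted task is reported with status "stopped".
        tasks = Task.get_tasks(
            project_name=project,
            tags=[tag],
            task_filter={"status": ["stopped"]},
        )
        return [t.id for t in tasks]

    def enqueue(task_id: str) -> None:
        # Assumption: the server allows enqueueing a stopped task without a reset.
        Task.enqueue(task=task_id, queue_name=queue)

    while True:
        requeue_aborted(list_aborted(), enqueue, retries)
        time.sleep(poll_sec)
```

Only tasks carrying the opt-in tag are touched, so a task you abort by hand is not sent back to the queue unless you tagged it. The retry counter lives in memory, so it resets if the service itself restarts.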

  
  
Posted 3 years ago