I was wondering how (or if) you handle interruptions.
Good question. Basically (I might be missing a few details, but I think this is the general gist):
A new instance will be spun up (spot or regular, based on your "compute budget") as long as there is a job in the "monitored" queue. That means that if a worker was kicked by Amazon (i.e. it was a spot instance), another one will be spun up in its place, but only as long as there is a job in the queue. So what is probably missing in your case is a service that checks whether a Task was aborted and then re-enqueues it to the same queue (which will trigger the auto scaler to spin up a new instance if needed).
Make sense?
Yeah, totally. Are there any services OOB like this?
In the open-source version, I can't recall any, but it would probably be easy to write. The paid tier might have an offering though, not sure 🙂
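For reference, here is a rough sketch of what such a re-enqueue service could look like. It is not an existing service or a real client API: `TaskBackend`, `InMemoryBackend`, `requeue_aborted`, and `max_retries` are all hypothetical names made up for illustration, and you would need to map `list_aborted` / `enqueue` onto whatever calls your task server actually exposes. It also assumes every aborted Task should be retried, so it cannot tell a spot interruption apart from a Task someone aborted on purpose.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class TaskBackend(Protocol):
    """Hypothetical interface; map these onto your real task server client."""

    def list_aborted(self, queue: str) -> List[str]: ...
    def enqueue(self, task_id: str, queue: str) -> None: ...


@dataclass
class InMemoryBackend:
    """Toy stand-in for a task server, only here to make the sketch runnable."""

    status: Dict[str, str] = field(default_factory=dict)  # task_id -> status
    origin: Dict[str, str] = field(default_factory=dict)  # task_id -> queue it ran from
    queues: Dict[str, List[str]] = field(default_factory=dict)

    def list_aborted(self, queue: str) -> List[str]:
        return [
            task_id
            for task_id, state in self.status.items()
            if state == "aborted" and self.origin.get(task_id) == queue
        ]

    def enqueue(self, task_id: str, queue: str) -> None:
        self.queues.setdefault(queue, []).append(task_id)
        self.status[task_id] = "queued"


def requeue_aborted(
    backend: TaskBackend, queue: str, retries: Dict[str, int], max_retries: int = 3
) -> List[str]:
    """Put aborted tasks back on the queue they came from; return their ids."""
    requeued = []
    for task_id in backend.list_aborted(queue):
        # Assumed safety valve: stop retrying a task that keeps getting aborted.
        if retries.get(task_id, 0) >= max_retries:
            continue
        backend.enqueue(task_id, queue)
        retries[task_id] = retries.get(task_id, 0) + 1
        requeued.append(task_id)
    return requeued


def run_forever(backend: TaskBackend, queue: str, poll_seconds: float = 60.0) -> None:
    """Poll loop: a non-empty queue is what makes the auto scaler spin up a worker."""
    retries: Dict[str, int] = {}
    while True:
        requeue_aborted(backend, queue, retries)
        time.sleep(poll_seconds)
```

The retry cap is there so a Task that crashes every time does not keep the auto scaler spinning up instances forever. In practice you would probably also want to skip Tasks that a user aborted manually, if your server records who aborted them.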