Hi, has anyone experimented with running the AWS autoscaler on spot instances, and do you know whether there is a configuration for a retry policy when a spot instance gets evacuated mid-job?
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, and scheduler state every epoch.
2. Run a separate thread that periodically polls the instance metadata and checks whether the instance is marked for stop. If it is, add a custom tag to the experiment, e.g. TO_RESUME.
3. Run a service that periodically pulls failed experiments with the TO_RESUME tag from the queue, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint passed as an extra parameter.
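For the metadata-polling thread, here is a minimal sketch of what I mean, not the exact code I run. It assumes the EC2 instance metadata service (IMDSv2) and its `spot/instance-action` path, which returns 404 until an interruption is scheduled. The names `fetch_spot_action`, `watch_for_interruption`, and `on_interrupt` are made up for illustration; `on_interrupt` is where you would add the TO_RESUME tag through whatever API your experiment tracker offers.

```python
import json
import threading
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"  # EC2 instance metadata service


def fetch_spot_action(timeout=2.0):
    """Return the pending spot action as a dict, or None if none is scheduled."""
    # IMDSv2: request a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise


def watch_for_interruption(on_interrupt, fetch=fetch_spot_action,
                           interval=5.0, stop_event=None):
    """Poll `fetch` in a daemon thread; call `on_interrupt(action)` once, then exit."""
    stop_event = stop_event or threading.Event()

    def loop():
        while not stop_event.is_set():
            try:
                action = fetch()
            except (OSError, ValueError):
                action = None  # metadata unreachable or malformed; retry next tick
            if action is not None:
                on_interrupt(action)  # hypothetical hook: tag the experiment here
                return
            stop_event.wait(interval)  # sleeps, but wakes early if stopped

    thread = threading.Thread(target=loop, daemon=True, name="spot-watcher")
    thread.start()
    return thread, stop_event
```

You would start it once at the beginning of training, e.g. `watch_for_interruption(lambda action: tag_experiment("TO_RESUME"))`, where `tag_experiment` stands in for your own tagging call. Keep the polling interval at a few seconds, since the interruption notice only arrives shortly before the instance is reclaimed.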
2 years ago
one year ago