Hi good folks here! Does ClearML allow auto-rerun of failed jobs, for example when a spot instance gets interrupted, please? (Or auto-resume, if checkpointing logic is in place.)
Hi @SolidGoose91, I think this capability exists when running pipelines. The pipeline controller will detect tasks that failed because their spot instance was interrupted and will retry them.
Are you using the PRO or the open source auto scaler?
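To make the rerun-plus-resume idea concrete, here is a minimal sketch in plain Python. It does not call ClearML at all: `SpotInterruption`, `train`, `run_with_retries`, and `demo` are hypothetical names made up for illustration, and the JSON checkpoint file is just a stand-in for whatever checkpointing logic your job already has. As far as I know, the ClearML `PipelineController` has a `retry_on_failure` argument that plays the role of the retry loop here, but please check the docs for your version.

```python
import json
import os
import tempfile


class SpotInterruption(Exception):
    """Hypothetical stand-in for a job killed when a spot instance is reclaimed."""


def train(checkpoint_path, total_steps, interrupt_at=None):
    """Toy job: counts up to total_steps, saving progress after every step."""
    step = 0
    if os.path.exists(checkpoint_path):
        # Resume from the last saved step instead of starting over.
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            raise SpotInterruption(f"interrupted at step {step}")
        step += 1
        with open(checkpoint_path, "w") as f:
            json.dump({"step": step}, f)
    return step


def run_with_retries(job, max_retries=3):
    """Call job(attempt) until it succeeds or max_retries reruns are used up."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(attempts), attempts
        except SpotInterruption:
            if attempts > max_retries:
                raise


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = os.path.join(tmp, "ckpt.json")

        def job(attempt):
            # Only the first attempt is interrupted (at step 4).
            return train(ckpt, 10, interrupt_at=4 if attempt == 1 else None)

        return run_with_retries(job)
```

Calling `demo()` runs the toy job twice: the first attempt is interrupted, and the rerun picks up from the saved checkpoint rather than from step 0. The point is that retrying and resuming are separate concerns: the orchestrator only needs to rerun the job, and the job itself decides where to continue from.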