Unanswered
So, here's a question. Does ClearML automatically save everything necessary to continue training a PyTorch language model? Specifically, I've been looking at the checkpoint folders created when I'm training a Hugging Face RobertaForMaskedLM. I checked what
I think the model state is only captured after the training loop finishes (not inside the loop), no?
trainer_state.json gets updated every time a "checkpoint" gets saved. I've got that set to once an epoch.
My testing indicates that if training gets interrupted, I can resume from a saved checkpoint folder as long as it includes trainer_state.json. The trainer uses that info to determine which data to skip, where to pick back up again, etc.
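For anyone who wants to check this on their own run, here is a rough sketch of how you might locate the newest checkpoint folder that can actually be resumed from. It assumes the default Hugging Face Trainer layout, where checkpoints are written as `checkpoint-<step>` subfolders of the output directory and `trainer_state.json` carries `global_step` and `epoch`. The helper names `find_resumable_checkpoint` and `read_progress` are made up for illustration and are not part of ClearML or transformers.

```python
import json
import re
from pathlib import Path
from typing import Optional

# Assumed folder naming: "checkpoint-<global_step>" (the Trainer default).
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_resumable_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the highest-step checkpoint folder that contains trainer_state.json.

    Hypothetical helper: folders without trainer_state.json are skipped,
    since the resume logic needs that file to know where to pick up.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir() and (child / "trainer_state.json").is_file():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Pick the folder with the largest step number.
    return max(candidates, key=lambda item: item[0])[1]


def read_progress(checkpoint_dir: Path) -> dict:
    """Read the step and epoch recorded in a checkpoint's trainer_state.json."""
    with open(checkpoint_dir / "trainer_state.json", encoding="utf-8") as fh:
        state = json.load(fh)
    return {"global_step": state.get("global_step"), "epoch": state.get("epoch")}


# Usage sketch ("output" is a placeholder path, and `trainer` is assumed to be
# an already constructed transformers.Trainer):
#
# checkpoint = find_resumable_checkpoint("output")
# if checkpoint is not None:
#     print(read_progress(checkpoint))
#     trainer.train(resume_from_checkpoint=str(checkpoint))
```

This only tells you which folder is resumable on disk. Whether ClearML also uploads every file in that folder is a separate question that this sketch does not answer.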
3 years ago
one year ago