So, here's a question. Does ClearML automatically save everything necessary to continue training a PyTorch language model? Specifically, I've been looking at the checkpoint folders created when I'm training a Hugging Face RobertaForMaskedLM. I checked what

Unanswered

I think the model state is only saved after the training loop finishes (not inside the loop), no?

trainer_state.json gets updated every time a "checkpoint" gets saved. I've got that set to once an epoch.

My testing indicates that if training gets interrupted, I can resume from a saved checkpoint folder as long as it includes trainer_state.json. The trainer uses that info to determine which data to skip, where to pick back up again, etc.
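For anyone who wants to check what a saved checkpoint would resume from, here is a minimal sketch using only the standard library. It assumes the usual Hugging Face Trainer layout, where each checkpoint lives in a folder named checkpoint-<global_step> inside the output directory and trainer_state.json carries "global_step" and "epoch" keys. The helper names find_latest_checkpoint and read_resume_point are made up for this example and are not part of ClearML or transformers.

```python
import json
import re
from pathlib import Path
from typing import Optional

# Assumed folder naming: checkpoint-<global_step>
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint folder with the highest step number that
    also contains trainer_state.json, or None if there is none."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if not (match and child.is_dir()):
            continue
        if not (child / "trainer_state.json").is_file():
            continue  # no trainer state here, so skip this folder
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, child
    return best_path


def read_resume_point(checkpoint_dir: Path) -> dict:
    """Read the step and epoch recorded in trainer_state.json."""
    state = json.loads((checkpoint_dir / "trainer_state.json").read_text())
    return {"global_step": state.get("global_step"), "epoch": state.get("epoch")}
```

The returned path can then be handed to the Hugging Face Trainer as trainer.train(resume_from_checkpoint=str(path)). transformers also ships a similar folder lookup, transformers.trainer_utils.get_last_checkpoint. Whether ClearML uploads trainer_state.json on its own is exactly the open question here, so this sketch only inspects what is on local disk.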

  				
Posted 3 years ago

					SmallDeer34
				
