Does ClearML automatically save everything necessary to continue training a PyTorch language model?

Answered

So, here's a question. Does ClearML automatically save everything necessary to continue training a PyTorch language model? Specifically, I've been looking at the checkpoint folders created when I'm training a Hugging Face RobertaForMaskedLM. I checked what files are being saved in each checkpoint folder, and I find that the following files are different each time:
rng_state.pth scheduler.pt trainer_state.json pytorch_model.bin optimizer.pt
And these are the ones that ClearML is saving:
rng_state.pth scheduler.pt training_args.bin pytorch_model.bin optimizer.pt
So it's saving training_args.bin, but not saving trainer_state.json.

...is that OK?

  				
Posted 3 years ago by SmallDeer34

Answers 28

OK, neat! Any advice on how to edit the training loop to do that? Because the code I'm using doesn't offer easy access to the training loop, see here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L469

trainer.train() just does the training loop automagically, and saves a checkpoint once in a while. When it saves a checkpoint, clearML uploads all the other files. How can I hook into... whatever triggers that, and upload this file also?

  				
Posted 3 years ago by SmallDeer34

Oh, interesting

  				
Posted 3 years ago by SmallDeer34

If you cannot change the "TrainerState" (i.e. inherit and pass it into the code),
you could also monkey-patch it, something like:
` class OurTrainerState(TrainerState):
    def __init__(...):
        ...

    @classmethod
    def load_from_json(cls, json_path: str):
        state = super().load_from_json(json_path)
        Task.current_task().upload_artifact(...)
        return state

trainer.state = OurTrainerState(trainer.state) `

  				
Posted 3 years ago by AgitatedDove14

My other question is: how does it decide what to upload automatically? It picked up almost everything, just not trainer_state.json. Which I'm actually not quite sure is necessary

  				
Posted 3 years ago by SmallDeer34

Oh, here's an example, a screenshot I took of the files in my Colab instance:

  				
Posted 3 years ago by SmallDeer34

training loop is within line 469, I think.

I think the model state is just post training loop (not inside the loop), no?

  				
Posted 3 years ago by AgitatedDove14

I'll give it a shot!

  				
Posted 3 years ago by SmallDeer34

What I'm curious about is how ClearML hooks into that to know to upload the other artifacts such as optimizer.pt.

  				
Posted 3 years ago by SmallDeer34

Sorry, you are correct, this is where the JSON is created:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L470

The other links are the functions calling it. Does that make sense?

  				
Posted 3 years ago by AgitatedDove14

OK, I added

Task.current_task().upload_artifact(name='trainer_state', artifact_object=os.path.join(output_dir, "trainer_state.json"))

after this line:

https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer.py#L1531

And it seems to be working.

  				
Posted 3 years ago by SmallDeer34

lol TIL: https://stackoverflow.com/questions/5626193/what-is-monkey-patching

  				
Posted 3 years ago by SmallDeer34

Which is defined, it seems, here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_tf.py#L459

  				
Posted 3 years ago by SmallDeer34

I think the model state is just post training loop (not inside the loop), no?

trainer_state.json gets updated every time a "checkpoint" gets saved. I've got that set to once an epoch.

My testing indicates that if the training gets interrupted, I can resume training from a saved checkpoint folder that includes trainer_state.json. It uses the info to determine which data to skip, where to pick back up again, etc
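To see what a given checkpoint would resume from, the state file can be inspected directly. This is a minimal sketch, assuming trainer_state.json contains `global_step` and `epoch` keys; the helper name `read_resume_point` is hypothetical and not part of transformers or ClearML.

```python
import json
import os


def read_resume_point(checkpoint_dir):
    """Return (global_step, epoch) recorded in a checkpoint's trainer_state.json.

    Assumes the file sits directly inside checkpoint_dir and uses the
    'global_step' and 'epoch' keys; missing keys come back as None.
    """
    path = os.path.join(checkpoint_dir, "trainer_state.json")
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return state.get("global_step"), state.get("epoch")
```

If the folder looks right, resuming is then typically a matter of passing it to the trainer, e.g. `trainer.train(resume_from_checkpoint=checkpoint_dir)`.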

  				
Posted 3 years ago by SmallDeer34

Could I use "register artifact" to get it to update every time there's a new checkpoint created?

  				
Posted 3 years ago by SmallDeer34

Oh, that's a neat tip! I just set that in the Task settings? I didn't know that was possible

  				
Posted 3 years ago by SmallDeer34

I'm not sure I follow. Can you elaborate what you mean? Pseudo stack?

  				
Posted 3 years ago by SmallDeer34

https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_pt_utils.py#L954
specifically called here:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L480
Maybe after this line add:
`Task.current_task().upload_artifact(name='state', artifact_object='trainer_state.json')` wdyt?

  				
Posted 3 years ago by AgitatedDove14

Basically it hooks into any torch.save function (monkey patching in realtime)
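The general idea of hooking a save function can be illustrated with a small stand-in. This is only a sketch of the monkey-patching technique, not ClearML's actual implementation: `fake_torch`, `patch_save`, `saved`, and `uploaded` are all hypothetical names, and no real torch or clearml code is involved.

```python
import functools
import types

# Stand-in for a library module that exposes save(obj, path); torch is NOT used here.
saved = []
fake_torch = types.SimpleNamespace(save=lambda obj, path: saved.append(path))

# Stand-in for "upload this file to the experiment server".
uploaded = []


def patch_save(module, on_saved):
    """Replace module.save with a wrapper that reports each saved path."""
    original = module.save

    @functools.wraps(original)
    def wrapper(obj, path, *args, **kwargs):
        result = original(obj, path, *args, **kwargs)  # do the real save first
        on_saved(path)                                 # then notify the hook
        return result

    module.save = wrapper
    return original  # returned so the patch can be undone if needed


patch_save(fake_torch, uploaded.append)
fake_torch.save({"lr": 0.001}, "optimizer.pt")
```

A hook like this only sees files that pass through the patched function. A file written with plain file I/O, which appears to be the case for trainer_state.json, would never reach it, which would be consistent with that file being the one left out.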

  				
Posted 3 years ago by AgitatedDove14

yep 🙂

  				
Posted 3 years ago by AgitatedDove14

Hmm pseudo stack:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_tf.py#L779

https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L285

https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L470

  				
Posted 3 years ago by AgitatedDove14

Could I use "register artifact"

I think this is somewhat deprecated and we should probably replace it with something similar to what you mentioned (i.e. watch a file change).
Right now the easiest way would be to manually upload the trainer_state.json every checkpoint:
`Task.current_task().upload_artifact(name='state', artifact_object='trainer_state.json')`
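One way to do that upload on every checkpoint without editing the library source might be a Trainer callback. The sketch below is an option not raised in the thread and rests on several assumptions: that `on_save` fires after the checkpoint folder is written, that the folder follows the default `checkpoint-<global_step>` naming under `args.output_dir`, and that the upload function accepts `name` and `artifact_object` keywords. `UploadTrainerStateCallback` is a hypothetical name; the `ImportError` fallback exists only so the sketch can be imported where transformers is not installed.

```python
import os

try:
    from transformers import TrainerCallback
except ImportError:  # fallback so the sketch imports without transformers
    TrainerCallback = object


class UploadTrainerStateCallback(TrainerCallback):
    """Hypothetical callback: upload trainer_state.json after each checkpoint save."""

    def __init__(self, upload_fn):
        # upload_fn is called as upload_fn(name=..., artifact_object=...),
        # e.g. Task.current_task().upload_artifact
        self.upload_fn = upload_fn

    def on_save(self, args, state, control, **kwargs):
        # Assumed default layout: <output_dir>/checkpoint-<global_step>/trainer_state.json
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        state_path = os.path.join(ckpt_dir, "trainer_state.json")
        if os.path.isfile(state_path):
            # Same artifact name every time, so only the latest state is kept.
            self.upload_fn(name="trainer_state", artifact_object=state_path)
```

It would be attached with something like `trainer.add_callback(UploadTrainerStateCallback(Task.current_task().upload_artifact))` before calling `trainer.train()`.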

  				
Posted 3 years ago by AgitatedDove14

I guess I could try and edit that, somehow. Hmm

  				
Posted 3 years ago by SmallDeer34

So in theory we could hook into one of those functions and add a line to have ClearML upload that particular json we want

  				
Posted 3 years ago by SmallDeer34

Presumably the correct way to do this is to fork the transformers library, make the change, and add that version to my requirements.txt

  				
Posted 3 years ago by SmallDeer34

Alas, no luck. Uploaded the same things, did not upload trainer_state.json

  				
Posted 3 years ago by SmallDeer34

oooh, that's awesome lol. Never thought to do it that way

  				
Posted 3 years ago by SmallDeer34

Hi SmallDeer34
Generally, any torch.save(...) call is logged/uploaded by ClearML automatically. Specifically, in your case I think the only missing one is trainer_state.json, which I assume is a plain JSON file, and I imagine is part of the Hugging Face framework. You can easily upload it as an additional artifact with Task.upload_artifact. wdyt?

  				
Posted 3 years ago by AgitatedDove14

Yeah, we don't even get to line 480, all the training loop is within line 469, I think.

  				
Posted 3 years ago by SmallDeer34
