So, here's a question. Does ClearML automatically save everything necessary to continue training a PyTorch language model? Specifically, I've been looking at the checkpoint folders created when I'm training a Hugging Face RobertaForMaskedLM. I checked what…

Answered

So, here's a question. Does ClearML automatically save everything necessary to continue training a PyTorch language model? Specifically, I've been looking at the checkpoint folders created when I'm training a Hugging Face RobertaForMaskedLM. I checked which files are saved in each checkpoint folder, and found that the following files are different each time:
rng_state.pth scheduler.pt trainer_state.json pytorch_model.bin optimizer.pt
And these are the ones that ClearML is saving:
rng_state.pth scheduler.pt training_args.bin pytorch_model.bin optimizer.pt
So it's saving training_args.bin, but not trainer_state.json.

...is that OK?

  				
SmallDeer34, posted 3 years ago
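
The comparison in the question can be written out as two set differences. This is only a restatement of the two file lists quoted above; the variable names are made up for illustration.

```python
# File names copied from the question above.
changed_each_checkpoint = {
    "rng_state.pth", "scheduler.pt", "trainer_state.json",
    "pytorch_model.bin", "optimizer.pt",
}
uploaded_by_clearml = {
    "rng_state.pth", "scheduler.pt", "training_args.bin",
    "pytorch_model.bin", "optimizer.pt",
}

# Files that change per checkpoint but were not uploaded.
not_uploaded = sorted(changed_each_checkpoint - uploaded_by_clearml)
# Files that were uploaded although they do not change per checkpoint.
uploaded_but_static = sorted(uploaded_by_clearml - changed_each_checkpoint)

print(not_uploaded)         # ['trainer_state.json']
print(uploaded_but_static)  # ['training_args.bin']
```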

Answers 28

OK, I added

Task.current_task().upload_artifact(name='trainer_state', artifact_object=os.path.join(output_dir, "trainer_state.json"))

after this line:

https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer.py#L1531

And it seems to be working.

  				
SmallDeer34, posted 3 years ago

My other question is: how does it decide what to upload automatically? It picked up almost everything, just not trainer_state.json, which I'm actually not quite sure is necessary.

  				
SmallDeer34, posted 3 years ago

I guess I could try and edit that, somehow. Hmm

  				
SmallDeer34, posted 3 years ago

I'm not sure I follow. Can you elaborate what you mean? Pseudo stack?

  				
SmallDeer34, posted 3 years ago

Yeah, we don't even get to line 480; the whole training loop runs within line 469, I think.

  				
SmallDeer34, posted 3 years ago

I'll give it a shot!

  				
SmallDeer34, posted 3 years ago

yep 🙂

  				
AgitatedDove14, posted 3 years ago

training loop is within line 469, I think.

I think the model state is just post training loop (not inside the loop), no?

  				
AgitatedDove14, posted 3 years ago

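If you cannot change the "TrainerState" (i.e. inherit and pass it into the code),
you could also monkey-patch it, something like:

    from dataclasses import asdict
    from clearml import Task
    from transformers import TrainerState

    class OurTrainerState(TrainerState):
        @classmethod
        def load_from_json(cls, json_path: str):
            # Load the state as usual, then upload the JSON as an artifact.
            state = super().load_from_json(json_path)
            Task.current_task().upload_artifact(...)
            return state

    # TrainerState is a dataclass, so copy its fields into the subclass.
    trainer.state = OurTrainerState(**asdict(trainer.state))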

  				
AgitatedDove14, posted 3 years ago

So in theory we could hook into one of those functions and add a line to have ClearML upload that particular json we want

  				
SmallDeer34, posted 3 years ago

Which is defined, it seems, here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_tf.py#L459

  				
SmallDeer34, posted 3 years ago

Could I use "register artifact"

I think this is somewhat deprecated and we should probably replace it with something similar to what you mentioned (i.e. watch a file change).
Right now the easiest way would be to manually upload trainer_state.json at every checkpoint:
Task.current_task().upload_artifact(name='state', artifact_object='trainer_state.json')

  				
AgitatedDove14, posted 3 years ago

https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_pt_utils.py#L954
Specifically, it is called here:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L480
Maybe add this after that line:
Task.current_task().upload_artifact(name='state', artifact_object='trainer_state.json')
WDYT?

  				
AgitatedDove14, posted 3 years ago

Oh, that's a neat tip! I just set that in the Task settings? I didn't know that was possible

  				
SmallDeer34, posted 3 years ago

I think the model state is just post training loop (not inside the loop), no?

trainer_state.json gets updated every time a "checkpoint" gets saved. I've got that set to once an epoch.

My testing indicates that if the training gets interrupted, I can resume training from a saved checkpoint folder that includes trainer_state.json. It uses that info to determine which data to skip, where to pick back up again, etc.

  				
SmallDeer34, posted 3 years ago
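
To see what a checkpoint's trainer_state.json records about training progress, the file can simply be read as JSON. The sketch below is a minimal illustration: `describe_checkpoint` is a hypothetical helper, and the demo file is fabricated with illustrative values rather than taken from a real run. It assumes only that the file contains `global_step` and `epoch` fields, as Hugging Face's trainer state does.

```python
import json
import os
import tempfile


def describe_checkpoint(checkpoint_dir):
    """Return (global_step, epoch) recorded in a checkpoint's trainer_state.json."""
    with open(os.path.join(checkpoint_dir, "trainer_state.json")) as f:
        state = json.load(f)
    return state["global_step"], state["epoch"]


# Demo on a fabricated checkpoint folder (values are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint-500")
    os.makedirs(ckpt)
    with open(os.path.join(ckpt, "trainer_state.json"), "w") as f:
        json.dump({"global_step": 500, "epoch": 1.0, "log_history": []}, f)
    step, epoch = describe_checkpoint(ckpt)

print(step, epoch)  # 500 1.0
```

With a real checkpoint folder, resuming would then typically go through `trainer.train(resume_from_checkpoint=checkpoint_dir)`, which is why the folder needs to contain this file.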

Basically it hooks into any torch.save function (monkey patching in realtime)

  				
AgitatedDove14, posted 3 years ago
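
The general idea of hooking a save function can be sketched with the standard library alone. This is not ClearML's actual implementation: `fake_torch`, `patch_save`, and the `uploaded` list are stand-ins invented for illustration, with `fake_torch.save` playing the role of torch.save. It also shows why a file written with plain `json`/`open` (as trainer_state.json is) would never be seen by such a hook.

```python
import json
import os
import pickle
import tempfile
from types import SimpleNamespace


def _save(obj, path):
    """Stand-in for a framework save function (hypothetical)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


fake_torch = SimpleNamespace(save=_save)
uploaded = []  # stand-in for "files sent to the experiment tracker"


def patch_save(module):
    """Wrap module.save so every file it writes is also recorded."""
    original_save = module.save

    def patched_save(obj, path, *args, **kwargs):
        result = original_save(obj, path, *args, **kwargs)
        uploaded.append(os.path.basename(path))
        return result

    module.save = patched_save


patch_save(fake_torch)

with tempfile.TemporaryDirectory() as ckpt_dir:
    # Goes through the patched function, so it is recorded.
    fake_torch.save({"lr": 0.1}, os.path.join(ckpt_dir, "optimizer.pt"))
    # Written with plain json/open, so the hook never sees it.
    with open(os.path.join(ckpt_dir, "trainer_state.json"), "w") as f:
        json.dump({"global_step": 10}, f)
    on_disk = sorted(os.listdir(ckpt_dir))

print(on_disk)   # ['optimizer.pt', 'trainer_state.json']
print(uploaded)  # ['optimizer.pt']
```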

Hmm pseudo stack:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_tf.py#L779

https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L285

https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L470

  				
AgitatedDove14, posted 3 years ago

Presumably the correct way to do this is to fork the transformers library, make the change, and add that version to my requirements.txt

  				
SmallDeer34, posted 3 years ago

lol TIL: https://stackoverflow.com/questions/5626193/what-is-monkey-patching

  				
SmallDeer34, posted 3 years ago

Could I use "register artifact" to get it to update every time there's a new checkpoint created?

  				
SmallDeer34, posted 3 years ago

Sorry, you are correct, this is where the JSON is created:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L470

The other links are the functions calling it. Make sense?

  				
AgitatedDove14, posted 3 years ago

What I'm curious about is how ClearML hooks into that to know to upload the other artifacts, such as optimizer.pt.

  				
SmallDeer34, posted 3 years ago

Alas, no luck. Uploaded the same things, did not upload trainer_state.json

  				
SmallDeer34, posted 3 years ago

oooh, that's awesome lol. Never thought to do it that way

  				
SmallDeer34, posted 3 years ago

Oh, interesting

  				
SmallDeer34, posted 3 years ago

OK, neat! Any advice on how to edit the training loop to do that? Because the code I'm using doesn't offer easy access to the training loop, see here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L469

trainer.train() just does the training loop automagically, and saves a checkpoint once in a while. When it saves a checkpoint, clearML uploads all the other files. How can I hook into... whatever triggers that, and upload this file also?

  				
SmallDeer34, posted 3 years ago
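
One way to hook into checkpoint saving without editing the training loop is a Hugging Face `TrainerCallback` with an `on_save` method. The sketch below is an assumption-laden illustration, not the approach the thread settled on: `UploadTrainerStateCallback` is a hypothetical name, the uploader is injected as a plain callable (in practice `Task.current_task().upload_artifact`), and it assumes checkpoints land in `<output_dir>/checkpoint-<global_step>` with trainer_state.json already written when `on_save` fires.

```python
import os

try:
    from transformers import TrainerCallback
except ImportError:  # fallback so the sketch still imports without transformers
    TrainerCallback = object


class UploadTrainerStateCallback(TrainerCallback):
    """Upload trainer_state.json after each checkpoint save (hypothetical helper)."""

    def __init__(self, upload):
        # `upload` is any callable accepting name= and artifact_object=,
        # e.g. Task.current_task().upload_artifact
        self.upload = upload

    def on_save(self, args, state, control, **kwargs):
        # Assumed layout: <output_dir>/checkpoint-<global_step>/trainer_state.json
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        state_path = os.path.join(ckpt_dir, "trainer_state.json")
        if os.path.isfile(state_path):
            self.upload(name="trainer_state", artifact_object=state_path)
```

It would be registered before training with something like `trainer.add_callback(UploadTrainerStateCallback(Task.current_task().upload_artifact))`, followed by the usual `trainer.train()`.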

Hi SmallDeer34
Generally, any torch.save(...) call is logged/uploaded by ClearML automatically. Specifically, in your case I think the only missing one is trainer_state.json, which I assume is a plain JSON file and, I imagine, part of the Hugging Face framework. You can easily upload it as an additional artifact with Task.upload_artifact. WDYT?

  				
AgitatedDove14, posted 3 years ago

Oh, here's an example, a screenshot I took of the files in my Colab instance:

  				
SmallDeer34, posted 3 years ago
