What I'm curious about is how ClearML hooks into that to know to upload the other artifacts, such as optimizer.pt.
OK, neat! Any advice on how to edit the training loop to do that? Because the code I'm using doesn't offer easy access to the training loop, see here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L469
trainer.train()
just does the training loop automagically, and saves a checkpoint once in a while. When it saves a checkpoint, clearML uploads all the other files. How can I hook into... whatever triggers that, and upload this file also?
The training loop is all within line 469, I think.
I think the model state is just post training loop (not inside the loop), no?
Sorry, you are correct this is where the json is created:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L470
The other links are the functions calling it. Make sense?
Could I use "register artifact" to get it to update every time there's a new checkpoint created?
Which is defined, it seems, here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_tf.py#L459
OK, I added
Task.current_task().upload_artifact(name='trainer_state', artifact_object=os.path.join(output_dir, "trainer_state.json"))
after this line:
And it seems to be working.
Hi SmallDeer34
Generally, any torch.save(...) call is logged/uploaded by ClearML
automatically. Specifically in your case, I think the only missing one is trainer_state.json, which I assume is a plain JSON file, and I imagine is part of the huggingface framework. You can easily upload it as an additional artifact with Task.upload_artifact
wdyt?
I'm not sure I follow. Can you elaborate on what you mean? Pseudo stack?
I guess I could try and edit that, somehow. Hmm
Oh, that's a neat tip! I just set that in the Task settings? I didn't know that was possible
If you cannot change the "TrainerState" (i.e. inherit from it and pass your subclass into the code),
you could also monkey-patch it, something like
` class OurTrainerState(TrainerState):
    def __init__(self, *args, **kwargs):
        ...
    @classmethod
    def load_from_json(cls, json_path: str):
        state = super().load_from_json(json_path)
        Task.current_task().upload_artifact(...)
        return state
trainer.state = OurTrainerState(trainer.state) `
Presumably the correct way to do this is to fork the transformers library, make the change, and add that version to my requirements.txt
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_pt_utils.py#L954
specifically called here:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L480
Maybe after this line add: `Task.current_task().upload_artifact(name='state', artifact_object='trainer_state.json')` wdyt?
trainer_state.json gets updated every time a "checkpoint" gets saved. I've got that set to once an epoch.
My testing indicates that if the training gets interrupted, I can resume training from a saved checkpoint folder that includes trainer_state.json. It uses the info to determine which data to skip, where to pick back up again, etc
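To see what a resume would pick up, something like the sketch below could find the newest checkpoint folder and read its trainer_state.json. It assumes checkpoint folders are named `checkpoint-<step>` under the output directory; `latest_trainer_state` is a made-up helper name, not part of transformers or ClearML.

```python
import json
import os
import re


def latest_trainer_state(output_dir):
    """Return (checkpoint_dir, state_dict) for the highest-numbered checkpoint, or None."""
    best = None
    for name in os.listdir(output_dir):
        # Assumption: checkpoint folders are named checkpoint-<step>.
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        state_path = os.path.join(output_dir, name, "trainer_state.json")
        if match and os.path.isfile(state_path):
            step = int(match.group(1))
            # Compare numerically so checkpoint-10 beats checkpoint-2.
            if best is None or step > best[0]:
                best = (step, os.path.join(output_dir, name), state_path)
    if best is None:
        return None
    with open(best[2]) as f:
        return best[1], json.load(f)
```

The returned folder is the kind of path you would hand to `trainer.train(resume_from_checkpoint=...)`, and the dict lets you peek at fields such as `global_step` before resuming.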
So in theory we could hook into one of those functions and add a line to have ClearML upload that particular json we want
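One way to hook in without touching the training loop might be a Trainer callback. This is only a sketch: it assumes the callback's `on_save` event fires after the checkpoint folder has been written, and that the folder is `<output_dir>/checkpoint-<global_step>`. `UploadTrainerStateCallback` and `clearml_upload` are made-up names for illustration.

```python
import os

try:
    from transformers import TrainerCallback
except ImportError:  # lets the sketch be tried without transformers installed
    TrainerCallback = object


def clearml_upload(name, path):
    """Default uploader: send `path` to the current ClearML task as an artifact."""
    from clearml import Task  # imported lazily so the class can be defined without clearml

    Task.current_task().upload_artifact(name=name, artifact_object=path)


class UploadTrainerStateCallback(TrainerCallback):
    """Hypothetical callback: upload trainer_state.json after each checkpoint save."""

    def __init__(self, upload=clearml_upload):
        self.upload = upload

    def on_save(self, args, state, control, **kwargs):
        # Assumption: checkpoints land in <output_dir>/checkpoint-<global_step>.
        path = os.path.join(
            args.output_dir, f"checkpoint-{state.global_step}", "trainer_state.json"
        )
        if os.path.isfile(path):
            self.upload("trainer_state", path)
```

With that in place, `trainer.add_callback(UploadTrainerStateCallback())` before `trainer.train()` should be enough, with no fork of transformers needed. Reusing the same artifact name each time means the artifact reflects the most recent checkpoint.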
Alas, no luck. Uploaded the same things, did not upload trainer_state.json
oooh, that's awesome lol. Never thought to do it that way
Basically it hooks into any torch.save function (monkey patching in realtime)
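Roughly, the idea looks like the sketch below. This is not ClearML's actual code, just an illustration of wrapping a save function at runtime; `patch_save` and `fake_torch` are made up, and `fake_torch` is a stand-in object, not the real torch module.

```python
import functools
import types


def patch_save(module, attr, on_save):
    """Replace module.<attr> with a wrapper that calls on_save(target) after each save."""
    original = getattr(module, attr)

    @functools.wraps(original)
    def wrapper(obj, target, *args, **kwargs):
        result = original(obj, target, *args, **kwargs)
        on_save(target)  # e.g. queue the saved file for upload
        return result

    setattr(module, attr, wrapper)
    return original  # keep a handle so the patch can be undone


# Stand-in for a library module exposing save(obj, path); not the real torch.
fake_torch = types.SimpleNamespace(save=lambda obj, path: None)
seen = []
patch_save(fake_torch, "save", seen.append)
fake_torch.save({"w": 1}, "optimizer.pt")
print(seen)  # ['optimizer.pt']
```

This would also explain why trainer_state.json is skipped: it is written as plain JSON, so it never passes through the patched save function.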
Could I use "register artifact"
I think this is somewhat deprecated and we should probably replace it with something similar to what you mentioned (i.e. watch a file change).
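For reference, a "watch a file change" approach could be sketched like this. It is a plain polling helper, not an existing ClearML feature; `FileWatcher` is a made-up name, and the callback you pass in would be whatever does the upload.

```python
import os


class FileWatcher:
    """Call on_change(path) whenever the file's modification time or size changes."""

    def __init__(self, path, on_change):
        self.path = path
        self.on_change = on_change
        self._last = None  # last seen (mtime_ns, size), or None if never seen

    def _signature(self):
        try:
            st = os.stat(self.path)
        except FileNotFoundError:
            return None
        return (st.st_mtime_ns, st.st_size)

    def poll(self):
        """Check the file once; return True if on_change was called."""
        sig = self._signature()
        if sig is not None and sig != self._last:
            self._last = sig
            self.on_change(self.path)
            return True
        return False
```

You would call `poll()` periodically (for example from a background thread), passing something like `lambda p: Task.current_task().upload_artifact(name='trainer_state', artifact_object=p)` as `on_change`.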
Right now the easiest way would be to manually upload the trainer_state.json
every checkpoint: `Task.current_task().upload_artifact(name='state', artifact_object='trainer_state.json')`
Yeah, we don't even get to line 480, all the training loop is within line 469, I think.
Oh, here's an example, a screenshot I took of the files in my Colab instance:
My other question is: how does it decide what to upload automatically? It picked up almost everything, just not trainer_state.json. Which I'm actually not quite sure is necessary