My other question is: how does it decide what to upload automatically? It picked up almost everything, just not trainer_state.json, which I'm actually not sure is necessary.
Presumably the correct way to do this is to fork the transformers library, make the change, and add that version to my requirements.txt
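For reference, pinning a fork in requirements.txt would look something like this (the user name and branch are placeholders for wherever the fork actually lives):

    transformers @ git+https://github.com/myuser/transformers.git@my-fix-branch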
CostlyOstrich36 I get some weird results for "active duration".
For example, several of the experiments show that their active duration is more than 90 days, but I definitely didn't run them that long.
Alas, no luck. It uploaded the same things and did not upload trainer_state.json.
As an alternate solution, if I could group runs and get stats across the group, that would be cool
Oh look, the blue setting is best!
Oh, that's cool, didn't know about that:
Something like this is what I'm looking for.
Well, in my particular case the training data's got, like, 200 subfolders, each with 2,000 files. I was just curious whether it was possible to pull down one of the subsets.
No, not specifically 20; in fact, more than 20.
AgitatedDove14 Yes, I see the scalars. Screenshot attached.
Code to reproduce: I'll try to come up with a sample you will be able to run. But the code we're using is basically just https://github.com/huggingface/transformers/blob/f6e254474cb4f90f8a168a599b9aaf3544c37890/examples/pytorch/language-modeling/run_mlm.py
OK, so if I've got, like, 2x16GB GPUs and 2x32GB, I could allocate all the 16GB GPUs to one queue? And all the 32GB ones to another?
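For reference, here's roughly the setup I have in mind: one clearml-agent per GPU, with the agents grouped by queue (assuming clearml-agent daemon's --gpus, --queue, and --detached flags behave the way I think they do; the queue names are made up):

    # one agent per 16GB GPU, both pulling from a "16gb" queue
    clearml-agent daemon --gpus 0 --queue 16gb --detached
    clearml-agent daemon --gpus 1 --queue 16gb --detached
    # one agent per 32GB GPU, both pulling from a "32gb" queue
    clearml-agent daemon --gpus 2 --queue 32gb --detached
    clearml-agent daemon --gpus 3 --queue 32gb --detached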
I know the documentation says you can give it a wildcard or a pathlib Path, but I'm still not quite sure how to tell it "top-level files only, not subfolders".
Oh, I forgot to mention: you'll also need to pip install tensorboard.
It would certainly be nice to have. Lately I've heard of groups that train on slices of a dataset for distributed training, or who "stream" data.
Well, now that I know, I can just work around it by creating a folder with no subfolders and uploading that. But... 🤔 perhaps allow the interface to take in a list or generator? As in:

    files_to_upload = [f for f in output_dir.glob("*") if f.is_file()]
    Task.current_task().upload_artifact(
        "best_checkpoint",
        artifact_object=files_to_upload,
    )
And then it could zip up the list and name it "best_checkpoint"?
Yeah! So if given a folder, it adds everything in the folder. But if given a list or iterable, it iterates over the Paths and zips them all up.
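In the meantime, here's a sketch of the workaround I have in mind, minus the extra folder: zip up the top-level files myself and upload the archive as a single artifact (the paths and artifact name are placeholders):

    import zipfile
    from pathlib import Path
    from clearml import Task

    output_dir = Path("output")  # placeholder: wherever the checkpoint files land
    zip_path = output_dir / "best_checkpoint.zip"

    # zip only the top-level files, skipping subfolders (and the zip itself)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in output_dir.glob("*"):
            if f.is_file() and f != zip_path:
                zf.write(f, arcname=f.name)

    Task.current_task().upload_artifact("best_checkpoint", artifact_object=zip_path)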
Oh yeah, that's been bugging me for a while
Or at least not conveniently
Then when I queue up a job on the 1x16gb queue, it would run on one of the two GPUs?
We do have the paid tier, I believe. Is there anywhere we can go to read up some more on this stuff, btw?
Oh, that's a neat tip! I just set that in the Task settings? I didn't know that was possible
I think the model state is just post-training-loop (not inside the loop), no?
trainer_state.json gets updated every time a "checkpoint" gets saved. I've got that set to once per epoch.
My testing indicates that if training gets interrupted, I can resume from a saved checkpoint folder that includes trainer_state.json. It uses that info to determine which data to skip, where to pick back up again, etc.
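For context, the resume on our end is basically the HF Trainer's resume_from_checkpoint (the checkpoint path here is a placeholder, and trainer is the Trainer instance that run_mlm.py builds):

    # the checkpoint folder must contain trainer_state.json for the
    # skip-ahead logic to work
    trainer.train(resume_from_checkpoint="output/checkpoint-12345")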
Could I use "register artifact" to get it to update every time a new checkpoint is created?
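Something like this is what I'm imagining, if register_artifact really does re-upload when the object changes. My understanding is it's meant for pandas DataFrames, so this sketch reloads trainer_state.json's log_history into one on every save (the hook name and paths are made up):

    import json
    import pandas as pd
    from clearml import Task

    # hypothetical hook, called each time a checkpoint gets saved
    def on_checkpoint_saved(checkpoint_dir):
        with open(f"{checkpoint_dir}/trainer_state.json") as f:
            state = json.load(f)
        # log_history is the list of per-step/per-epoch log dicts
        df = pd.DataFrame(state["log_history"])
        Task.current_task().register_artifact("trainer_state", df)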
I'm not sure I follow. Can you elaborate on what you mean? Pseudo stack?