Hopefully it works for you; getting `run_mlm.py` to work took me some trial and error the first time. There is a `--help` option for the command line, I believe. Some of the options aren't really intuitive.
Oh, I forgot to mention: `pip install tensorboard` as well.
TB = TensorBoard? No idea, I haven't tried to run it with TensorBoard specifically. I do have tensorboard installed in the environment, I can confirm that.
I know it's running these lines, which are defined in https://github.com/huggingface/transformers/blob/f6e254474cb4f90f8a168a599b9aaf3544c37890/src/transformers/trainer_pt_utils.py#L828 : `trainer.log_metrics("eval", metrics)` and `trainer.save_metrics("eval", metrics)`
Hi SmallDeer34
Can you see it in TB? And if so, where?
AgitatedDove14 Yes, I see the scalars. Screenshot attached.
Code to reproduce: I'll try to come up with a sample you will be able to run. But the code we're using is basically just https://github.com/huggingface/transformers/blob/f6e254474cb4f90f8a168a599b9aaf3544c37890/examples/pytorch/language-modeling/run_mlm.py
This should work. It has the tokenizer files, the train.txt, the validation.txt and a config.json
Quick question about `CLEAR_DATA="./data/dataset_for_modeling"`: should I pass the folder of the extracted zip file (assuming train.txt is the training dataset)?
ClearML automatically picks up these reported metrics from TB. Since you mentioned seeing the scalars, I assume Hugging Face reports to TB. Could you verify? Is there a quick code sample to reproduce this?
Thanks SmallDeer34 !
This is exactly what I needed
Yeah, that should work. Basically, `--train_file` needs the path to train.txt, `--validation_file` needs the path to validation.txt, etc. I just put them all in the same folder for convenience.
How to reproduce the training:
You need to `pip install` the requirements first. I think the following would do: `transformers datasets clearml tokenizers torch`
CLEAR_DATA has train.txt and validation.txt; the .txt files just need to have text data on separate lines. For debugging, anything should do.
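If you just want placeholder data to debug the pipeline, something like this stdlib-only sketch would do (the folder path matches the `CLEAR_DATA` value from the command; the sentence content is arbitrary):

```python
from pathlib import Path

# One text sample per line, as run_mlm.py --line_by_line expects.
data_dir = Path("./data/dataset_for_modeling")
data_dir.mkdir(parents=True, exist_ok=True)

lines = [f"this is debug sentence number {i}" for i in range(100)]
(data_dir / "train.txt").write_text("\n".join(lines[:80]) + "\n")
(data_dir / "validation.txt").write_text("\n".join(lines[80:]) + "\n")
```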
For training you also need the tokenizer files (vocab.json, merges.txt, and tokenizer.json) and a config.json, and it should work.
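A quick stdlib pre-flight check that the folder actually has all six files before launching; a sketch, with the file names taken from the list above and the helper name made up:

```python
from pathlib import Path

REQUIRED = [
    "train.txt", "validation.txt",        # data
    "vocab.json", "merges.txt",           # tokenizer
    "tokenizer.json", "config.json",      # tokenizer + model config
]

def missing_files(clear_data: str) -> list[str]:
    """Return the names of required files missing from the dataset folder."""
    folder = Path(clear_data)
    return [name for name in REQUIRED if not (folder / name).is_file()]

# Empty list means the folder is ready for run_mlm.py.
print(missing_files("./data/dataset_for_modeling"))
```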
```shell
export CLEAR_DATA="./data/dataset_for_modeling"
python3 run_mlm.py \
    --line_by_line \
    --seed 420 \
    --config_name "$CLEAR_DATA" \
    --tokenizer_name "$CLEAR_DATA" \
    --train_file "$CLEAR_DATA/train.txt" \
    --validation_file "$CLEAR_DATA/validation.txt" \
    --max_seq_length 512 \
    --do_train \
    --do_eval \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --output_dir ./output/mlm_training_output
```
Let me get you a dataset.