Oh, btw, I assume you mean http://clear.ml , not http://clearml.ml ?
OK, I guess
` training_args_dict = training_args.to_dict()
Task.current_task().set_parameters_as_dict(training_args_dict) `
works, but how to change the name from "General"?
So for example:
` {'output_dir': 'shiba_ner_trainer', 'overwrite_output_dir': False, 'do_train': True, 'do_eval': True, 'do_predict': True, 'evaluation_strategy': 'epoch', 'prediction_loss_only': False, 'per_device_train_batch_size': 16, 'per_device_eval_batch_size': 16, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 1, 'eval_accumulation_steps': None, 'learning_rate': 0.0004, 'weight_decay': 0.0, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam...
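For what it's worth, this is roughly how I'd expect to get a section name other than "General" (untested sketch; I'm going off `Task.connect()` taking a `name` argument, which I haven't confirmed against the docs):
` from clearml import Task

task = Task.current_task()
training_args_dict = training_args.to_dict()  # training_args is the HF TrainingArguments object

# connect() takes a section name, so the parameters should show up
# under e.g. "Training Args" instead of "General"
task.connect(training_args_dict, name="Training Args") `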
Reproduce the training:
` # How to run
You need to pip install the requirements first. I think the following would do: transformers datasets clearml tokenizers torch
CLEAR_DATA has train.txt and validation.txt; the .txt files just need to have text data on separate lines. For debugging, anything should do.
For training you need tokenizer files as well: vocab.json, merges.txt, and tokenizer.json.
You also need a config.json, then it should work.
export CLEAR_DATA="./data/dataset_for...
TB = TensorBoard? No idea, I haven't tried to run it with TensorBoard specifically. I do have TensorBoard installed in the environment, I can confirm that.
AgitatedDove14 yes I see the scalars. Attached screenshot
Code to reproduce: I'll try to come up with a sample you will be able to run. But the code we're using is basically just https://github.com/huggingface/transformers/blob/f6e254474cb4f90f8a168a599b9aaf3544c37890/examples/pytorch/language-modeling/run_mlm.py
I know it's running these lines, which get defined in https://github.com/huggingface/transformers/blob/f6e254474cb4f90f8a168a599b9aaf3544c37890/src/transformers/trainer_pt_utils.py#L828 :
` trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics) `
Hmm, I tried publishing one and it doesn't seem to have worked quite that easily: https://app.pro.clear.ml/projects/b4a1875539cb4d9798529439801402ee/experiments/6f4cb4718c7c4a25b3a041c63f6ff2b4/execution?columns=selected&columns=type&columns=last_iteration&columns=hyperparams.Args.num_train_epochs&columns=name&columns=status&columns=users&columns=started&columns=last_update&columns=tags&columns=parent.name&columns=project.name&columns=m.2eed1fe0db36d674643b5f84d2adf46e.06eaeb413e7213cb8b5419...
AgitatedDove14 yes, I called init and tensorboard is installed. It successfully uploaded the metrics from trainer.train(), just not from the next cell where we do trainer.predict
This seems to work:
` from clearml import Logger

# report_scalar(title, series, value, iteration)
for test_metric in posttrain_metrics:
    print(test_metric, posttrain_metrics[test_metric])
    Logger.current_logger().report_scalar("test", test_metric, posttrain_metrics[test_metric], 0) `
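Another option might be this (sketch only; I believe newer clearml versions have `report_single_value()` for one-off metrics, so they don't need a fake iteration 0):
` from clearml import Logger

logger = Logger.current_logger()
for test_metric, value in posttrain_metrics.items():
    # single values don't need an iteration axis at all
    logger.report_single_value(name=test_metric, value=value) `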
No, they're not in Tensorboard
What I'm curious about is how ClearML hooks into that to know to upload the other artifacts such as optimizer.pt.
But then I took out all my additions except for `pip install clearml` and
` from clearml import Task
task = Task.init(project_name="project name", task_name="Esperanto_Bert_2") `
and now I'm not getting the error? But it's still installing 1.02. So I'm just thoroughly confused at this point. I'm going to start with a fresh copy of the original Colab notebook from https://huggingface.co/blog/how-to-train
Or at least not conveniently
Did a couple tests with Colab, moving the installs and imports up to the top. Results... seem to suggest that doing all the installs/imports before actually running the tokenization and such might fix the problem too?
It's a bit confusing. I made a couple of cells at the top, like thus:
` !pip install clearml `
and
` from clearml import Task
task = Task.init(project_name="project name", task_name="Esperanto_Bert_2") `
and
` # Check that PyTorch sees it
import torch
torch.cuda.is_available() `
and
...
Could I use "register artifact" to get it to update every time there's a new checkpoint created?
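Something like this is what I had in mind (untested sketch; as far as I understand, register_artifact tracks a mutable object like a pandas DataFrame, while a one-off checkpoint file would go through upload_artifact):
` import pandas as pd
from clearml import Task

task = Task.current_task()

# register_artifact watches a mutable DataFrame and re-uploads it when it changes
log_df = pd.DataFrame(columns=["step", "loss"])
task.register_artifact("training log", log_df)

# a single file such as a checkpoint would go through upload_artifact instead
task.upload_artifact("optimizer state", artifact_object="./checkpoint-500/optimizer.pt")  # hypothetical path `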
Oh, that's a neat tip! I just set that in the Task settings? I didn't know that was possible
CostlyOstrich36 nice, thanks for the link. I know that in "info" on the experiments dashboard it includes gpu_type and started/completed times, I'll give it a go based on that
AgitatedDove14 I should have probably expanded my last message a bit more to say "Right, natanM, right now it's on http://app.pro.clear.ml , not http://app.clear.ml , can you advise, given that it is on .pro?"
This seems similar but not quite the thing I'm looking for: https://allegro.ai/clearml/docs/docs/tutorials/tutorial_explicit_reporting.html#step-1-setting-an-output-destination-for-model-checkpoints
Hang on, CostlyOstrich36 I just noticed that there's a "project compute time" on the dashboard? Do you know how that is calculated/what that is?
OK, neat! Any advice on how to edit the training loop to do that? Because the code I'm using doesn't offer easy access to the training loop, see here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L469
` trainer.train() `
just does the training loop automagically, and saves a checkpoint once in a while. When it saves a checkpoint, ClearML uploads all the other files. How can I hook into... whatever ...
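The closest thing I can think of is a HF TrainerCallback (rough, untested sketch; I'm assuming the Trainer writes checkpoints to output_dir/checkpoint-{global_step} with trainer_state.json inside, and that on_save fires after each save):
` import os

from transformers import TrainerCallback
from clearml import Task

class CheckpointArtifactCallback(TrainerCallback):
    # upload trainer_state.json every time the Trainer saves a checkpoint
    def on_save(self, args, state, control, **kwargs):
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        state_file = os.path.join(ckpt_dir, "trainer_state.json")
        if os.path.exists(state_file):
            Task.current_task().upload_artifact(
                name=f"trainer_state_{state.global_step}",
                artifact_object=state_file,
            )

trainer.add_callback(CheckpointArtifactCallback())
trainer.train() `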
My other question is: how does it decide what to upload automatically? It picked up almost everything, just not trainer_state.json, which I'm actually not quite sure is necessary.
Alas, no luck. Uploaded the same things, did not upload trainer_state.json