Yup! That works.
from joeynmt.training import train
train("transformer_epo_eng_bpe4000.yaml")
And it's tracking stuff successfully. Nice
This seems similar but not quite the thing I'm looking for: https://allegro.ai/clearml/docs/docs/tutorials/tutorial_explicit_reporting.html#step-1-setting-an-output-destination-for-model-checkpoints
generally I include the random seed in the name
Aggregating the range across all the runs, sort of like a hurricane track?
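A minimal sketch of that kind of band plot, assuming the per-run curves have already been pulled down (the data below is synthetic and every name is made up):
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for five runs of the same experiment
steps = np.arange(100)
runs = np.stack([np.exp(-steps / 30) + 0.05 * np.random.randn(100) for _ in range(5)])

# Mean trajectory plus a min-max "cone" across runs
plt.fill_between(steps, runs.min(axis=0), runs.max(axis=0), alpha=0.3, label="min-max across runs")
plt.plot(steps, runs.mean(axis=0), label="mean")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()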
Well they do all have different names
Actually at this point, I'd say it's too late, you might want to just generate new credentials...
Martin, I found a different solution (hardcoding the parent tasks by hand), but I'm curious to hear what you discover!
Yup, I just wanted to mark it completed, honestly. But then when I run it, Colab crashes.
OK, so if I've got, like, 2x16GB GPUs and 2x32GB, I could allocate all the 16GB GPUs to one Queue? And all the 32GB ones to another?
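Something like this would express that split, if I'm reading the agent docs right. Queue names here are made up, and each agent would be started pinned to its GPUs and its queue, e.g. clearml-agent daemon --queue gpu32 --gpus 2,3:
from clearml import Task

# Made-up project/queue names. Agents for the 16GB cards would listen on
# "gpu16" and agents for the 32GB cards on "gpu32", each started with
# clearml-agent daemon --queue <name> --gpus <ids>.
task = Task.init(project_name="project name", task_name="big run")
task.execute_remotely(queue_name="gpu32")  # hand off to a 32GB-GPU agent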
Hello! Integration in what sense? Training a model? Uploading a model to the hub? Something else?
So presumably you could write a Python loop that goes through and pulls the metrics into a list, then make a plot locally. Not sure about creating a Dashboard within the ClearML web interface though!
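A rough sketch of that loop, assuming the SDK's Task.get_tasks and get_reported_scalars, and guessing "loss"/"train" as the scalar title/series (check what your runs actually report):
from clearml import Task
import matplotlib.pyplot as plt

# Collect one scalar series from every task in the project and plot them.
# The "loss"/"train" title/series names are guesses; print the dict from
# get_reported_scalars() to see what a given run really logged.
for task in Task.get_tasks(project_name="project name"):
    series = task.get_reported_scalars().get("loss", {}).get("train")
    if series:
        plt.plot(series["x"], series["y"], label=task.name)
plt.legend()
plt.show()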
I've been trying to do things like "color these five experiments one color, color these other five a different color", but then once I maximize the thing the colors all change
BroadCoyote44 you, uh, might want to delete the bit of your message with the secret key in it?
Or at least not conveniently
It seems to create a folder and put things into it; I was hoping to just observe the tensorboard folder.
Oh, of course, that makes total sense
Oh, btw, I assume you mean http://clear.ml, not http://clearml.ml?
So for example:
{'output_dir': 'shiba_ner_trainer', 'overwrite_output_dir': False,
 'do_train': True, 'do_eval': True, 'do_predict': True,
 'evaluation_strategy': 'epoch', 'prediction_loss_only': False,
 'per_device_train_batch_size': 16, 'per_device_eval_batch_size': 16,
 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None,
 'gradient_accumulation_steps': 1, 'eval_accumulation_steps': None,
 'learning_rate': 0.0004, 'weight_decay': 0.0,
 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam...
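(Those keys look like a dump of Hugging Face TrainingArguments — an assumption on my part — in which case building the same thing directly would be roughly:)
from transformers import TrainingArguments

# Sketch under the assumption that the dict above came from
# transformers.TrainingArguments; only the non-default values are set
args = TrainingArguments(
    output_dir="shiba_ner_trainer",
    do_train=True,
    do_eval=True,
    do_predict=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=4e-4,
)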
Did a couple tests with Colab, moving the installs and imports up to the top. Results... seem to suggest that doing all the installs/imports before actually running the tokenization and such might fix the problem too?
It's a bit confusing. I made a couple cells at the top, like thus:
!pip install clearml
and
from clearml import Task
task = Task.init(project_name="project name", task_name="Esperanto_Bert_2")
and
# Check that PyTorch sees it
import torch
torch.cuda.is_available()
and
...
I will test both! Thanks for the ideas!
Hopefully it works for you; getting run_mlm.py to work took me some trial and error the first time. There is a --help option for the command line, I believe. Some of the things aren't really intuitive.
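For what it's worth, a typical invocation looks roughly like the following — a sketch only, with placeholder paths and model name, and flags can differ between transformers versions:
!python run_mlm.py \
    --model_name_or_path bert-base-cased \
    --train_file train.txt \
    --validation_file valid.txt \
    --do_train --do_eval \
    --per_device_train_batch_size 16 \
    --output_dir ./mlm_output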
Then I gave that folder a name.
Oh, I forgot to mention: pip install tensorboard also
It would certainly be nice to have. Lately I've heard of groups that split datasets into slices for distributed training, or who "stream" data.
TB = Tensorboard? No idea, I haven't tried to run it with tensorboard specifically. I do have tensorboard installed in the environment; I can confirm that.