Yes, it trains fine. I can even look at the console output
Tried it. Updated the script (attached) to add it to the main function instead. Then ran it locally, aborted the job, "reset" the job in the ClearML web interface, and ran it remotely on a GPU queue. As you can see in the log (attached), there is loss being computed, but it's not showing up in the scalars (attached picture):
edit: where I ran it after resetting
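If the auto-capture keeps missing them, I figure the fallback is to report the scalars explicitly from the training loop. A minimal sketch, assuming the loss is available as a plain float at each logging step (the title/series names here are just placeholders):
```
from clearml import Logger

def report_loss(loss_value: float, step: int) -> None:
    # Push the loss straight to the ClearML Scalars tab, independent of
    # the TensorBoard/stdout auto-capture.
    Logger.current_logger().report_scalar(
        title="train", series="loss", value=loss_value, iteration=step
    )
```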
not much different from the HuggingFace version, I believe
Before I enqueued the job, I manually edited Installed Packages thus: `boto3 datasets clearml tokenizers torch`
and added `pip install git+` to the setup script.
And the docker image is `nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04`
I did all that because I've been having this other issue: https://clearml.slack.com/archives/CTK20V944/p1624892113376500
Long story, but in the other thread I couldn't install the particular version of transformers unless I removed it from "Installed Packages" and added it to the setup script instead. So I took to just throwing in that list of packages.
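In hindsight, I suppose the same override could be done from code instead of hand-editing the UI, via `Task.add_requirements` before `Task.init` — a minimal sketch, assuming a git URL is accepted as the requirement spec (the project/task names are made up):
```
from clearml import Task

# Force the pinned transformers commit into the task's requirements,
# instead of hand-editing "Installed Packages" in the web UI.
# Must be called *before* Task.init().
Task.add_requirements(
    "git+https://github.com/huggingface/transformers@61c506349134db0a0a2fd6fb2eff8e29a2f84e79"
)

task = Task.init(project_name="shiba", task_name="ner_trainer")  # placeholder names
```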
SuccessfulKoala55 the clearml version on the server, according to my colleague, is:
```
clearml-agent --version
CLEARML-AGENT version 1.0.0
```
As in, I edit Installed Packages, delete everything there, and put that particular list of packages.
it's one where I reset it, and cleared out the Installed Packages to only have `transformers @ git+https://github.com/huggingface/transformers@61c506349134db0a0a2fd6fb2eff8e29a2f84e79` in it.
SuccessfulKoala55 I think I just realized I had a misunderstanding. I don't think we are running a local server version of ClearML, no. We have a workstation running a queue/agents, but ClearML itself is via http://app.pro.clear.ml ; I don't think we have ClearML running locally. We were tracking experiments before we set up the queue and the workers and all that.
IrritableOwl63 can you confirm - we didn't set up our own server to, like, handle experiment tracking and such?
I went to https://app.pro.clear.ml/profile and looked in the bottom right. But would this tell us about the version of the server run by Dan?
Well, in my particular case the training data's got, like, 200 subfolders, each with 2,000 files. I was just curious whether it was possible to pull down just one of the subsets.
It would certainly be nice to have. Lately I've heard of groups that do slices of datasets for distributed training, or who "stream" data.
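Something along these lines is what I was picturing — a sketch, assuming a clearml version where `Dataset.get_local_copy` supports the `part`/`num_parts` chunking arguments (the project/dataset names are placeholders):
```
from clearml import Dataset

# Pull down only a slice of the dataset instead of all ~200 subfolders:
# part/num_parts splits the download into chunks, here 1 chunk out of 200.
dataset = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
local_path = dataset.get_local_copy(part=0, num_parts=200)
print(local_path)
```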
There's also https://allegro.ai/clearml/docs/rst/references/clearml_python_ref/task_module/task_task.html
SuccessfulKoala55 what's the difference between the two websites? Is one of them preferred?
Ah, makes sense! Have you considered adding a "this is the old website! Click here to get to the new one!" banner, kinda like on docs for python2 functions? https://docs.python.org/2.7/library/string.html
OK, so if I've got, like, 2x16GB GPUs and 2x32GB I could allocate all the 16GB GPUs to one Queue? And all the 32GB ones to another?
We do have the paid tier, I believe. Anywhere we can go and read up some more on this stuff, btw?
So for example:
` {'output_dir': 'shiba_ner_trainer', 'overwrite_output_dir': False, 'do_train': True, 'do_eval': True, 'do_predict': True, 'evaluation_strategy': 'epoch', 'prediction_loss_only': False, 'per_device_train_batch_size': 16, 'per_device_eval_batch_size': 16, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 1, 'eval_accumulation_steps': None, 'learning_rate': 0.0004, 'weight_decay': 0.0, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam...
OK, I guess
```
training_args_dict = training_args.to_dict()
Task.current_task().set_parameters_as_dict(training_args_dict)
```
works, but how do I change the name from "General"?
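For posterity, I believe `Task.connect` takes a `name` argument that sets the section header, so something like this should group the parameters under a custom section instead of "General" — a sketch, not verified:
```
from clearml import Task
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="shiba_ner_trainer")
training_args_dict = training_args.to_dict()

# connect() with an explicit name= files the parameters under that
# section in the web UI instead of the default "General".
Task.current_task().connect(training_args_dict, name="TrainingArgs")
```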
No, they're not in Tensorboard
It's not a big deal because it happens after I'm done with everything; I can just reset the Colab runtime and start over.
Yup, I just wanted to mark it completed, honestly. But then when I run it, Colab crashes.
I suppose I could upload 200 different "datasets", rather than one dataset with 200 folders in it, but then `clearml-data search` would have 200 entries in it? It seemed like a good idea to put them all in one at the time.
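For what it's worth, the programmatic version of that search would look roughly like this — a sketch, assuming `Dataset.list_datasets` returns one dict per dataset with `name`/`id` keys (the project name is a placeholder):
```
from clearml import Dataset

# With 200 separate uploads, this listing would have 200 entries,
# one per subfolder-turned-dataset.
for entry in Dataset.list_datasets(dataset_project="my_project"):
    print(entry["name"], entry["id"])
```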
Oh, of course, that makes total sense
Yup! That works.
```
from joeynmt.training import train
train("transformer_epo_eng_bpe4000.yaml")
```
And it's tracking stuff successfully. Nice
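End to end, the whole thing is roughly this — a sketch, assuming ClearML's auto-logging picks up joeynmt's output once a task exists (the project/task names are placeholders):
```
from clearml import Task
from joeynmt.training import train

# Create the task first so ClearML auto-captures stdout and the
# TensorBoard scalars joeynmt writes during training.
task = Task.init(project_name="joeynmt", task_name="transformer_epo_eng_bpe4000")
train("transformer_epo_eng_bpe4000.yaml")
```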