A colleague was asking about it, especially how hard it would be to save off the "best" model instead of the last one.
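For context, here's a minimal sketch of what I mean, assuming a plain PyTorch loop (the `train_one_epoch`/`evaluate` helpers and the file names are made up); with a ClearML `Task` initialized, the `torch.save()` calls should get picked up as output models:
```
import torch

def train(model, train_loader, val_loader, optimizer, num_epochs):
    best_val_loss = float("inf")
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
        val_loss = evaluate(model, val_loader)           # hypothetical helper
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            # Overwrite the "best" checkpoint only when validation improves
            torch.save(model.state_dict(), "best_model.pt")
    # The usual "last" checkpoint, saved unconditionally at the end
    torch.save(model.state_dict(), "last_model.pt")
```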
Did a couple of tests with Colab, moving the installs and imports up to the top. The results seem to suggest that doing all the installs/imports before actually running the tokenization and such might fix the problem too?
It's a bit confusing. I made a couple cells at the top, like thus:
```
!pip install clearml
```
and
```
from clearml import Task
task = Task.init(project_name="project name", task_name="Esperanto_Bert_2")
```
and
```
# Check that PyTorch sees it
import torch
torch.cuda.is_available()
```
and
...
This seems similar but not quite the thing I'm looking for: https://allegro.ai/clearml/docs/docs/tutorials/tutorial_explicit_reporting.html#step-1-setting-an-output-destination-for-model-checkpoints
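If I'm reading that tutorial right, the relevant bit is the `output_uri` argument to `Task.init`, something like this (the bucket path here is a placeholder):
```
from clearml import Task

# Setting an output destination means checkpoints saved during
# training get uploaded there automatically.
task = Task.init(
    project_name="project name",
    task_name="Esperanto_Bert_2",
    output_uri="s3://my-bucket/clearml-models",  # placeholder destination
)
```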
Yup! That works.
```
from joeynmt.training import train

train("transformer_epo_eng_bpe4000.yaml")
```
And it's tracking stuff successfully. Nice
Actually, at this point I'd say it's too late; you might want to just generate new credentials...
I went to https://app.pro.clear.ml/profile and looked in the bottom right. But would this tell us about the version of the server run by Dan?
When I was answering the question "are you using a local server", I misinterpreted it as "are you running the agents and queue on a local workstation".
Long story, but in the other thread I couldn't install the particular version of transformers unless I removed it from "Installed Packages" and added it to the setup script instead. So I took to just throwing in that whole list of packages.
But then I took out all my additions except for
```
!pip install clearml
```
and
```
from clearml import Task
task = Task.init(project_name="project name", task_name="Esperanto_Bert_2")
```
and now I'm not getting the error? But it's still installing 1.02. So I'm just thoroughly confused at this point. I'm going to start with a fresh copy of the original Colab notebook from https://huggingface.co/blog/how-to-train
Hello! Integration in what sense? Training a model? Uploading a model to the hub? Something else?
The parent task IDs are what I originally wanted, remember?
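Something like this is what I had in mind (a hedged sketch; the task ID is a placeholder, and I'm going from memory on the `parent` property):
```
from clearml import Task

child = Task.get_task(task_id="<child-task-id>")  # placeholder ID
print(child.parent)  # the parent task's ID, if one was set
```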
I gather there's a distinction between the two, with app.clear being the public cloud-based SaaS version
This seems to work:
```
from clearml import Logger

for test_metric in posttrain_metrics:
    print(test_metric, posttrain_metrics[test_metric])
    # report_scalar(title, series, value, iteration)
    Logger.current_logger().report_scalar(
        "test", test_metric, posttrain_metrics[test_metric], 0
    )
```
This was only a means to that end
BroadCoyote44 you, uh, might want to delete the bit of your message with the secret key in it?
I will test both! Thanks for the ideas!
Tried it. Updated the script (attached) to add it to the main function instead. Then ran it locally, then aborted the job, then "reset" the job in the ClearML web interface and ran it remotely on a GPU queue. As you can see in the log (attached), there is loss happening, but it's not showing up in the scalars (attached picture):
edit: where I ran it after resetting
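If auto-logging keeps missing it, I guess the fallback is to report the loss explicitly per iteration, roughly like this (the title/series names and the `loss_value`/`step` variables are made up):
```
from clearml import Logger

# Reporting each step explicitly makes the loss show up under
# Scalars even when automatic framework logging doesn't catch it.
Logger.current_logger().report_scalar(
    title="train", series="loss", value=loss_value, iteration=step
)
```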
No, not specifically 20; in fact, more than 20.
I'm scrolling through the other thread to see if it's there
Local in the sense that my team member set it up, but remote to me.
Here's the actual script I'm using
Sounds doable, I will give it a try.
The `task.execute_remotely` thing is quite interesting; I didn't know about that!
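For anyone finding this later, the pattern as I understand it looks roughly like this (the queue name and training entry point are made up):
```
from clearml import Task

task = Task.init(project_name="project name", task_name="remote run")

# Everything up to here runs locally; this call enqueues the task
# and, with exit_process=True, stops the local process.
task.execute_remotely(queue_name="gpu-queue", exit_process=True)

# From here on, the code executes on the agent that pulls the task.
train_model()  # hypothetical training entry point
```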