This seems to work:
` from clearml import Logger

for test_metric in posttrain_metrics:
    print(test_metric, posttrain_metrics[test_metric])
    # report_scalar(title, series, value, iteration)
    Logger.current_logger().report_scalar("test", test_metric, posttrain_metrics[test_metric], 0) `
This seems similar but not quite the thing I'm looking for: https://allegro.ai/clearml/docs/docs/tutorials/tutorial_explicit_reporting.html#step-1-setting-an-output-destination-for-model-checkpoints
What I'm curious about is how ClearML hooks into that to know to upload the other artifacts such as optimizer.pt .
I've been trying to do things like "color these five experiments one color, color these other five a different color", but then once I maximize the plot, the colors all change
BroadCoyote44 you, uh, might want to delete the bit of your message with the secret key in it?
Or examples of, like, "select all experiments in project with iterations > 0"?
Hang on, CostlyOstrich36 I just noticed that there's a "project compute time" on the dashboard? Do you know how that is calculated/what that is?
One has an active duration of 185502. Dividing that by 60 gives you minutes... oh, I did the math wrong. Need to divide by 60 again to get hours.
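Spelling out the conversion (assuming the active duration is reported in seconds):

```python
active_duration_s = 185502        # active duration as reported

minutes = active_duration_s / 60  # ≈ 3091.7 minutes
hours = minutes / 60              # divide by 60 again
print(f"{hours:.1f} hours")       # ≈ 51.5 hours
```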
CostlyOstrich36 I get some weird results for "active duration".
For example, several of the experiments show that their active duration is more than 90 days, but I definitely didn't run them that long.
OK, definitely fix that in the snippet, lol
I might not be able to get to that but if you create an issue I'd be happy to link or post what I came up with, wdyt?
CostlyOstrich36 I made a code snippet for you:
` from clearml import Task

# figuring out the project ID
project_list = Task.get_projects()  # get all the projects
project_id = Task.get_project_id("your project name here")

# getting all the tasks for a project
tasks = Task.get_all(project=[project_id]).response.tasks

# loop through and get approximate maximum gpu-seconds by type
import random
from collections import defaultdict
task = random.choice(tasks)
print(dir(task))
print(task.runtim...
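For what it's worth, here's roughly where that loop was headed: summing active seconds per GPU type with the `defaultdict`. The task fields below (`gpu_type`, `active_duration`) are hypothetical stand-ins, not the real attribute names on the ClearML task objects:

```python
from collections import defaultdict

# hypothetical stand-ins for the objects Task.get_all(...) returns;
# the real task objects expose different attribute names
tasks = [
    {"gpu_type": "V100", "active_duration": 3600},
    {"gpu_type": "V100", "active_duration": 1800},
    {"gpu_type": "T4",   "active_duration": 7200},
]

# accumulate approximate gpu-seconds per GPU type
gpu_seconds = defaultdict(int)
for task in tasks:
    gpu_seconds[task["gpu_type"]] += task["active_duration"]

print(dict(gpu_seconds))  # {'V100': 5400, 'T4': 7200}
```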
CostlyOstrich36 at the bottom of the screenshot it says "Compute Time: 440 days"
essentially running this: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
So in theory we could hook into one of those functions and add a line to have ClearML upload that particular json we want
Yes, it trains fine. I can even look at the console output
OK, neat! Any advice on how to edit the training loop to do that? Because the code I'm using doesn't offer easy access to the training loop, see here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L469
trainer.train()
just does the training loop automagically, and saves a checkpoint once in a while. When it saves a checkpoint, ClearML uploads all the other files. How can I hook into... whatever ...
My other question is: how does it decide what to upload automatically? It picked up almost everything, just not trainer_state.json. Which I'm actually not quite sure is necessary
OK, I added
Task.current_task().upload_artifact(name='trainer_state', artifact_object=os.path.join(output_dir, "trainer_state.json"))
after this line:
And it seems to be working.
Presumably the correct way to do this is to fork the transformers library, make the change, and add that version to my requirements.txt
Which is defined, it seems, here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_tf.py#L459
Yeah, we don't even get to line 480, all the training loop is within line 469, I think.
I guess I could try and edit that, somehow. Hmm
Oh, here's an example, a screenshot I took of the files in my Colab instance:
I think the model state is just post training loop (not inside the loop), no?
trainer_state.json gets updated every time a "checkpoint" gets saved. I've got that set to once an epoch.
My testing indicates that if the training gets interrupted, I can resume training from a saved checkpoint folder that includes trainer_state.json. It uses the info to determine which data to skip, where to pick back up again, etc
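The resume mechanics can be sketched like so. The two fields used (`epoch`, `global_step`) do appear in real trainer_state.json files, but the skip logic below is a simplified stand-in for what the Trainer actually does:

```python
import json

# a minimal trainer_state.json as saved alongside a checkpoint;
# real files carry many more fields
state_json = '{"epoch": 1.0, "global_step": 500}'
state = json.loads(state_json)

total_steps = 1500

# simplified stand-in for the resume logic: skip everything already
# trained and pick up at the next step
resume_step = state["global_step"]
remaining = total_steps - resume_step
print(f"resuming at step {resume_step}, {remaining} steps left")
```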
oooh, that's awesome lol. Never thought to do it that way