Which is defined, it seems, here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/trainer_tf.py#L459
I suppose the flow would be something like:
(1) select all experiments from project X with iterations greater than Y, (2) pull the runtime for each one, and (3) add them all up. I just don't know what API calls to make for steps 1 and 2.
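Roughly what I'm picturing, as a sketch (the project name and iteration threshold are placeholders, and I'm assuming the backend task data exposes an active_duration field in seconds; the exact field name may differ):
```
from clearml import Task

PROJECT_NAME = "your project name here"  # placeholder
MIN_ITERATIONS = 1000                    # placeholder for "Y"

# 1. select all experiments from the project
tasks = Task.get_tasks(project_name=PROJECT_NAME)

total_seconds = 0
for t in tasks:
    # keep only experiments with more than MIN_ITERATIONS iterations
    if (t.get_last_iteration() or 0) <= MIN_ITERATIONS:
        continue
    # 2. pull the runtime for each one and add them all up
    # (assumption: active_duration is in seconds on the backend task data)
    total_seconds += getattr(t.data, "active_duration", 0) or 0

print(f"total compute: {total_seconds / 3600:.1f} hours")
```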
I might not be able to get to that but if you create an issue I'd be happy to link or post what I came up with, wdyt?
One has an active duration of 185502. Dividing that by 60 gives you minutes... oh, I did the math wrong. I need to divide by 60 again to get hours.
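In other words, 185502 seconds works out to about 51.5 hours:
```
active_duration = 185502         # seconds
minutes = active_duration / 60   # ~3091.7 minutes
hours = minutes / 60             # ~51.5 hours
```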
Hang on, CostlyOstrich36 I just noticed that there's a "project compute time" on the dashboard? Do you know how that is calculated/what that is?
Good point! Any pointers to API docs to start looking?
CostlyOstrich36 I made a code snippet for you:
```
from clearml import Task

# figuring out the project ID
project_list = Task.get_projects()  # get all the projects
project_id = Task.get_project_id("your project name here")

# getting all the tasks for a project
tasks = Task.get_all(project=[project_id]).response.tasks

# loop through and get approximate maximum gpu-seconds by type
import random
from collections import defaultdict

task = random.choice(tasks)
print(dir(task))
print(task.runtim...
```
CostlyOstrich36 I get some weird results for "active duration".
For example, several of the experiments show that their active duration is more than 90 days, but I definitely didn't run them that long.
OK, definitely fix that in the snippet, lol
CostlyOstrich36 nice, thanks for the link. I know that the "Info" tab on the experiments dashboard includes gpu_type and started/completed times; I'll give it a go based on that.
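A rough sketch of what I mean, reusing `tasks` from the snippet above. The started / completed / gpu_type / gpu_count names are taken from what the "Info" tab shows, so treat them as assumptions about the actual API fields:
```
from collections import defaultdict
from datetime import datetime

def _to_dt(value):
    # assumption: timestamps come back as datetime objects or ISO-8601 strings
    if value is None or isinstance(value, datetime):
        return value
    return datetime.fromisoformat(str(value).replace("Z", "+00:00"))

gpu_seconds = defaultdict(float)
for task in tasks:  # `tasks` from the snippet above
    started = _to_dt(getattr(task, "started", None))
    completed = _to_dt(getattr(task, "completed", None))
    if not started or not completed:
        continue  # skip tasks that never started or never finished
    wall_seconds = (completed - started).total_seconds()

    # assumption: the gpu_type / gpu_count shown in the "Info" tab are
    # available under task.runtime; adjust the names if they differ
    runtime = getattr(task, "runtime", None) or {}
    gpu_type = runtime.get("gpu_type", "unknown")
    gpu_count = int(runtime.get("gpu_count", 1) or 1)

    # approximate upper bound: counts the whole wall-clock time as GPU time
    gpu_seconds[gpu_type] += wall_seconds * gpu_count
```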
CostlyOstrich36 at the bottom of the screenshot it says "Compute Time: 440 days"
Here's the hours/days version, corrected now lol:
```
gpu_hours = {}
gpu_days = {}
for gpu_type, gpu_time_seconds in gpu_seconds.items():
    gpu_time_hours = gpu_time_seconds / 3600
    gpu_hours[gpu_type] = gpu_time_hours
    gpu_days[gpu_type] = gpu_time_hours / 24
```
Hmm, I tried publishing one and it doesn't seem to have worked quite that easily: https://app.pro.clear.ml/projects/b4a1875539cb4d9798529439801402ee/experiments/6f4cb4718c7c4a25b3a041c63f6ff2b4/execution?columns=selected&columns=type&columns=last_iteration&columns=hyperparams.Args.num_train_epochs&columns=name&columns=status&columns=users&columns=started&columns=last_update&columns=tags&columns=parent.name&columns=project.name&columns=m.2eed1fe0db36d674643b5f84d2adf46e.06eaeb413e7213cb8b5419...
OK, so if I've got, like, 2x 16GB GPUs and 2x 32GB GPUs, I could allocate all the 16GB GPUs to one queue and all the 32GB ones to another?
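Something like this is what I have in mind on the agent side (queue names and GPU indices are made up; just a sketch of two clearml-agent daemons, each pinned to one pair of cards):
```
# agent serving a "16gb" queue, pinned to the two 16GB cards
clearml-agent daemon --queue 16gb --gpus 0,1 --detached

# agent serving a "32gb" queue, pinned to the two 32GB cards
clearml-agent daemon --queue 32gb --gpus 2,3 --detached
```
And for a "1x16gb" style queue, I guess you'd run one agent per card (one with --gpus 0, one with --gpus 1), both listening on the same queue, so each job grabs a single GPU.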
It would certainly be nice to have. Lately I've heard of groups that do slices of datasets for distributed training, or who "stream" data.
I know it's running these lines, which get defined in https://github.com/huggingface/transformers/blob/f6e254474cb4f90f8a168a599b9aaf3544c37890/src/transformers/trainer_pt_utils.py#L828:
```
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
```
it's one where I reset it, and cleared out the Installed Packages to only have transformers @ git+https://github.com/huggingface/transformers@61c506349134db0a0a2fd6fb2eff8e29a2f84e79 in it.
essentially running this: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
Yes, it trains fine. I can even look at the console output
something like this is what I'm looking for
Then when I queue up a job on the 1x16gb queue it would run on one of the two GPUs?
Aggregating the sort of range across all the runs, maybe like a hurricane track?
OK, I added
Task.current_task().upload_artifact(name='trainer_state', artifact_object=os.path.join(output_dir, "trainer_state.json"))
after this line:
And it seems to be working.
Oh, btw, I assume you mean http://clear.ml , not http://clearml.ml ?
I know the documentation says that you can give it a wildcard or pathlib Path - but I'm still not quite sure exactly how to tell it "top-level files only, not subfolders".
I've been trying to do things like "color these five experiments one color, color these other five a different color", but then once I maximize the thing the colors all change
Well, I can just work around it now that I know, by creating a folder with no subfolders and uploading that. But... 🤔 perhaps allow the interface to take in a list or generator? As in:
```
files_to_upload = [f for f in output_dir.glob("*") if f.is_file()]
Task.current_task().upload_artifact(
    "best_checkpoint", artifact_object=files_to_upload)
```
And then it could zip up the list and name it "best_checkpoint"?
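In the meantime, the folder-with-no-subfolders workaround looks roughly like this (a sketch; output_dir and the staging folder name are placeholders, and I'm assuming upload_artifact packages a folder path into a single artifact as the docs describe):
```
import shutil
from pathlib import Path

from clearml import Task

output_dir = Path("output")  # placeholder: wherever the checkpoint files live

# copy only the top-level files (no subfolders) into a flat staging folder
flat_dir = output_dir.parent / "best_checkpoint_flat"
flat_dir.mkdir(exist_ok=True)
for f in output_dir.glob("*"):
    if f.is_file():
        shutil.copy2(f, flat_dir / f.name)

# upload the flat folder as one artifact named "best_checkpoint"
Task.current_task().upload_artifact("best_checkpoint", artifact_object=flat_dir)
```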
Oh yeah, that's been bugging me for a while