AgitatedDove14 I'm making some progress on this. I've currently got the situation that my training run saved all of these files, and Task.get_task(param['TaskA']).models['output'][-1] gets me just one of them, training_args.bin. Then -2 gets me another, rng_state.pth.
If I just get Task.get_task(param['TaskA']).models['output'], I end up getting a huge list like [<clearml.model.Model object at 0x7fec2841c880>, <clearml.model.Model object at 0x7fec2841...
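For my own reference, here's roughly how I could pick out the file I want by name instead of guessing list positions. Just a sketch; the task ID and the "pytorch_model" name filter are placeholders, not my actual values:

from clearml import Task

# Placeholder task ID -- in my case it comes from param['TaskA']
prev_task = Task.get_task(task_id="abc123")

# models['output'] is a list of clearml.model.Model objects, one per uploaded file
for m in prev_task.models['output']:
    print(m.name, m.url)

# Select the checkpoint by name rather than by list position
wanted = next(m for m in prev_task.models['output'] if "pytorch_model" in m.name)
local_path = wanted.get_local_copy()  # downloads the file to the local cache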
Interesting, I wasn't aware of the possibilities you outline there at the end, where you, like, programmatically pull all the results down for all the tasks. Neat!
A more complex version of this which I'm trying to figure out:
I trained a model using TaskA. I need to now pull that model down from the saved artifacts of TaskA and fine-tune it in TaskB. That fine-tuning in TaskB spits out a metric.
Is there a way to do this all elegantly? Currently my process is to manually download the model...
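What I'd like it to look like, roughly (just a sketch using the standard clearml SDK calls; the project name, task name, and task ID are placeholders):

from clearml import Task

# TaskB: fetch the model produced by TaskA and fine-tune it
task_b = Task.init(project_name="my_project", task_name="finetune_from_TaskA")
task_a = Task.get_task(task_id="TASK_A_ID")  # placeholder ID

checkpoint = task_a.models['output'][-1]       # or pick one by name, as above
checkpoint_path = checkpoint.get_local_copy()  # local copy of the stored file

# ... load checkpoint_path, fine-tune, then report the resulting metric ...
task_b.get_logger().report_scalar(title="eval", series="metric", value=0.0, iteration=0)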
BroadCoyote44 you, uh, might want to delete the bit of your message with the secret key in it?
Hang on, CostlyOstrich36 I just noticed that there's a "project compute time" on the dashboard? Do you know how that is calculated/what that is?
OK, definitely fix that in the snippet, lol
Here's the hours/days version, corrected now lol:
gpu_hours = {}
gpu_days = {}
for gpu_type, gpu_time_seconds in gpu_seconds.items():
    gpu_time_hours = gpu_time_seconds / 3600
    gpu_hours[gpu_type] = gpu_time_hours
    gpu_days[gpu_type] = gpu_time_hours / 24
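And roughly how gpu_seconds could be filled in the first place (a sketch on my end; the project name, the single "V100" label, and the active_duration field access are all assumptions):

from collections import defaultdict
from clearml import Task

gpu_seconds = defaultdict(float)
for t in Task.get_tasks(project_name="my_project"):       # placeholder project name
    duration = getattr(t.data, "active_duration", None)   # seconds, if the server reports it
    if duration:
        gpu_seconds["V100"] += duration                    # hypothetical single GPU type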
Yeah that should work. Basically --train_file needs the path to train.txt, --validation_file needs the path to validation.txt, etc. I just put them all in the same folder for convenience.
What I'm curious about is how ClearML hooks into that to know to upload the other artifacts such as optimizer.pt.
OK, neat! Any advice on how to edit the training loop to do that? Because the code I'm using doesn't offer easy access to the training loop, see here: https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/examples/pytorch/language-modeling/run_mlm.py#L469
trainer.train() just does the training loop automagically, and saves a checkpoint once in a while. When it saves a checkpoint, ClearML uploads all the other files. How can I hook into... whatever ...
OK, I added
Task.current_task().upload_artifact(name='trainer_state', artifact_object=os.path.join(output_dir, "trainer_state.json"))
after this line:
And it seems to be working.
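For anyone else reading: a cleaner option than editing the example script might be a Trainer callback. A sketch, assuming transformers' TrainerCallback API and its default checkpoint-<step> directory naming:

import os
from clearml import Task
from transformers import TrainerCallback

class UploadTrainerState(TrainerCallback):
    # Upload trainer_state.json as an artifact every time the Trainer saves a checkpoint
    def on_save(self, args, state, control, **kwargs):
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        state_file = os.path.join(ckpt_dir, "trainer_state.json")
        if os.path.exists(state_file):
            Task.current_task().upload_artifact(
                name=f"trainer_state_{state.global_step}",
                artifact_object=state_file,
            )

# trainer.add_callback(UploadTrainerState())  # add before calling trainer.train()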
Yes, it trains fine. I can even look at the console output
One has an active duration of 185502. Dividing that by 60 gives you minutes... oh, I did the math wrong. Need to divide by 60 again to get hours (185502 seconds ≈ 51.5 hours).
Actually at this point, I'd say it's too late, you might want to just generate new credentials...
Here's the console output with the loss being logged
AgitatedDove14 yes, I called init and TensorBoard is installed. It successfully uploaded the metrics from trainer.train(), just not from the next cell where we do trainer.predict.
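A possible hand-rolled workaround while I figure this out (just a sketch; assumes trainer and test_dataset already exist in the notebook, and that trainer.predict(...).metrics is a plain dict):

from clearml import Task

metrics = trainer.predict(test_dataset).metrics   # dict of metric name -> value
logger = Task.current_task().get_logger()
for name, value in metrics.items():
    logger.report_scalar(title="predict", series=name, value=value, iteration=0)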
Tried it. Updated the script (attached) to add it to the main function instead. Then ran it locally. Then aborted the job. Then "reset" the job in the ClearML web interface and ran it remotely on a GPU queue. As you can see in the log (attached), the loss is being logged, but it's not showing up in the scalars (attached picture):
edit: where I ran it after resetting
Gave it a try, it seems our GPU Queue doesn't have the S3 creds set up correctly. Making a separate thread about that
SuccessfulKoala55 the clearml version on the server, according to my colleague, is:
clearml-agent --version
CLEARML-AGENT version 1.0.0
I'm scrolling through the other thread to see if it's there
Before I enqueued the job, I manually edited Installed Packages thus:
boto3
datasets
clearml
tokenizers
torch
and added pip install git+
to the setup script.
And the docker image is nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04
I did all that because I've been having this other issue: https://clearml.slack.com/archives/CTK20V944/p1624892113376500
not much different from the HuggingFace version, I believe
When I was answering the question "are you using a local server", I misinterpreted it as "are you running the agents and queue on a local workstation".
SuccessfulKoala55 I think I just realized I had a misunderstanding. I don't think we are running a local server version of ClearML, no. We have a workstation running a queue/agents, but ClearML itself is via http://app.pro.clear.ml , I don't think we have ClearML running locally. We were tracking experiments before we set up the queue and the workers and all that.
IrritableOwl63 can you confirm - we didn't set up our own server to, like, handle experiment tracking and such?
I went to https://app.pro.clear.ml/profile and looked in the bottom right. But would this tell us about the version of the server run by Dan?
Long story, but in the other thread I couldn't install the particular version of transformers unless I removed it from "Installed Packages" and added it to the setup script instead. So I took to just throwing in that list of packages.
Good point! Any pointers to API docs to start looking?