Tried it. I updated the script (attached) to add it to the main function instead, ran it locally, and then aborted the job. Then I "reset" the job in the ClearML web interface and ran it remotely on a GPU queue. As you can see in the log (attached), the loss is being printed, but it's not showing up in the scalars (attached picture):
edit: where I ran it after resetting
Local in the sense that my team member set it up, remote to me
Can you move the Task.init() call to the main() function?
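Roughly what I mean, as a minimal sketch rather than your actual script: the project and task names are placeholders, and the task_factory parameter is a hypothetical hook I added only so the function can be exercised without a ClearML server.

```python
def main(task_factory=None):
    """Entry point: the ClearML task is created first thing inside main()."""
    if task_factory is None:
        # Imported lazily so this module can be imported without clearml installed.
        from clearml import Task
        task_factory = Task.init
    # Placeholder names (assumption); use your own project and task names.
    task = task_factory(project_name="examples", task_name="run_mlm")
    # ... parse arguments, build the model, and run training here ...
    return task

# In the real script this would be called from the usual entry point:
#     if __name__ == "__main__":
#         main()
```

The point is only that the task exists before any training or logging code runs.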
here's console output with loss being output
When I was answering the question "are you using a local server", I misinterpreted it as "are you running the agents and queue on a local server station".
As in, I edit Installed Packages, delete everything there, and put that particular list of packages.
Anyhow, it seems that moving it to main() didn't help. Any ideas?
Yes, it trains fine. I can even look at the console output
And how do you log the metrics in your code?
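For comparison, explicit reporting would look roughly like the sketch below. I'm assuming you have a ClearML logger object at hand; report_loss is a hypothetical helper, and the "loss" / "train" names are placeholders.

```python
def report_loss(logger, loss, step):
    """Report one training-loss value as a scalar on the given logger."""
    # report_scalar(title, series, value, iteration) is the ClearML Logger method;
    # "loss" is the plot title and "train" the series name (both placeholders).
    logger.report_scalar(
        title="loss",
        series="train",
        value=float(loss),
        iteration=int(step),
    )
```

With a real task you would pass task.get_logger() as the logger. If a value reported this way shows up in the scalars while the automatically captured loss does not, that would narrow down where the problem is.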
SuccessfulKoala55 I think I just realized I had a misunderstanding. I don't think we are running a local server version of ClearML, no. We have a workstation running a queue/agents, but ClearML itself is via http://app.pro.clear.ml, so I don't think we have ClearML running locally. We were tracking experiments before we set up the queue and the workers and all that.
IrritableOwl63 can you confirm - we didn't set up our own server to, like, handle experiment tracking and such?
I'm scrolling through the other thread to see if it's there
Didn't it already have clearml in the dependencies?
Before I enqueued the job, I manually edited Installed Packages thus: boto3 datasets clearml tokenizers torch
and added pip install git+ to the setup script.
And the docker image is nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04
I did all that because I've been having this other issue: https://clearml.slack.com/archives/CTK20V944/p1624892113376500
essentially running this: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
IrritableOwl63 in the profile page, look at the bottom right corner
SuccessfulKoala55 the clearml version on the server, according to my colleague, is: clearml-agent --version CLEARML-AGENT version 1.0.0
Long story, but in the other thread I couldn't install the particular version of transformers unless I removed it from "Installed Packages" and added it to setup script instead. So I took to just throwing in that list of packages.
And the server version? You can see it in the profile page
not much different from the HuggingFace version, I believe
Do I get the server version from the https://app.pro.clear.ml UI somewhere, SuccessfulKoala55?
I went to https://app.pro.clear.ml/profile and looked in the bottom right. But would this tell us about the version of the server run by Dan?