Answered

Hello, I'm Not Getting Training Metrics Tracked By ClearML When I Execute A Training Script Remotely, But I Get Them If I Run Locally. Is It Because I Have A Task.init() In The File? What Happens When You Remotely Run A Script Which Has An init() In It?

Hello, I'm not getting training metrics tracked by ClearML when I execute a training script remotely, but I get them if I run locally. Is it because I have a Task.init() in the file? What happens when you remotely run a script which has an init() in it?

Specifically, I get loss curves, validation metrics and the like under "train" in Scalars when I run it locally, but if I, say, clone the job and enqueue it on a remote queue, I only get monitor:gpu and monitor:machine.

The first few lines of the script are:
` from clearml import Task, Dataset

# Colin: Add ClearML task.
task = Task.init(
    project_name="project name",
    task_name="whynoscalars"
) `
Attached is what it looks like when I run locally versus on the remote queue:

  				
Posted 3 years ago by SmallDeer34


Answers 30

When I was answering the question "are you using a local server", I misinterpreted it as "are you running the agents and queue on a local server station".

  				
Posted 3 years ago by SmallDeer34

SuccessfulKoala55 I think I just realized I had a misunderstanding. I don't think we are running a local server version of ClearML, no. We have a workstation running a queue/agents, but ClearML itself is via http://app.pro.clear.ml, so I don't think we have ClearML running locally. We were tracking experiments before we set up the queue and the workers and all that.

IrritableOwl63 can you confirm - we didn't set up our own server to, like, handle experiment tracking and such?

  				
Posted 3 years ago by SmallDeer34

I went to https://app.pro.clear.ml/profile and looked in the bottom right. But would this tell us about the version of the server run by Dan?

  				
Posted 3 years ago by SmallDeer34

In my profile page it's 1.0.2

  				
Posted 3 years ago by SmallDeer34

IrritableOwl63 in the profile page, look at the bottom right corner

  				
Posted 3 years ago by AgitatedDove14

Do I get the server version from the https://app.pro.clear.ml UI somewhere SuccessfulKoala55 ?

  				
Posted 3 years ago by IrritableOwl63

And the server version? You can see it in the profile page

  				
Posted 3 years ago by SuccessfulKoala55

SuccessfulKoala55 the clearml version on the server, according to my colleague, is:
`clearml-agent --version`
`CLEARML-AGENT version 1.0.0`

  				
Posted 3 years ago by SmallDeer34

As in, I edit Installed Packages, delete everything there, and put that particular list of packages.

  				
Posted 3 years ago by SmallDeer34

Long story, but in the other thread I couldn't install the particular version of transformers unless I removed it from "Installed Packages" and added it to the setup script instead. So I took to just throwing in that list of packages.

  				
Posted 3 years ago by SmallDeer34

> Before I enqueued the job, I manually edited Installed Packages thus

Didn't it already have clearml in the dependencies?

  				
Posted 3 years ago by SuccessfulKoala55

I'm scrolling through the other thread to see if it's there

  				
Posted 3 years ago by SmallDeer34

Server version?

  				
Posted 3 years ago by SuccessfulKoala55

That's what I meant 🙂

  				
Posted 3 years ago by SuccessfulKoala55

Local in the sense that my team member set it up, remote to me

  				
Posted 3 years ago by SmallDeer34

Also, what ClearML SDK version?

  				
Posted 3 years ago by SuccessfulKoala55

Are you using a local server?

  				
Posted 3 years ago by SuccessfulKoala55

Anyhow, it seems that moving it to main() didn't help. Any ideas?

  				
Posted 3 years ago by SmallDeer34

Before I enqueued the job, I manually edited Installed Packages thus:
`boto3 datasets clearml tokenizers torch`
and added
`pip install git+`
to the setup script.

And the docker image is
`nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04`

I did all that because I've been having this other issue: https://clearml.slack.com/archives/CTK20V944/p1624892113376500

  				
Posted 3 years ago by SmallDeer34

Tried it. Updated the script (attached) to add it to the main function instead. Then ran it locally. Then aborted the job. Then "reset" the job in the ClearML web interface and ran it remotely on a GPU queue. As you can see in the log (attached), loss is being reported, but it's not showing up in the scalars (attached picture):

edit: where I ran it after resetting

  				
Posted 3 years ago by SmallDeer34

Sure, I can give that a try!

  				
Posted 3 years ago by SmallDeer34

Can you move the Task.init() call to the main() function?
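
For illustration only, a minimal sketch of what that suggestion might look like, reusing the project and task names from the question. The `TASK_KWARGS` name and the body of `main()` are assumptions for this example, not the poster's actual script, and elsewhere in this thread the poster reports that this change alone did not fix the missing scalars.

```python
# Hypothetical settings, copied from the snippet in the question.
TASK_KWARGS = {"project_name": "project name", "task_name": "whynoscalars"}


def main():
    # Imported here so the sketch can be loaded without ClearML installed;
    # a module-level "from clearml import Task" works just as well.
    from clearml import Task

    # Task.init() now runs when main() is called, not when the file is imported.
    task = Task.init(**TASK_KWARGS)

    # ... parse arguments, build the model and run training here ...
    return task
```

In a real script, `main()` would be called from the usual `if __name__ == "__main__":` guard at the bottom of the file.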

  				
Posted 3 years ago by SuccessfulKoala55

not much different from the HuggingFace version, I believe

  				
Posted 3 years ago by SmallDeer34

Here's the actual script I'm using

  				
Posted 3 years ago by SmallDeer34

essentially running this: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py

  				
Posted 3 years ago by SmallDeer34

yup

  				
Posted 3 years ago by SmallDeer34

This is when running remotely, right?

  				
Posted 3 years ago by SuccessfulKoala55

here's console output with loss being output

  				
Posted 3 years ago by SmallDeer34

And how do you log the metrics in your code?
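
For context, one way metrics can be reported explicitly is through the task's logger. The sketch below is an assumption for illustration, not how the poster's script logs (that script follows the HuggingFace example). The helper name `report_losses` and the "train"/"loss" labels are hypothetical; `report_scalar(title, series, value, iteration)` is the ClearML `Logger` call it relies on.

```python
def report_losses(logger, losses, title="train", series="loss"):
    """Report one scalar per training step to a ClearML-style logger.

    `logger` is expected to expose report_scalar(title, series, value, iteration),
    as the object returned by task.get_logger() does.
    """
    for iteration, value in enumerate(losses):
        logger.report_scalar(
            title=title, series=series, value=value, iteration=iteration
        )
    # Return how many points were sent, which is handy for a quick sanity check.
    return len(losses)
```

With a real task this might be called as `report_losses(task.get_logger(), [0.9, 0.5, 0.3])`. If points sent this way appear under Scalars on the remote run while the framework-reported ones do not, that would suggest the problem lies in the automatic reporting rather than in the connection to the server.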

  				
Posted 3 years ago by SuccessfulKoala55

Yes, it trains fine. I can even look at the console output

  				
Posted 3 years ago by SmallDeer34
