
Hello, I'm not getting training metrics tracked by ClearML when I execute a training script remotely, but I do get them when I run it locally. Is it because I have a Task.init() in the file? What happens when you remotely run a script that has an init() in it?

Specifically, when I run it locally I get loss curves, validation metrics, and the like under "train" in Scalars, but if I, say, clone the job and enqueue it on a remote queue, I only get monitor:gpu and monitor:machine.

The first few lines of the script are:
```python
from clearml import Task, Dataset

# Colin: Add ClearML task.
task = Task.init(
    project_name="project name",
    task_name="whynoscalars"
)
```
Screenshots of what it looks like when I run locally vs. on the remote queue are attached.

  
  
Posted one year ago

Answers 30


IrritableOwl63, on the profile page, look at the bottom right corner.

  
  
Posted one year ago

Before I enqueued the job, I manually edited Installed Packages thus: `boto3 datasets clearml tokenizers torch`, and added `pip install git+` to the setup script.
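If the aim is for the remote agent to install the same versions that work locally, one option is to pin them before pasting the list into Installed Packages. This is only a sketch using the standard library's `importlib.metadata`; `pin_requirements` is a hypothetical helper, and any package not installed locally is left unpinned.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import List


def pin_requirements(packages: List[str]) -> List[str]:
    """Return "name==version" for each locally installed package, or the bare name if missing."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(name)  # not installed locally: leave unpinned
    return lines


if __name__ == "__main__":
    # The package list from the post above.
    for line in pin_requirements(["boto3", "datasets", "clearml", "tokenizers", "torch"]):
        print(line)
```

The printed lines can be pasted one per line into Installed Packages.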

And the Docker image is `nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04`.

I did all that because I've been having this other issue: https://clearml.slack.com/archives/CTK20V944/p1624892113376500

  
  
Posted one year ago

SuccessfulKoala55, can I get the server version somewhere in the https://app.pro.clear.ml UI?

  
  
Posted one year ago

SuccessfulKoala55, I think I just realized I had a misunderstanding. I don't think we are running a local server version of ClearML, no. We have a workstation running a queue/agents, but ClearML itself is accessed via http://app.pro.clear.ml and I don't think we have ClearML running locally. We were tracking experiments before we set up the queue, the workers, and all that.

IrritableOwl63, can you confirm: we didn't set up our own server to, like, handle experiment tracking and such?

  
  
Posted one year ago

I went to https://app.pro.clear.ml/profile and looked in the bottom right. But would this tell us about the version of the server run by Dan?

  
  
Posted one year ago

Yes, it trains fine. I can even look at the console output.

  
  
Posted one year ago

And how do you log the metrics in your code?
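For comparison, explicit reporting through ClearML's `Logger.report_scalar(title, series, value, iteration)` looks roughly like this. It is an illustration, not the script from this thread: `train` and its decaying stand-in loss are hypothetical, and the reporter also keeps a local copy of each scalar so the sketch runs even without clearml installed. It assumes that, when clearml is available, `Task.init()` has already been called in the same process.

```python
from typing import Callable, List, Tuple

try:
    from clearml import Logger  # ClearML SDK; optional for this sketch
except ImportError:
    Logger = None  # lets the sketch run without clearml installed

Record = Tuple[str, str, float, int]
Reporter = Callable[[str, str, float, int], None]


def make_reporter(sink: List[Record]) -> Reporter:
    """Build a reporter that stores each scalar and forwards it to ClearML when a task is active."""

    def report(title: str, series: str, value: float, iteration: int) -> None:
        sink.append((title, series, value, iteration))
        logger = Logger.current_logger() if Logger is not None else None
        if logger is not None:  # None when no Task.init() has run in this process
            logger.report_scalar(title, series, value=value, iteration=iteration)

    return report


def train(num_steps: int, report: Reporter) -> float:
    """Hypothetical training loop: the loss just decays by 10% per step."""
    loss = 1.0
    for step in range(num_steps):
        loss *= 0.9  # stand-in for a real optimizer step
        report("train", "loss", loss, step)
    return loss


if __name__ == "__main__":
    records: List[Record] = []
    final_loss = train(3, make_reporter(records))
    print(f"final loss {final_loss:.3f}, {len(records)} scalars reported")
```

Reporting explicitly does not depend on ClearML's automatic framework hooks picking up the metrics, which can help narrow down whether the remote run's problem is in capture or elsewhere.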

  
  
Posted one year ago

Here's the console output, with the loss being printed.

  
  
Posted one year ago

This is when running remotely, right?

  
  
Posted one year ago

yup

  
  
Posted one year ago

Not much different from the Hugging Face version, I believe.

  
  
Posted one year ago

Here's the actual script I'm using

  
  
Posted one year ago

Can you move the Task.init() call to the main() function?

  
  
Posted one year ago

Tried it. I updated the script (attached) to call it in the main function instead, ran it locally, and then aborted the job. Then I "reset" the job in the ClearML web interface and ran it remotely on a GPU queue. As you can see in the log (attached), the loss is being computed, but it's not showing up in the scalars (attached picture).

Edit: this is where I ran it after resetting.

  
  
Posted one year ago

Anyhow, it seems that moving it to main() didn't help. Any ideas?

  
  
Posted one year ago

Local in the sense that my team member set it up; remote to me.

  
  
Posted one year ago

Also, what ClearML SDK version?
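Three different versions come up in this thread: the ClearML SDK (the `clearml` package), the agent (`clearml-agent`), and the server, which is shown on the web UI's profile page. A minimal sketch for reading the first two from the Python environment that runs the script, using only `importlib.metadata`; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed distribution version, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for pkg in ("clearml", "clearml-agent"):
        print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```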

  
  
Posted one year ago

Are you using a local server?

  
  
Posted one year ago

I'm scrolling through the other thread to see if it's there

  
  
Posted one year ago

That's what I meant 🙂

  
  
Posted one year ago

Server version?

  
  
Posted one year ago

Long story, but in the other thread I couldn't install the particular version of transformers unless I removed it from "Installed Packages" and added it to the setup script instead. So I resorted to just putting in that list of packages.

  
  
Posted one year ago

SuccessfulKoala55, according to my colleague, the ClearML version on the server is: `clearml-agent --version` prints `CLEARML-AGENT version 1.0.0`.

  
  
Posted one year ago

As in, I edit Installed Packages, delete everything there, and put in that particular list of packages.

  
  
Posted one year ago

And the server version? You can see it on the profile page.

  
  
Posted one year ago

When I was answering the question "are you using a local server", I misinterpreted it as "are you running the agents and queue on a local server station".

  
  
Posted one year ago

On my profile page it's 1.0.2.

  
  
Posted one year ago

> Before I enqueued the job, I manually edited Installed Packages thus

Didn't it already have clearml in the dependencies?

  
  
Posted one year ago

Sure, I can give that a try!

  
  
Posted one year ago