Answered

Hello, I'm not getting training metrics tracked by ClearML when I execute a training script remotely, but I do get them if I run it locally. Is it because I have a Task.init() in the file? What happens when you remotely run a script which has an init() in it?

Specifically, when I run it locally I get loss curves, validation metrics and the like in train under scalars, but if I, say, clone the job and enqueue it on a remote queue, I only get monitor:gpu and monitor:machine.

The first few lines of the script are:
`from clearml import Task, Dataset

# Colin: Add ClearML task.
task = Task.init(
    project_name="project name",
    task_name="whynoscalars"
)`
What it looks like when I run locally vs on the remote queue are attached:

  
  
Posted 3 years ago

Answers 30


Do I get the server version from the https://app.pro.clear.ml UI somewhere SuccessfulKoala55 ?

  
  
Posted 3 years ago

As in, I edit Installed Packages, delete everything there, and put that particular list of packages.

  
  
Posted 3 years ago

That's what I meant 🙂

  
  
Posted 3 years ago

Also, what ClearML SDK version?
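One way to answer this from inside the environment that actually runs the script is to ask Python which package versions are installed. This is only a sketch: `installed_version` is a hypothetical helper name, and it assumes the SDK and the agent are installed under their PyPI names `clearml` and `clearml-agent`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumed PyPI distribution names for the SDK and the agent.
for name in ("clearml", "clearml-agent"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running it both locally and inside the remote job would show whether the two environments use the same SDK version.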

  
  
Posted 3 years ago

yup

  
  
Posted 3 years ago

not much different from the HuggingFace version, I believe

  
  
Posted 3 years ago

Can you move the Task.init() call to the main() function?
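A minimal sketch of what that could look like, assuming a script with a simple epoch loop. `train_one_epoch` is a hypothetical stand-in for the real training code, the project and task names are the ones from the question, and nothing here is confirmed to resolve the missing scalars.

```python
def train_one_epoch(epoch: int) -> float:
    """Hypothetical stand-in for a real training step; returns a fake, decreasing loss."""
    return 1.0 / (epoch + 1)


def main(num_epochs: int = 3) -> None:
    # Imported here only so the sketch can be loaded without ClearML installed;
    # a real script would normally import it at the top of the file.
    from clearml import Task

    # Task.init() now runs as the first step of main(), not at module import time.
    task = Task.init(project_name="project name", task_name="whynoscalars")
    logger = task.get_logger()

    for epoch in range(num_epochs):
        loss = train_one_epoch(epoch)
        # Explicitly report the loss so it does not depend on automatic capture.
        logger.report_scalar(title="train", series="loss", value=loss, iteration=epoch)

    task.close()
```

The script would then call `main()` from the usual `if __name__ == "__main__":` guard.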

  
  
Posted 3 years ago

Sure, I can give that a try!

  
  
Posted 3 years ago

And how do you log the metrics in your code?
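For comparison, explicit logging through the ClearML logger can be sketched as below. `report_losses` and `ScalarLogger` are hypothetical names introduced for illustration; the only ClearML call assumed is `report_scalar(title, series, value, iteration)` on the logger object. Whether the script in this thread logs explicitly or relies on automatic capture is not stated here.

```python
from typing import Iterable, Protocol


class ScalarLogger(Protocol):
    """Anything exposing report_scalar with ClearML's Logger keyword arguments."""

    def report_scalar(self, title: str, series: str, value: float, iteration: int) -> None: ...


def report_losses(
    logger: ScalarLogger,
    losses: Iterable[float],
    title: str = "train",
    series: str = "loss",
) -> int:
    """Report one scalar per iteration and return how many values were sent."""
    count = 0
    for iteration, loss in enumerate(losses):
        logger.report_scalar(title=title, series=series, value=float(loss), iteration=iteration)
        count += 1
    return count
```

With a running task, it could be called as `report_losses(Task.current_task().get_logger(), losses)`. If explicitly reported values show up under scalars remotely while the automatically captured ones do not, that would narrow the problem down to the automatic capture.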

  
  
Posted 3 years ago

here's console output with loss being output

  
  
Posted 3 years ago

This is when running remotely, right?

  
  
Posted 3 years ago

Yes, it trains fine. I can even look at the console output

  
  
Posted 3 years ago

Long story, but in the other thread I couldn't install the particular version of transformers unless I removed it from "Installed Packages" and added it to setup script instead. So I took to just throwing in that list of packages.

  
  
Posted 3 years ago

Before I enqueued the job, I manually edited Installed Packages thus

Didn't it already have clearml in the dependencies?

  
  
Posted 3 years ago

SuccessfulKoala55 the clearml version on the server, according to my colleague, is:
`clearml-agent --version`
`CLEARML-AGENT version 1.0.0`

  
  
Posted 3 years ago

And the server version? You can see it in the profile page

  
  
Posted 3 years ago

In my profile page it's 1.0.2

  
  
Posted 3 years ago

Local in the sense that my team member set it up, remote to me

  
  
Posted 3 years ago

I went to https://app.pro.clear.ml/profile and looked in the bottom right. But would this tell us about the version of the server run by Dan?

  
  
Posted 3 years ago

When I was answering the question "are you using a local server", I misinterpreted it as "are you running the agents and queue on a local server station".

  
  
Posted 3 years ago

Here's the actual script I'm using

  
  
Posted 3 years ago

Before I enqueued the job, I manually edited Installed Packages thus:
boto3 datasets clearml tokenizers torch
and added
pip install git+
to the setup script.

And the docker image is
nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04

I did all that because I've been having this other issue: https://clearml.slack.com/archives/CTK20V944/p1624892113376500

  
  
Posted 3 years ago

Tried it. Updated the script (attached) to add it to the main function instead. Then ran it locally. Then aborted the job. Then "reset" the job in the ClearML web interface and ran it remotely on a GPU queue. As you can see in the log (attached), loss is being reported, but it's not showing up in the scalars (attached picture):

edit: where I ran it after resetting

  
  
Posted 3 years ago

IrritableOwl63 in the profile page, look at the bottom right corner

  
  
Posted 3 years ago

Are you using a local server?

  
  
Posted 3 years ago

Anyhow, it seems that moving it to main() didn't help. Any ideas?

  
  
Posted 3 years ago

Server version?

  
  
Posted 3 years ago

I'm scrolling through the other thread to see if it's there

  
  
Posted 3 years ago

SuccessfulKoala55 I think I just realized I had a misunderstanding. I don't think we are running a local server version of ClearML, no. We have a workstation running a queue/agents, but ClearML itself is via http://app.pro.clear.ml, so I don't think we have ClearML running locally. We were tracking experiments before we set up the queue and the workers and all that.

IrritableOwl63 can you confirm that we didn't set up our own server to, like, handle experiment tracking and such?

  
  
Posted 3 years ago