Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
Hi, I'M Trying To Use Clearml On Pytorch-Lightning With Multiple Gpus, But It Seems As If The Server Does Not Monitor The Experiment. I Can See No Progress In The Console (Steps Counter Stays On 0) Nor Any Tensorboard Loggings. On A Single Gpu Everything

Hi,
I'm trying to use clearml on pytorch-lightning with multiple gpus, but it seems as if the server does not monitor the experiment. I can see no progress in the console (steps counter stays on 0) nor any tensorboard loggings. on a single GPU everything works fine.

ubuntu 20.04
dgx (A100X8)
python 3.8.12
torch 1.10.0
pytorch-lightning 1.6.0
cleaml 1.3.0
clearml-agent 1.1.2

  
  
Posted 2 years ago
Votes Newest

Answers 21


I see. Just to simplify the issue - When using pytorch - were you getting machine usage reports (CPU/GPU usage)?
Also, I'm guessing various scalars aren't being reported. I'm guessing those were previously captured automatically by Clearml?

  
  
Posted 2 years ago

we used to use pytorch and it worked just fine, but now we moved to pytorch-lightning (kind of extension on pytorch that gives keras-ish functionality)

  
  
Posted 2 years ago

Hi Natan,
agent command: clearml-agent daemon --gpu all
I'm using 8 gpus. the model runs on all of them, but the logging isn't working

  
  
Posted 2 years ago

Hello. Sorry for bringing up the thread. I am facing the same issue on clearml-agent version 1.4.1 and clearml version 1.8.0 . Can you please point me to a github issue FancyTurkey50 or any resolution CostlyOstrich36 ?

  
  
Posted 2 years ago

we used the pytorch with multi-gpu (ddp)

  
  
Posted 2 years ago

with one gpu it works fine

  
  
Posted 2 years ago

will try 2

  
  
Posted 2 years ago

thanks!

  
  
Posted 2 years ago

The issue has been resolved. Details in the same github issue https://github.com/allegroai/clearml/issues/635#issuecomment-1324870817
CostlyOstrich36 FancyTurkey50 in case this was still unresolved at your end.

  
  
Posted 2 years ago

I'm still getting the machine usage reports

  
  
Posted 2 years ago

Regular pytorch - you mean single GPU (I'm not familiar with torch distributed)?
Also just to give it a try, can you test with only 2 GPU's for example?

  
  
Posted 2 years ago

And everything works fine with regular pytorch

  
  
Posted 2 years ago

I have posted an update on a relevant issue - https://github.com/allegroai/clearml/issues/635

  
  
Posted 2 years ago

FancyTurkey50 , could you open a github issue for this so we could follow it? I'm quite curious

  
  
Posted 2 years ago

Cool, thanks for the info! I'll try to play with it as well 🙂

  
  
Posted 2 years ago

and it doesnt work for 2 gpus either

  
  
Posted 2 years ago

with regular pytorch it worked when running on all 8 gpus

  
  
Posted 2 years ago

Also, what if you try using only one GPU with pytorch-lightning? Still nothing is reported - i.e. console/scalars?

  
  
Posted 2 years ago

Also, how many GPUs are you trying to run off?

  
  
Posted 2 years ago

But no console or auto capturing of scalars?

  
  
Posted 2 years ago

Hi FancyTurkey50 , how did you run the agent command?

  
  
Posted 2 years ago
1K Views
21 Answers
2 years ago
one year ago
Tags