Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
Hi Everyone! I Try To Run Pytorch Lightning Code On Slurm With Srun Script Like This (

Hi everyone!

I try to run Pytorch Lightning code on SLURM with srun script like this ( https://pytorch-lightning.readthedocs.io/en/1.2.10/clouds/slurm.html ). If I use ClearML in code and try to make only one task for SLURM job, I use params like this

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256gb
#SBATCH --gres=gpu:2

But this config runs 1-gpu training somehow instead of 2-gpu training. Can someone give me advice what's going wrong?

Thank you in advance!

  
  
Posted one year ago
Votes Newest

Answers 7


Hi @<1569496075083976704:profile|SweetShells3>
Try to do:

import torch.distributed as dist

if dist.get_rank()==0:
  task = Task.init(...)

This will make sure only the "master" process is logged

or

if int(os.environ.get('RANK'))==0:
  task = Task.init(...)
  
  
Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> in this case I get AttributeError: 'NoneType' object has no attribute 'report_scalar' on trainer.fit(...) And Logger.current_logger() - I think non-master processes trying to log something, but have no Logger instance because have no Task instance.

What am I suppose to do to log training correctly? Logs in master process include all training history or I need to concatenate logs from different nodes somehow?

  
  
Posted one year ago

I think non-master processes trying to log something, but have no Logger instance because have no Task instance.

Hmm is your code calling Logger.current_logger() directly ?

Logs in master process include all training history or I need to concatenate logs from different nodes somehow?

So the main problem is that you need to pass the TASK ID that the master node creates to the second node, so it can report to the same Task.
I know that the enterprise version of ClearML supports SLURM and does exactly that (actually the launching itself is from the ClearML UI and slurm does the scheduling, and then everything is taken care of)
Can you think of a way to pass info from master to second node ? You can always of course limit reporting in your code in case you are not the master

  
  
Posted one year ago

Yes they are supposed to be routed there by pytorch dist
(and the TB logs are on the master only anyhow)

  
  
Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> Yes, I have some Logger.current_logger() callings in model class.

If I turn off logging on non-master nodes with RANK checking, I won't loose training logs from non-master nodes (I mean all training logs are on master node, aren't they) ?

  
  
Posted one year ago

UPD: If I use --ntask-per-node=2 then ClearML creates 2 tasks, but I need only 1.

  
  
Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> Okay, thank you so much for your help!

  
  
Posted one year ago
1K Views
7 Answers
one year ago
one year ago
Tags