
Hi everyone!

I'm trying to run PyTorch Lightning code on SLURM with an srun script like this one ( https://pytorch-lightning.readthedocs.io/en/1.2.10/clouds/slurm.html ). Because I use ClearML in the code and want only one ClearML task per SLURM job, I use parameters like these:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256gb
#SBATCH --gres=gpu:2

But this config somehow runs 1-GPU training instead of 2-GPU training. Can someone advise me on what's going wrong?

Thank you in advance!

  				
Posted one year ago by SweetShells3


Answers 7

"I think non-master processes are trying to log something, but have no Logger instance because they have no Task instance."

Hmm, is your code calling Logger.current_logger() directly?

"Do the logs in the master process include all training history, or do I need to concatenate logs from different nodes somehow?"

So the main problem is that you need to pass the task ID that the master node creates to the second node, so it can report to the same Task.
I know that the enterprise version of ClearML supports SLURM and does exactly that (the launching itself is done from the ClearML UI, SLURM does the scheduling, and everything else is taken care of).
Can you think of a way to pass info from the master to the second node? Of course, you can always limit reporting in your code when you are not the master.
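One way to limit reporting on non-master processes is to guard every logging call against a missing logger. The sketch below assumes, based on the AttributeError reported in this thread, that Logger.current_logger() gives back None in a process where no Task was created. The helper name report_scalar_if_possible is hypothetical and not part of ClearML; it only relies on the logger object having a report_scalar(title, series, value, iteration) method.

```python
def report_scalar_if_possible(logger, title, series, value, iteration):
    """Report a scalar only when a logger exists.

    Returns True if the value was reported, False if it was skipped
    (for example on a non-master process without a Task).
    """
    if logger is None:
        # No Task (and so no logger) in this process: skip silently.
        return False
    logger.report_scalar(title=title, series=series, value=value, iteration=iteration)
    return True


# Hypothetical usage inside the model class:
#   from clearml import Logger
#   report_scalar_if_possible(Logger.current_logger(), "loss", "train", loss_value, step)
```

With this pattern the model code stays the same on every rank, and only the process that called Task.init(...) actually sends scalars.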

  				
Posted one year ago by AgitatedDove14

Hi SweetShells3
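Try this:

import torch.distributed as dist
from clearml import Task

# Create the ClearML Task only on the global rank-0 ("master") process
if dist.get_rank() == 0:
    task = Task.init(...)

This makes sure only the "master" process is logged. Alternatively, check the RANK environment variable (for example, if the process group is not initialized yet):

import os
from clearml import Task

# Assumes the launcher sets RANK; falls back to 0 (master) if it is not set
if int(os.environ.get('RANK', 0)) == 0:
    task = Task.init(...)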
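Under SLURM the rank may not be available as RANK at the point where the Task is created, so a slightly more defensive check can help. The following is a minimal sketch, assuming the job is started with srun (which sets SLURM_PROCID for each task) or with a launcher that sets RANK. The helper names get_global_rank and is_master are hypothetical, and the fallback to rank 0 for a plain single-process run is a design assumption.

```python
import os


def get_global_rank(env=None):
    """Best-effort global rank lookup from environment variables.

    Checks RANK first, then SLURM_PROCID (set by srun for each task).
    Falls back to 0 when neither is set, i.e. a single-process run.
    """
    env = os.environ if env is None else env
    for name in ("RANK", "SLURM_PROCID"):
        value = env.get(name)
        if value is not None:
            return int(value)
    return 0


def is_master(env=None):
    """True only for the global rank-0 process."""
    return get_global_rank(env) == 0


# Hypothetical usage:
#   from clearml import Task
#   task = Task.init(...) if is_master() else None
```

Passing a plain dict as env makes the helper easy to check without touching the real environment.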

  				
Posted one year ago by AgitatedDove14

AgitatedDove14 Yes, I have some Logger.current_logger() calls in the model class.

If I turn off logging on non-master nodes with a RANK check, I won't lose training logs from the non-master nodes, will I? (I mean, all training logs are on the master node, aren't they?)

  				
Posted one year ago by SweetShells3

AgitatedDove14 In this case I get AttributeError: 'NoneType' object has no attribute 'report_scalar' on trainer.fit(...) and on Logger.current_logger(). I think non-master processes are trying to log something, but have no Logger instance because they have no Task instance.

What am I supposed to do to log training correctly? Do the logs in the master process include all training history, or do I need to concatenate logs from different nodes somehow?

  				
Posted one year ago by SweetShells3

Yes, they are supposed to be routed there by PyTorch distributed
(and the TensorBoard logs are on the master only anyhow).

  				
Posted one year ago by AgitatedDove14

UPD: If I use --ntasks-per-node=2, then ClearML creates 2 tasks, but I need only 1.

  				
Posted one year ago by SweetShells3

AgitatedDove14 Okay, thank you so much for your help!

  				
Posted one year ago by SweetShells3
