Hi everyone!

I'm trying to run PyTorch Lightning code on SLURM with an srun script like the one in the docs ( https://pytorch-lightning.readthedocs.io/en/1.2.10/clouds/slurm.html ). I use ClearML in the code and want only one ClearML task per SLURM job, so I use parameters like these:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256gb
#SBATCH --gres=gpu:2

But this config somehow runs 1-GPU training instead of 2-GPU training. Can someone advise me on what's going wrong?

Thank you in advance!

  				
Posted 2 years ago by SweetShells3

Answers 7

@AgitatedDove14 Okay, thank you so much for your help!

  				
Posted 2 years ago by SweetShells3

@AgitatedDove14 in this case I get AttributeError: 'NoneType' object has no attribute 'report_scalar' on trainer.fit(...) and on Logger.current_logger(). I think non-master processes are trying to log something but have no Logger instance, because they have no Task instance.

What am I supposed to do to log training correctly? Do the logs in the master process include the whole training history, or do I need to concatenate logs from different nodes somehow?

  				
Posted 2 years ago by SweetShells3

UPD: If I use --ntasks-per-node=2, then ClearML creates 2 tasks, but I need only 1.

  				
Posted 2 years ago by SweetShells3

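Hi @SweetShells3
Try this:

import torch.distributed as dist
from clearml import Task

# Create the ClearML task on the rank-0 ("master") process only
if dist.get_rank() == 0:
    task = Task.init(...)

This makes sure only the "master" process is logged. Alternatively, check the RANK environment variable:

import os

# Default to 0 so a missing RANK variable does not raise a TypeError
if int(os.environ.get('RANK', 0)) == 0:
    task = Task.init(...)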
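
The rank check above can be wrapped in a small helper. This is only a sketch: the function names global_rank and is_master are made up for illustration, and the fallback to SLURM_PROCID (the per-task rank that SLURM exports under srun) is an assumption about how the job is launched, not something confirmed in this thread.

```python
import os


def global_rank(env=None):
    """Return the global rank of this process, or 0 if no rank variable is set.

    Checks RANK first, then SLURM_PROCID (assumed fallback for srun launches).
    """
    env = os.environ if env is None else env
    for name in ("RANK", "SLURM_PROCID"):
        value = env.get(name)
        if value is not None and value != "":
            return int(value)
    return 0


def is_master(env=None):
    """True only for the rank-0 process."""
    return global_rank(env) == 0
```

With this in place, the guard becomes `if is_master(): task = Task.init(...)`, and the same check can be reused wherever the code reports metrics.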

  				
Posted 2 years ago by AgitatedDove14

Yes, they are supposed to be routed there by PyTorch distributed
(and the TB logs are on the master only anyhow).
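
If the model code calls Logger.current_logger() directly, the non-master processes get None back when they have no Task, which matches the AttributeError reported in this thread. A minimal sketch of a guard follows; report_scalar_safely is a hypothetical helper name, and it only assumes that the logger object exposes report_scalar(title, series, value, iteration).

```python
def report_scalar_safely(logger, title, series, value, iteration):
    """Report a scalar if a logger exists; return True when it was reported.

    `logger` is expected to be whatever Logger.current_logger() returned,
    which may be None on processes that never created a Task.
    """
    if logger is None:
        return False
    logger.report_scalar(title=title, series=series, value=value, iteration=iteration)
    return True


class RecordingLogger:
    """Stand-in logger used only to demonstrate the helper without ClearML."""

    def __init__(self):
        self.calls = []

    def report_scalar(self, title, series, value, iteration):
        self.calls.append((title, series, value, iteration))
```

In the model class this would be called as `report_scalar_safely(Logger.current_logger(), "loss", "train", loss_value, step)`, so ranks without a Task skip the report instead of crashing.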

  				
Posted 2 years ago by AgitatedDove14

@AgitatedDove14 Yes, I have some Logger.current_logger() calls in the model class.

If I turn off logging on non-master nodes with a RANK check, I won't lose training logs from non-master nodes (I mean, all training logs are on the master node, aren't they)?

  				
Posted 2 years ago by SweetShells3

"I think non-master processes are trying to log something but have no Logger instance, because they have no Task instance."

Hmm, is your code calling Logger.current_logger() directly?

"Do the logs in the master process include the whole training history, or do I need to concatenate logs from different nodes somehow?"

So the main problem is that you need to pass the task ID that the master node creates to the second node, so it can report to the same Task.
I know that the enterprise version of ClearML supports SLURM and does exactly that (the launching itself is done from the ClearML UI, SLURM does the scheduling, and everything else is taken care of).
Can you think of a way to pass info from the master to the second node? Of course, you can always limit reporting in your code when you are not the master.
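
One possible way to pass the task ID from the master to the other node is a file on a filesystem that all nodes can see. This is a sketch under that assumption (a shared directory is not mentioned in the thread), and publish_task_id and wait_for_task_id are hypothetical helper names. What the non-master process then does with the ID is left open here.

```python
import os
import tempfile
import time


def publish_task_id(path, task_id):
    """Master side: write the task ID atomically so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(task_id)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def wait_for_task_id(path, timeout=60.0, poll_interval=0.5):
    """Non-master side: poll until the file appears, then return the task ID."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as handle:
                task_id = handle.read().strip()
            if task_id:
                return task_id
        except FileNotFoundError:
            pass  # the master has not published the ID yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no task ID found at {path}")
        time.sleep(poll_interval)
```

The master would call `publish_task_id(shared_path, task.id)` right after Task.init(...), and the other ranks would call `wait_for_task_id(shared_path)` before they start reporting. Using a path that includes the SLURM job ID would keep concurrent jobs from reading each other's files.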

  				
Posted 2 years ago by AgitatedDove14
