UPD: If I use --ntasks-per-node=2, ClearML creates 2 tasks, but I need only 1.
Hi @<1569496075083976704:profile|SweetShells3>
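Try this:
    import torch.distributed as dist
    from clearml import Task

    # get_rank() requires the process group to be initialized already
    if dist.get_rank() == 0:
        task = Task.init(...)
This makes sure only the "master" process is logged.
Or:
    import os
    from clearml import Task

    # default to 0 so a missing RANK does not raise a TypeError in int()
    if int(os.environ.get('RANK', 0)) == 0:
        task = Task.init(...)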
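The rank check above can be made a little more tolerant of how the job was launched. This is a minimal sketch, assuming the launcher exports either `RANK` (as torchrun does) or `SLURM_PROCID` (as srun does); the helper names `get_global_rank` and `is_master` are hypothetical and not part of ClearML or PyTorch.

```python
import os


def get_global_rank(env=None):
    """Return this process's global rank, defaulting to 0 when nothing is set."""
    env = os.environ if env is None else env
    # RANK is exported by torchrun; SLURM_PROCID is exported by srun.
    for name in ("RANK", "SLURM_PROCID"):
        value = env.get(name)
        if value is not None and value != "":
            return int(value)
    return 0  # single-process run: treat it as the master


def is_master(env=None):
    """True only for the process that should call Task.init()."""
    return get_global_rank(env) == 0
```

With this, `if is_master(): task = Task.init(...)` should create a single Task per job, whichever of the two variables the launcher provides.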
@<1523701205467926528:profile|AgitatedDove14> in this case I get AttributeError: 'NoneType' object has no attribute 'report_scalar'
on trainer.fit(...) and Logger.current_logger().
I think the non-master processes are trying to log something but have no Logger instance, because they have no Task instance.
What am I supposed to do to log training correctly? Do the logs in the master process include the whole training history, or do I need to concatenate logs from different nodes somehow?
I think the non-master processes are trying to log something but have no Logger instance, because they have no Task instance.
Hmm, is your code calling Logger.current_logger() directly?
Do the logs in the master process include the whole training history, or do I need to concatenate logs from different nodes somehow?
So the main problem is that you need to pass the Task ID that the master node creates to the second node, so it can report to the same Task.
I know that the enterprise version of ClearML supports SLURM and does exactly that (actually the launching itself is done from the ClearML UI, SLURM does the scheduling, and then everything is taken care of).
Can you think of a way to pass info from the master to the second node? Of course, you can always limit reporting in your code when you are not the master.
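One possible way to pass the Task ID from the master to the other node is a small file on storage that both nodes can see. This is a minimal sketch, assuming a shared filesystem (common on SLURM clusters, but not confirmed in this thread); `publish_task_id` and `wait_for_task_id` are hypothetical helpers, and the file path should be unique per job so a stale file from an earlier run is never read.

```python
import os
import tempfile
import time


def publish_task_id(task_id, path):
    """Master side: write the ID to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(task_id)
    # Rename within one directory so readers do not see a half-written file.
    os.replace(tmp_path, path)


def wait_for_task_id(path, timeout=60.0, poll_interval=0.5):
    """Worker side: poll until the master has published the ID, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            with open(path) as handle:
                return handle.read().strip()
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no task ID published at {path}")
        time.sleep(poll_interval)
```

On the master this would be called as `publish_task_id(task.id, path)` right after `Task.init(...)`; the other ranks could then call `Task.get_task(task_id=wait_for_task_id(path))`. Whether reporting to one Task from several processes behaves as expected in your setup is worth verifying before relying on it.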
@<1523701205467926528:profile|AgitatedDove14> Yes, I have some Logger.current_logger() calls in the model class.
If I turn off logging on non-master nodes with a RANK check, I won't lose training logs from the non-master nodes (I mean, all training logs are on the master node, aren't they)?
Yes, they are supposed to be routed there by PyTorch distributed
(and the TensorBoard logs are on the master only anyhow).
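Since the non-master ranks have no Task, direct `Logger.current_logger()` calls in the model class need a guard. This is a minimal sketch of such a guard; `report_scalar_safe` is a hypothetical wrapper, and it only assumes that the logger object, when present, has the `report_scalar(title, series, value, iteration)` method that ClearML's `Logger` provides.

```python
def report_scalar_safe(logger, title, series, value, iteration):
    """Report a scalar if a logger exists; return True when something was reported."""
    if logger is None:
        # Non-master ranks never called Task.init(), so there is nothing to report to.
        return False
    logger.report_scalar(title=title, series=series, value=value, iteration=iteration)
    return True
```

In the model class, a call such as `Logger.current_logger().report_scalar("loss", "train", loss, step)` would become `report_scalar_safe(Logger.current_logger(), "loss", "train", loss, step)`, which is a no-op on ranks without a Task.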
@<1523701205467926528:profile|AgitatedDove14> Okay, thank you so much for your help!