Actually those are "supposed" to be collected automatically by PyTorch and reported by the master node.
Currently we need a barrier to sync all nodes before reporting a scalar, which makes it slower.
This "should" also be part of PyTorch DDP.
It's launched with torchrun
I know there is an effort to integrate with torchrun (the under-the-hood infrastructure), but I'm not sure where it stands...
So are you using torchrun on multiple machines to "launch" the training process?
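For reference, a worker launched by torchrun can tell where it sits in the multi-machine layout from the environment variables torchrun sets, something like this quick sketch:
import os

# torchrun sets these environment variables for every worker process it spawns
local_rank = int(os.environ.get("LOCAL_RANK", 0))   # rank of this worker on its machine
global_rank = int(os.environ.get("RANK", 0))        # rank of this worker across all machines
world_size = int(os.environ.get("WORLD_SIZE", 1))   # total number of workers
node_rank = int(os.environ.get("GROUP_RANK", 0))    # which machine/node this worker runs on
print("node {}: local rank {} (global rank {} of {})".format(node_rank, local_rank, global_rank, world_size))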
I think prefix would be great. It can also make it easier for reporting scalars in general (save the users the need to manually add the rank label).
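e.g. today the rank label has to be added by hand when reporting, something along these lines (a rough sketch with placeholder values):
import os
from clearml import Logger

# without an automatic prefix, the rank has to be baked into the series name by hand
rank = int(os.environ.get("RANK", 0))
loss_value, step = 0.42, 100  # placeholders for whatever is being reported
Logger.current_logger().report_scalar(
    title="loss", series="train_rank_{}".format(rank), value=loss_value, iteration=step)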
So I think this might work (forgive the typos, this is not fully tested 🙂):
from clearml import Task

def get_resource_monitor_cls():
    from clearml.utilities.resource_monitor import ResourceMonitor
    from clearml.config import get_node_count, get_node_id

    class NodeResourceMonitor(ResourceMonitor):
        # prefix the monitoring titles with the node id when running multi-node
        _title_machine = ':monitor:machine_{}'.format(get_node_id()) if get_node_count() else ResourceMonitor._title_machine
        _title_gpu = ':monitor:node{}_gpu'.format(get_node_id()) if get_node_count() else ResourceMonitor._title_gpu

    return NodeResourceMonitor

task = Task.init(..., auto_resource_monitoring=get_resource_monitor_cls())
If it actually works, please PR it 🙂 (it should probably also check that it is being launched with the elastic agent)
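One way that check could look (a rough sketch; launched_with_elastic_agent is a made-up name, and it just looks for the environment variables torchrun / the elastic agent sets for its workers):
import os

def launched_with_elastic_agent():
    # torchrun / torch.distributed.elastic sets these for every worker it launches
    return "TORCHELASTIC_RUN_ID" in os.environ or "GROUP_RANK" in os.environ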