Answered
Hi, I'm training on multi-node, ClearML captures only a single machine's utilization (memory/CPU/etc.). I assume it captures node 0. Is there a way to make it report all nodes?


  
  
Posted one year ago

Answers 9


PyTorch DDP @<1523701070390366208:profile|CostlyOstrich36>

  
  
Posted one year ago

multiple machines and reporting to the same task.

Out of curiosity, how do you launch it on multiple machines?

reporting to the same task.

So the "funny" think is, they all report on on top (overwriting) the other...
In order for them to report individually, it might be that you need multiple Tasks (i.e. one per machine)
Maybe we could somehow have prefix with rank on the cpu/network etc?! or should it be a different "title", wdyt?
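
To make the "one Task per machine" option concrete, here is a minimal sketch (the project/task names and the use of torchrun's LOCAL_RANK/GROUP_RANK environment variables are illustrative assumptions, not an existing ClearML convention):

import os

from clearml import Task

# One Task per machine: only the first process on each node (LOCAL_RANK 0)
# opens a Task, named after the node rank, so the built-in resource
# monitoring reports per node instead of overwriting a single shared Task
if os.environ.get('LOCAL_RANK', '0') == '0':
    node_rank = os.environ.get('GROUP_RANK', '0')  # torchrun's node rank
    task = Task.init(
        project_name='multi-node-example',              # illustrative name
        task_name='training-node-{}'.format(node_rank)  # illustrative name
    )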

  
  
Posted one year ago

[image]

  
  
Posted one year ago

It's launched with torchrun: https://pytorch.org/docs/stable/elastic/run.html

I think a prefix would be great. It can also make reporting scalars easier in general (saving users the need to manually add the rank label). It would also be great to support averaging across all nodes at the UI level; currently we need a barrier to sync all nodes before reporting a scalar, which makes it slower.
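
For context, the barrier-and-sync pattern referred to above usually looks something like this (a sketch only; it assumes torch.distributed is initialized with a backend that supports CPU tensors such as Gloo, and the 'metrics' title is illustrative):

import torch
import torch.distributed as dist

from clearml import Task

def report_averaged_scalar(series, value, iteration):
    # all_reduce implicitly synchronizes every rank, which is the extra cost
    # mentioned above: no rank can continue until all of them have reported
    tensor = torch.tensor([float(value)])
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    tensor /= dist.get_world_size()
    if dist.get_rank() == 0:  # only the master node writes the averaged scalar
        Task.current_task().get_logger().report_scalar(
            title='metrics', series=series, value=tensor.item(), iteration=iteration)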

  
  
Posted one year ago

Hi @<1558624430622511104:profile|PanickyBee11>, how are you doing the multi-node training?

  
  
Posted one year ago

I think a prefix would be great. It can also make reporting scalars easier in general

Actually those are "supposed" to be collected automatically by PyTorch and reported by the master node.

currently we need a barrier to sync all nodes before reporting a scalar, which makes it slower.

Also "should" be part of PyTorch DDP.

It's launched with torchrun

I know there is an effort to integrate with torchrun (the under-the-hood infrastructure); I'm not sure where it stands...

So are you using None on multiple machines to "launch" the training process?

I think a prefix would be great. It can also make reporting scalars easier in general (saving users the need to manually add the rank label).

So I think this might work (forgive the typos, this is not fully tested 🙂):

def get_resource_monitor_cls():
    from clearml.utilities.resource_monitor import ResourceMonitor
    from clearml.config import get_node_count, get_node_id

    class NodeResourceMonitor(ResourceMonitor):
        # Prefix the monitoring titles with the node id so every node
        # reports under its own title instead of overwriting the others
        _title_machine = ':monitor:machine_{}'.format(get_node_id()) if get_node_count() else ResourceMonitor._title_machine
        _title_gpu = ':monitor:node{}_gpu'.format(get_node_id()) if get_node_count() else ResourceMonitor._title_gpu

    return NodeResourceMonitor

task = Task.init(..., auto_resource_monitoring=get_resource_monitor_cls())

If it actually works, please PR it 🙂 (it probably should also check that it is being launched with the elastic agent).
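
On the "check that it is being launched with the elastic agent" note, one possible guard is a sketch like the following, reusing get_resource_monitor_cls() from above (it keys off environment variables torchrun documents, e.g. TORCHELASTIC_RUN_ID and GROUP_RANK; falling back to the stock ResourceMonitor otherwise is an assumption):

import os

from clearml.utilities.resource_monitor import ResourceMonitor

def pick_resource_monitor_cls():
    # torchrun / torch.distributed.elastic exports these to every worker;
    # use the per-node monitor only when they are present
    if 'TORCHELASTIC_RUN_ID' in os.environ and 'GROUP_RANK' in os.environ:
        return get_resource_monitor_cls()
    return ResourceMonitor

task = Task.init(..., auto_resource_monitoring=pick_resource_monitor_cls())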

  
  
Posted one year ago

@<1558624430622511104:profile|PanickyBee11> how are you launching the code on multiple machines ?
are they all reporting to the same Task?

  
  
Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> yes & yes, multiple machines and reporting to the same task.

  
  
Posted one year ago

Will give it a try tomorrow, thanks!

  
  
Posted one year ago