Hi, I'm training on multiple nodes, and ClearML captures the resource utilization (memory/CPU/etc.) of only a single machine. I assume it captures node 0. Is there a way to make it report all nodes?
The job is launched with torchrun: https://pytorch.org/docs/stable/elastic/run.html
I think a prefix would be great. It would also make reporting scalars easier in general, since users would no longer need to add the rank label manually. It would also be great to support averaging across all nodes at the UI level. Currently we need a barrier to sync all nodes before reporting a scalar, which makes training slower.
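For reference, the manual rank labeling mentioned above could look roughly like the sketch below. It assumes the job is launched with torchrun, which sets the `RANK` and `GROUP_RANK` environment variables, and that each process can obtain a ClearML `Logger`. The helper names `node_label` and `report_with_rank` are hypothetical, not part of ClearML.

```python
import os


def node_label(env=None):
    """Build a label such as 'node1/rank5' from torchrun's environment variables."""
    # torchrun sets GROUP_RANK (node index) and RANK (global worker index).
    # Falling back to "0" for a single-process run is an assumption of this sketch.
    env = os.environ if env is None else env
    node = env.get("GROUP_RANK", "0")
    rank = env.get("RANK", "0")
    return f"node{node}/rank{rank}"


def report_with_rank(logger, title, series, value, iteration, env=None):
    """Report a scalar with the node/rank label prefixed to the series name."""
    # `logger` is expected to expose ClearML's Logger.report_scalar signature.
    logger.report_scalar(
        title=title,
        series=f"{node_label(env)}/{series}",
        value=value,
        iteration=iteration,
    )


# Hypothetical usage inside each worker process:
# from clearml import Task
# logger = Task.current_task().get_logger()
# report_with_rank(logger, "gpu_memory_gb", "allocated", 12.3, iteration=100)
```

Because every worker writes to its own series, no barrier is needed before reporting. The trade-off is that any cross-node average still has to be computed separately.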