
I guess they don't. Is there an easy way to add some callbacks to the HF Trainer for reporting extra info?
I mean that the HF Trainer by default reports a single grad_norm scalar for the whole model to ClearML. I wonder whether I can extend this to report grad_norm per layer.
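A minimal sketch of how this could work with a custom `TrainerCallback`, assuming `transformers` and `clearml` are installed. The callback name is illustrative; the `on_pre_optimizer_step` hook is used because gradients are still populated there, before `optimizer.step()` and `zero_grad()` run:

```python
# Hedged sketch: report each parameter's gradient L2 norm as its own scalar.
# GradNormPerLayerCallback is a hypothetical name, not part of transformers.
from transformers import TrainerCallback

try:
    from clearml import Logger
except ImportError:
    Logger = None  # allows importing this module without clearml installed


class GradNormPerLayerCallback(TrainerCallback):
    """Report per-parameter grad norms to ClearML each optimizer step."""

    def on_pre_optimizer_step(self, args, state, control, model=None, **kwargs):
        if Logger is None or model is None:
            return
        logger = Logger.current_logger()
        if logger is None:  # no active ClearML task
            return
        for name, param in model.named_parameters():
            if param.grad is not None:
                logger.report_scalar(
                    title="grad_norm_per_layer",
                    series=name,
                    value=param.grad.detach().norm(2).item(),
                    iteration=state.global_step,
                )
```

You would then register it with `trainer.add_callback(GradNormPerLayerCallback())`. Note that reporting one series per parameter can produce a lot of scalars for large models, so filtering to top-level modules may be worthwhile.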
It's launched with torchrun https://pytorch.org/docs/stable/elastic/run.html
I think a prefix would be great. It would also make reporting scalars easier in general (saving users the need to manually add the rank label). It would also be great to support averaging across all nodes at the UI level; currently we need a barrier to sync all nodes before reporting a scalar, which slows things down.
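To illustrate the two workarounds mentioned above, here is a hedged sketch: manually prefixing the series with the worker's rank (torchrun sets the `RANK` environment variable for each process), and averaging a scalar across ranks with an `all_reduce`, which is the sync point that slows reporting down. Both helper names are illustrative:

```python
import os


def report_with_rank(logger, title, series, value, iteration):
    """Report a scalar with the series prefixed by this worker's rank.

    torchrun sets RANK in each worker's environment; we fall back to 0
    for single-process runs.
    """
    rank = int(os.environ.get("RANK", 0))
    logger.report_scalar(
        title=title,
        series=f"rank_{rank}/{series}",
        value=value,
        iteration=iteration,
    )


def report_mean_across_ranks(logger, title, series, value, iteration):
    """Average a scalar over all ranks, then report from rank 0 only.

    The all_reduce forces every rank to synchronize, which is the
    slowdown mentioned above. Requires an initialized process group.
    """
    import torch
    import torch.distributed as dist

    t = torch.tensor(float(value))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    t /= dist.get_world_size()
    if dist.get_rank() == 0:
        logger.report_scalar(
            title=title, series=series, value=t.item(), iteration=iteration
        )
```

Having the SDK apply the rank prefix automatically (and the UI compute the cross-node average) would remove the need for both helpers.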
PyTorch DDP @<1523701070390366208:profile|CostlyOstrich36>
@<1523701205467926528:profile|AgitatedDove14> yes & yes, multiple machines and reporting to the same task.
Will give it a try tomorrow, thanks!
I do see these metrics in the JSON file, but they are not shown in the UI.