
I guess they don't. Is there an easy way to add some callbacks to the HF Trainer for reporting extra info?
I mean that the HF Trainer by default reports a single grad_norm scalar for the whole model to ClearML. I wonder whether I can extend this to report grad_norm per layer.
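A minimal sketch of how this could work with a custom `TrainerCallback`, assuming `transformers` and `clearml` are installed. The callback name is illustrative; the `on_pre_optimizer_step` hook is used because gradients are still populated there, before `optimizer.step()` and `zero_grad()` run:

```python
# Hedged sketch: report each parameter's gradient L2 norm as its own scalar.
# GradNormPerLayerCallback is a hypothetical name, not part of transformers.
from transformers import TrainerCallback

try:
    from clearml import Logger
except ImportError:
    Logger = None  # allows importing this module without clearml installed


class GradNormPerLayerCallback(TrainerCallback):
    """Report per-parameter grad norms to ClearML each optimizer step."""

    def on_pre_optimizer_step(self, args, state, control, model=None, **kwargs):
        if Logger is None or model is None:
            return
        logger = Logger.current_logger()
        if logger is None:  # no active ClearML task
            return
        for name, param in model.named_parameters():
            if param.grad is not None:
                logger.report_scalar(
                    title="grad_norm_per_layer",
                    series=name,
                    value=param.grad.detach().norm(2).item(),
                    iteration=state.global_step,
                )
```

You would then register it with `trainer.add_callback(GradNormPerLayerCallback())`. Note that reporting one series per parameter can produce a lot of scalars for large models, so filtering to top-level modules may be worthwhile.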
It's launched with torchrun https://pytorch.org/docs/stable/elastic/run.html
I think a prefix would be great. It would also make reporting scalars easier in general (saving users the need to manually add the rank label). It would also be great to support averaging across all nodes at the UI level; currently we need a barrier to sync all nodes before reporting a scalar, which slows things down.
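To illustrate the two workarounds mentioned above, here is a hedged sketch: manually prefixing the series with the worker's rank (torchrun sets the `RANK` environment variable for each process), and averaging a scalar across ranks with an `all_reduce`, which is the sync point that slows reporting down. Both helper names are illustrative:

```python
import os


def report_with_rank(logger, title, series, value, iteration):
    """Report a scalar with the series prefixed by this worker's rank.

    torchrun sets RANK in each worker's environment; we fall back to 0
    for single-process runs.
    """
    rank = int(os.environ.get("RANK", 0))
    logger.report_scalar(
        title=title,
        series=f"rank_{rank}/{series}",
        value=value,
        iteration=iteration,
    )


def report_mean_across_ranks(logger, title, series, value, iteration):
    """Average a scalar over all ranks, then report from rank 0 only.

    The all_reduce forces every rank to synchronize, which is the
    slowdown mentioned above. Requires an initialized process group.
    """
    import torch
    import torch.distributed as dist

    t = torch.tensor(float(value))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    t /= dist.get_world_size()
    if dist.get_rank() == 0:
        logger.report_scalar(
            title=title, series=series, value=t.item(), iteration=iteration
        )
```

Having the SDK apply the rank prefix automatically (and the UI compute the cross-node average) would remove the need for both helpers.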
PyTorch DDP @<1523701070390366208:profile|CostlyOstrich36>
@<1523701205467926528:profile|AgitatedDove14> yes & yes, multiple machines and reporting to the same task.
Will give it a try tomorrow, thanks!
I do see these metrics in the JSON file, but they are not shown in the UI.