
I am using PyTorch Lightning with the DDP accelerator on 4 GPUs, and I found that every checkpoint is recorded 4 times in the web UI, each with a different ID. One is on default_output_uri (s3://...) and the other three are on file:///... . Is this a feature or a bug?

Posted 3 years ago

Answers 7

DefeatedOstrich93, can you verify that Lightning actually stored the checkpoint only once?

Posted 3 years ago

Hi! It looks like all the processes are calling torch.save, so the web UI is probably reflecting what Lightning did behind the curtain. It's definitely not a feature, though. Would you mind reporting this on our GitHub repo? Also, are you getting duplicate experiments?
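If the duplicates come from every DDP process calling torch.save, as suggested above, a common mitigation is to let only global rank 0 write the checkpoint. Below is a minimal sketch, not ClearML's or Lightning's own code. It assumes the launcher exports a RANK (or, on a single node, LOCAL_RANK) environment variable, as torchrun does. The helper names get_rank, pickle_save and save_checkpoint are hypothetical, and pickle stands in for torch.save so the example runs without PyTorch installed. Inside a LightningModule you would more likely check trainer.is_global_zero or use Lightning's rank_zero_only decorator.

```python
import os
import pickle
import tempfile
from typing import Any, Callable, Optional


def get_rank() -> int:
    """Best-effort global rank lookup (assumption: launcher sets RANK or LOCAL_RANK)."""
    for key in ("RANK", "LOCAL_RANK"):
        value = os.environ.get(key)
        if value is not None:
            return int(value)
    return 0  # no distributed launcher detected: treat this as the only process


def pickle_save(obj: Any, path: str) -> None:
    """Stand-in for torch.save so the sketch has no PyTorch dependency."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def save_checkpoint(
    state: Any,
    path: str,
    rank: Optional[int] = None,
    saver: Callable[[Any, str], None] = pickle_save,
) -> bool:
    """Write the checkpoint only from rank 0; return True if this process saved it."""
    if rank is None:
        rank = get_rank()
    if rank != 0:
        return False  # non-zero ranks skip the write, so only one artifact is produced
    saver(state, path)
    return True


if __name__ == "__main__":
    # Simulate 4 DDP ranks reaching the same save call in one process.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "epoch=0.ckpt")
        saved = [save_checkpoint({"epoch": 0}, path, rank=r) for r in range(4)]
        print(saved)  # [True, False, False, False]
```

With this guard, each checkpoint is written by exactly one process, so a logger that hooks the save call would see one save instead of one per GPU.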

Posted 3 years ago

No, I'm not getting duplicate experiments. Which repo should I report it to: clearml-agent, clearml, or clearml-server?

Posted 3 years ago

Thanks for the report!

Posted 3 years ago

Thanks, ClearML!

Posted 3 years ago
