ClearML With PyTorch-Based Distributed Training
Hi everyone! Is the combination of ClearML with
Hi @AgitatedDove14, so I’ve managed to reproduce a bit more.
When I run very basic code via torchrun or torch.distributed.run, multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated; instead, the task for each rank reports its own.
If, however, I branch out via torch.multiprocessing as shown below, everything works as expected. The “script path” shows just the single Python script, and all logs and scalars from all ranks are aggregated into a single task.
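import torch
import torch.multiprocessing as mp
from clearml import Task
# demo_basic(rank, world_size) is the per-rank function from the PyTorch tutorial
task = Task.init("Glass-ClearML Demo", "Distributed basic mp.spawn, simple model, v3")
n_gpus = torch.cuda.device_count()
world_size = n_gpus
mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)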
All the code is taken from the PyTorch tutorial; I just added a ClearML Task to it, as shown above.
ClearML version is 1.7.1
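For the torchrun case, here is a minimal sketch of one possible workaround, not a confirmed fix. torchrun starts every rank as its own Python process and sets the RANK environment variable in each, so a module-level Task.init call runs once per rank, which may be why several tasks show up. The helper names below (get_rank, is_main_process, init_task_on_main_rank) are hypothetical and not part of ClearML or PyTorch. The sketch only calls Task.init on rank 0.

```python
import os


def get_rank(env=None):
    """Return the global rank set by torchrun, or 0 if RANK is not set."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def is_main_process(env=None):
    """True only for the process whose global rank is 0."""
    return get_rank(env) == 0


def init_task_on_main_rank(project_name, task_name, env=None):
    """Create a ClearML task on rank 0 only; every other rank gets None.

    Hypothetical helper: assumes the script is launched with torchrun,
    which exports RANK for each worker process.
    """
    if not is_main_process(env):
        return None
    # Imported lazily so that non-zero ranks never touch ClearML.
    from clearml import Task
    return Task.init(project_name=project_name, task_name=task_name)
```

Note that this is not equivalent to the mp.spawn behaviour described above: ranks other than 0 hold no task here, so their logs and scalars would presumably not be reported at all unless they are gathered onto rank 0 first.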