ClearML With PyTorch-Based Distributed Training
Hi @<1523701205467926528:profile|AgitatedDove14>, so I’ve managed to reproduce a bit more.
When I run very basic code via torchrun or torch.distributed.run, multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated; instead, each rank’s task reports its own.
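For context, the torchrun variant I mean is roughly the following minimal sketch (file name, task name, and the training placeholder are illustrative, not from my actual script; launched with e.g. torchrun --nproc_per_node=2 ddp_basic.py):

# ddp_basic.py -- minimal sketch of the torchrun case
import torch
import torch.distributed as dist
from clearml import Task

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for each spawned process
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    # every rank calls Task.init in its own process, which is where the separate tasks appear
    task = Task.init("Glass-ClearML Demo", "Distributed basic torchrun, simple model, v3")
    # ... DDP model setup and training loop from the tutorial ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()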
If, however, I branch out via torch.multiprocessing like below, everything works as expected: the “script path” just shows the single Python script, and all logs and scalars from all ranks are aggregated into a single task.
from clearml import Task
import torch
import torch.multiprocessing as mp

task = Task.init("Glass-ClearML Demo", "Distributed basic mp.spawn, simple model, v3")
n_gpus = torch.cuda.device_count()
world_size = n_gpus  # one process per GPU
mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)
All code is taken from the PyTorch DDP tutorial; I just add a ClearML Task to it as shown above.
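For completeness, demo_basic is the per-rank training function from that tutorial. This is a condensed sketch, not a verbatim copy; the port number and layer sizes follow the tutorial but are illustrative here:

# condensed sketch of demo_basic from the PyTorch DDP tutorial
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_basic(rank, world_size):
    # each spawned process joins the same process group
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # simple model wrapped in DDP, one GPU per rank
    model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 5)).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    loss_fn(outputs, torch.randn(20, 5).to(rank)).backward()
    optimizer.step()

    dist.destroy_process_group()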
ClearML version is 1.7.1