Hi, if I am starting my training with the following command:
I am actually calling the following later in the start_training function:

    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training_func)

So my backend should be nccl and not gloo, right? I'm not sure how important it is; I read in https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training.
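A minimal sketch of that backend choice, assuming the rule from the PyTorch docs (nccl for GPU, gloo for CPU). The choose_backend helper is hypothetical, not part of ignite or clearml; the idist.Parallel usage in the comment mirrors the snippet above.

```python
def choose_backend(cuda_available: bool) -> str:
    """Pick nccl for distributed GPU runs, gloo for CPU-only runs,
    following the recommendation in the PyTorch distributed docs."""
    return "nccl" if cuda_available else "gloo"

# With PyTorch Ignite installed, the launcher would then look like:
#
#   import torch
#   import ignite.distributed as idist
#
#   backend = choose_backend(torch.cuda.is_available())
#   with idist.Parallel(backend=backend) as parallel:
#       parallel.run(training_func)

print(choose_backend(True))   # -> nccl
print(choose_backend(False))  # -> gloo
```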
I will try with clearml==1.1.5rc2