Hi, if I am starting my training with the following command:
I am actually calling the following later in the start_training function:

    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training_func)

So my backend should be nccl and not gloo, right? I'm not sure how important it is; I read in https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training.
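A minimal sketch of that backend choice, assuming the rule from the PyTorch docs (nccl for GPU, gloo for CPU). The choose_backend helper is hypothetical, not part of ignite or clearml; the idist.Parallel usage in the comment mirrors the snippet above.

```python
def choose_backend(cuda_available: bool) -> str:
    """Pick nccl for distributed GPU runs, gloo for CPU-only runs,
    following the recommendation in the PyTorch distributed docs."""
    return "nccl" if cuda_available else "gloo"

# With PyTorch Ignite installed, the launcher would then look like:
#
#   import torch
#   import ignite.distributed as idist
#
#   backend = choose_backend(torch.cuda.is_available())
#   with idist.Parallel(backend=backend) as parallel:
#       parallel.run(training_func)

print(choose_backend(True))   # -> nccl
print(choose_backend(False))  # -> gloo
```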
I will try with clearml==1.1.5rc2