Hi, if I am starting my training with the following command:

Later, in the start_training function, I am actually calling the following:
with idist.Parallel(backend="nccl") as parallel:
    parallel.run(training_func)
So my backend should be nccl and not gloo, right? I am not sure how important it is, but I read in https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training.
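To make the rule of thumb from the PyTorch docs concrete, here is a minimal sketch of how the backend could be picked at runtime instead of being hard-coded. The helper names choose_backend and detect_backend are hypothetical (they are not part of ignite, PyTorch, or ClearML); the only library calls assumed are torch.cuda.is_available(), torch.distributed.is_available() and torch.distributed.is_nccl_available().

```python
def choose_backend(use_gpu: bool, nccl_available: bool = True) -> str:
    """Hypothetical helper: nccl for GPU training, gloo otherwise."""
    # NCCL only makes sense when GPUs are used and PyTorch was built with it.
    if use_gpu and nccl_available:
        return "nccl"
    # gloo is the fallback for CPU training (or when NCCL is missing).
    return "gloo"


def detect_backend() -> str:
    """Hypothetical helper: ask PyTorch what is available, if it is installed."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # Without PyTorch there is nothing to query; assume a CPU-only setup.
        return choose_backend(use_gpu=False)
    nccl_ok = dist.is_available() and dist.is_nccl_available()
    return choose_backend(torch.cuda.is_available(), nccl_ok)


if __name__ == "__main__":
    # The result could then be passed on, e.g. idist.Parallel(backend=detect_backend())
    print(detect_backend())
```

On a machine with CUDA GPUs and an NCCL-enabled PyTorch build this should resolve to nccl, which matches the backend used in the snippet above; on a CPU-only machine it falls back to gloo.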
I will try with clearml==1.1.5rc2

Posted 3 years ago