Hi @<1578555761724755968:profile|GrievingKoala83>! Can you share the logs of all the tasks after setting NCCL_DEBUG=INFO? Also, did it work for you 5 months ago because you were on a different clearml version? If it works with another version, can you share that version number?
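In case it helps, here is a minimal sketch of one way to turn on that logging from inside the training script. The helper name `enable_nccl_debug` is made up for illustration, and `NCCL_DEBUG_SUBSYS` is an optional extra that was not part of the request above. I'm assuming the call sits at the very top of the script, before the process group is created, so that every node and every spawned worker inherits the variables.

```python
import os


def enable_nccl_debug(subsystems="INIT,NET"):
    """Set NCCL logging variables for this process and any children it spawns.

    Hypothetical helper: call it before torch.distributed / Lightning
    initializes the NCCL process group, otherwise NCCL will not pick it up.
    """
    # Verbose NCCL logging, as suggested by the error message itself.
    os.environ["NCCL_DEBUG"] = "INFO"
    # Optional (assumption): limit output to initialization and networking,
    # which is where socket connection failures are reported.
    os.environ["NCCL_DEBUG_SUBSYS"] = subsystems
    return {key: os.environ[key] for key in ("NCCL_DEBUG", "NCCL_DEBUG_SUBSYS")}


settings = enable_nccl_debug()
print(settings)
```

Exporting `NCCL_DEBUG=INFO` in the environment of each node should have the same effect; the point is only that it has to be set on all of them, not just on the master task.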
Hi everyone! I'm trying to use task.launch_multi_node(nodes, devices=gpus, hide_children=True) together with pytorch-lightning. I am using the latest version of clearml, 1.16.5. If I specify DDPStrategy(process_group_backend="nccl") as the strategy and set nodes>=2, the following error occurs:
[rank3]: work = default_pg.broadcast([tensor], opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank3]: Last error:
[rank3]: socketStartConnect: Connect to 10.217.6.2<33411> failed : Software caused connection abort
A single node with the nccl backend works, and multiple nodes with the gloo backend also work. I did not get this error 5 months ago.