Unanswered
Hi! I'm running launch_multi_node with PyTorch Lightning
@<1523701435869433856:profile|SmugDolphin23> I set os.environ["NCCL_SOCKET_IFNAME"] and managed to run on NCCL.
But it seems that the workaround you suggested does not run 2 processes per node on 2 nodes, but 4 processes on 4 different nodes:
current_conf = task.launch_multi_node(args.nodes*args.gpus)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // args.gpus)
os.environ["LOCAL_RANK"] = str(current_conf["node_rank"] % args.gpus)
And when I set args.nodes=2, args.gpus=2, I have 4 tasks:
- first host, global rank = 0, local rank = 0
- second host, global rank = 1, local rank = 1
- third host, global rank = 2, local rank = 0
- fourth host, global rank = 3, local rank = 1
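To make the rank arithmetic concrete, here is a minimal sketch in plain Python (no ClearML or Lightning calls; `derive_ranks` is a hypothetical helper name used only for illustration) of what the two assignments compute for args.nodes=2, args.gpus=2:

```python
def derive_ranks(global_rank, gpus_per_node):
    """Split a flat rank into (node_rank, local_rank).

    Mirrors the `// args.gpus` and `% args.gpus` expressions in the snippet above.
    """
    return divmod(global_rank, gpus_per_node)


nodes, gpus = 2, 2  # same values as args.nodes and args.gpus in the question

for global_rank in range(nodes * gpus):
    node_rank, local_rank = derive_ranks(global_rank, gpus)
    print(f"global rank = {global_rank}, "
          f"NODE_RANK = {node_rank}, LOCAL_RANK = {local_rank}")
```

So the arithmetic produces the labels I expected (two node ranks with local ranks 0 and 1 each), but each of the 4 values still ends up in a separate task on its own host.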
How do I fix this?
5 months ago