Hi Everyone! I'M Trying To Use

Answered

Hi everyone! I'm trying to use task.launch_multi_node(nodes, devices=gpus, hide_children=True) in conjunction with pytorch-ligtning. I am using the latest version of clearml - 1.16.5. If I specify DDPStrategy(process_group_backend="nccl") as the strategy and set nodes>=2 , then an error occurs

[rank3]:     work = default_pg.broadcast([tensor], opts)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank3]: Last error:
[rank3]: socketStartConnect: Connect to 10.217.6.2<33411> failed : Software caused connection abort

One node and the nccl strategy works, the gloo strategy and several nodes also work. I did not have such an error 5 months ago.

  				
Posted 
	3 months ago

					More  		
  Report
		
					GrievingKoala83
				
					0
					 × 1

Votes Newest

Answers 5

SmugDolphin23 Everything worked after setting the variables: --env NCCL_IB_DISABLE=1 --env NCCL_SOCKET_IFNAME=ens192 --env NCCL_P2P_DISABLE=1. But previously, these variables were not required for a successful launch. When I run ddp training with two nodes , everything works for me now. But as soon as I increase their number ( nodes > 2 ), I get the following error.

Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.11/code/lightning_ddp_rc.py", line 104, in <module>
    config = task.launch_multi_node(nodes, devices=gpus, hide_children=True, wait=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 2195, in launch_multi_node
    Task.enqueue(node, queue_id=self.data.execution.queue)
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 1557, in enqueue
    raise exception
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 1548, in enqueue
    res = cls._send(session=session, req=req)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/backend_interface/base.py", line 107, in _send
    raise SendError(res, error_msg)
clearml.backend_interface.session.SendError: Action failed <400/706: tasks.enqueue/v1.0 (Failed adding task to queue since task is already queued: task=88808574c7c648ac97bd18303c230710)> (queue=1f0eee180f3d43ddbb432badf328e85b, task=88808574c7c648ac97bd18303c230710, verify_watched_queue=False)
2024-12-04 10:42:25
Process failed, exit code 1

  				
Posted 
	3 months ago

					More  		
  Report
		
					GrievingKoala83
				
					0
					 × 1

SmugDolphin23 It is possible to request up to 5 workers in the toy example with Feed Forward and MNIST, BUT it is not possible to request more than 2 workers on a real large model

  				
Posted 
	3 months ago

					More  		
  Report
		
					GrievingKoala83
				
					0
					 × 1

Hi GrievingKoala83 ! The only way I see this error appearing is:

your process gets forked while launch_multi_node is called
there has been a network error when receiving the response to Task.enqueue, then the call has been retried, resulting in this errorCan you verify one or the other?

  				
Posted 
	3 months ago

					More  		
  Report
		
					SmugDolphin23
				
					0

SmugDolphin23 This error occurs when a secondary task is created with launch_multi_node. And this error disappears when I add the reuse_last_task_id=False flag when initializing the task. But now I have a new problem. I can't request more than 2 nodes. The training logs freezes after several iterations of first epoch with three workers. And if i request four workers i get this error:

DEBUG Epoch 0:   8%|▊         | 200/2484 [04:43<53:55,  0.71it/s, v_num=0]

DEBUG [rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3532, OpType=ALLREDUCE, NumelIn=9445896, NumelOut=9445896, Timeout(ms)=1800000) ran for 1800574 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E ProcessGroupNCCL.cpp:523] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3532, OpType=ALLREDUCE, NumelIn=9445896, NumelOut=9445896, Timeout(ms)=1800000) ran for 1800625 milliseconds before timing out.
[rank4]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3532, OpType=ALLREDUCE, NumelIn=9445896, NumelOut=9445896, Timeout(ms)=1800000) ran for 1800574 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f785391dd87 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7854a98f66 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7854a9c4bd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7854a9d0b9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f78a251fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f78a4172609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f78a42ac133 in /lib/x86_64-linux-gnu/libc.so.6)

[rank4]:[E ProcessGroupNCCL.cpp:1182] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3532, OpType=ALLREDUCE, NumelIn=9445896, NumelOut=9445896, Timeout(ms)=1800000) ran for 1800625 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55097a1d87 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f550a91cf66 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f550a9204bd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f550a9210b9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f55583a3df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f5559ff6609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f555a130133 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::DistBackendErrorc10::DistBackendError'
'
  what():    what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3532, OpType=ALLREDUCE, NumelIn=9445896, NumelOut=9445896, Timeout(ms)=1800000) ran for 1800574 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f785391dd87 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7854a98f66 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7854a9c4bd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7854a9d0b9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f78a251fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f78a4172609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f78a42ac133 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f785391dd87 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdcc083 (0x7f78547f5083 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f78a251fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f78a4172609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f78a42ac133 in /lib/x86_64-linux-gnu/libc.so.6)
[Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3532, OpType=ALLREDUCE, NumelIn=9445896, NumelOut=9445896, Timeout(ms)=1800000) ran for 1800625 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55097a1d87 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f550a91cf66 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f550a9204bd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f550a9210b9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f55583a3df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f5559ff6609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f555a130133 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55097a1d87 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdcc083 (0x7f550a679083 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f55583a3df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f5559ff6609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f555a130133 in /lib/x86_64-linux-gnu/libc.so.6)


[rank: 1] Child process with PID 853 terminated with code -6. Forcefully terminating all other processes to avoid zombies 🧟

  				
Posted 
	3 months ago

					More  		
  Report
		
					GrievingKoala83
				
					0
					 × 1

Hi GrievingKoala83 ! Can you share the logs after setting NCCL_DEBUG=INFO of all the tasks? Also, did it work for you 5 months ago because you were on another clearml version? If it works with another version, can you share that version number?

  				
Posted 
	3 months ago

					More  		
  Report
		
					SmugDolphin23
				
					0

Write your answer

338 Views

5 Answers

3 months ago