Hi Everyone! I'M Trying To Use

Answered

Hi everyone! I'm trying to use task.launch_multi_node(nodes, devices=gpus, hide_children=True) in conjunction with pytorch-ligtning. I am using the latest version of clearml - 1.16.5. If I specify DDPStrategy(process_group_backend="nccl") as the strategy and set nodes>=2 , then an error occurs

[rank3]:     work = default_pg.broadcast([tensor], opts)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank3]: Last error:
[rank3]: socketStartConnect: Connect to 10.217.6.2<33411> failed : Software caused connection abort

One node and the nccl strategy works, the gloo strategy and several nodes also work. I did not have such an error 5 months ago.

  				
Posted 
	one day ago

					More
				  		
  Report
		
					GrievingKoala83
				
					0
					 × 1

Votes Newest

Answers 3

Hi @<1578555761724755968:profile|GrievingKoala83> ! Can you share the logs after setting NCCL_DEBUG=INFO of all the tasks? Also, did it work for you 5 months ago because you were on another clearml version? If it works with another version, can you share that version number?

  				
Posted 
	one day ago

					More
				  		
  Report
		
					SmugDolphin23
				
					0

@<1523701435869433856:profile|SmugDolphin23> Everything worked after setting the variables: --env NCCL_IB_DISABLE=1 --env NCCL_SOCKET_IFNAME=ens192 --env NCCL_P2P_DISABLE=1. But previously, these variables were not required for a successful launch. When I run ddp training with two nodes , everything works for me now. But as soon as I increase their number ( nodes > 2 ), I get the following error.

Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.11/code/lightning_ddp_rc.py", line 104, in <module>
    config = task.launch_multi_node(nodes, devices=gpus, hide_children=True, wait=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 2195, in launch_multi_node
    Task.enqueue(node, queue_id=self.data.execution.queue)
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 1557, in enqueue
    raise exception
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 1548, in enqueue
    res = cls._send(session=session, req=req)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/backend_interface/base.py", line 107, in _send
    raise SendError(res, error_msg)
clearml.backend_interface.session.SendError: Action failed <400/706: tasks.enqueue/v1.0 (Failed adding task to queue since task is already queued: task=88808574c7c648ac97bd18303c230710)> (queue=1f0eee180f3d43ddbb432badf328e85b, task=88808574c7c648ac97bd18303c230710, verify_watched_queue=False)
2024-12-04 10:42:25
Process failed, exit code 1

  				
Posted 
	18 hours ago

					More
				  		
  Report
		
					GrievingKoala83
				
					0
					 × 1

Hi @<1578555761724755968:profile|GrievingKoala83> ! The only way I see this error appearing is:

your process gets forked while launch_multi_node is called
there has been a network error when receiving the response to Task.enqueue, then the call has been retried, resulting in this errorCan you verify one or the other?

  				
Posted 
	15 hours ago

					More
				  		
  Report
		
					SmugDolphin23
				
					0

Write your answer

5 Views

3 Answers

one day ago

2 hours ago