Answered

Hi everyone! I'm trying to use task.launch_multi_node(nodes, devices=gpus, hide_children=True) in conjunction with pytorch-lightning. I am using the latest version of clearml, 1.16.5. If I specify DDPStrategy(process_group_backend="nccl") as the strategy and set nodes >= 2, then an error occurs:

[rank3]:     work = default_pg.broadcast([tensor], opts)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank3]: Last error:
[rank3]: socketStartConnect: Connect to 10.217.6.2<33411> failed : Software caused connection abort

With one node, the nccl strategy works; the gloo strategy also works with several nodes. I did not have this error 5 months ago.
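
Roughly, the setup looks like this (a minimal sketch; the project/task names, node/GPU counts, and the commented-out model/datamodule are illustrative placeholders, not the original script):

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy
from clearml import Task

nodes, gpus = 2, 4  # placeholders: 2 nodes x 4 GPUs per node

task = Task.init(project_name="multi-node", task_name="lightning-ddp")  # hypothetical names
# Enqueues copies of the task for the other nodes and returns per-node rank info;
# the distributed environment (master address/port, node rank) is expected to be
# set up by this call before the Trainer starts.
config = task.launch_multi_node(nodes, devices=gpus, hide_children=True)

trainer = pl.Trainer(
    num_nodes=nodes,
    devices=gpus,
    accelerator="gpu",
    strategy=DDPStrategy(process_group_backend="nccl"),  # "gloo" reportedly works, "nccl" fails with nodes >= 2
)
# trainer.fit(model, datamodule=dm)  # model/dm are placeholders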

  
  
Posted one day ago

Answers 3


Hi @<1578555761724755968:profile|GrievingKoala83> ! The only way I see this error appearing is:

  • your process gets forked while launch_multi_node is called
  • there has been a network error when receiving the response to Task.enqueue, and the call was then retried, resulting in this error
Can you verify one or the other? A simple check for the first case is sketched below.
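
A minimal way to check the fork hypothesis (launch_once and the startup-PID guard are illustrative, not ClearML API): record the PID when the script starts and refuse to enqueue from any process whose PID has changed.

import os

_STARTUP_PID = os.getpid()  # recorded once, in the original process

def launch_once(task, nodes, gpus):
    # Catches the fork case: a forked child inherits _STARTUP_PID but has a new PID,
    # so it cannot enqueue the task a second time.
    if os.getpid() != _STARTUP_PID:
        raise RuntimeError("launch_multi_node called from a forked process")
    return task.launch_multi_node(nodes, devices=gpus, hide_children=True)

if __name__ == "__main__":  # also guards against spawn-style re-imports of the module
    # task = Task.init(...)            # as in the original script
    # config = launch_once(task, 2, 4)
    pass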
  
  
Posted 15 hours ago

@<1523701435869433856:profile|SmugDolphin23> Everything worked after setting the variables: --env NCCL_IB_DISABLE=1 --env NCCL_SOCKET_IFNAME=ens192 --env NCCL_P2P_DISABLE=1. But previously, these variables were not required for a successful launch. When I run DDP training with two nodes, everything works for me now. But as soon as I increase their number (nodes > 2), I get the following error.

Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.11/code/lightning_ddp_rc.py", line 104, in <module>
    config = task.launch_multi_node(nodes, devices=gpus, hide_children=True, wait=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 2195, in launch_multi_node
    Task.enqueue(node, queue_id=self.data.execution.queue)
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 1557, in enqueue
    raise exception
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/task.py", line 1548, in enqueue
    res = cls._send(session=session, req=req)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/backend_interface/base.py", line 107, in _send
    raise SendError(res, error_msg)
clearml.backend_interface.session.SendError: Action failed <400/706: tasks.enqueue/v1.0 (Failed adding task to queue since task is already queued: task=88808574c7c648ac97bd18303c230710)> (queue=1f0eee180f3d43ddbb432badf328e85b, task=88808574c7c648ac97bd18303c230710, verify_watched_queue=False)
2024-12-04 10:42:25
Process failed, exit code 1
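
For reference, the NCCL variables mentioned above can also be exported from the training script itself, as long as this runs on every node before the process group is initialized (ens192 is the interface name on my hosts; adjust or drop it). A rough sketch:

import os

os.environ.setdefault("NCCL_IB_DISABLE", "1")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens192")  # host-specific interface name
os.environ.setdefault("NCCL_P2P_DISABLE", "1")
os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose NCCL logs, as requested below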
  
  
Posted 18 hours ago

Hi @<1578555761724755968:profile|GrievingKoala83> ! Can you share the logs of all the tasks after setting NCCL_DEBUG=INFO? Also, did it work for you 5 months ago because you were on another clearml version? If it works with another version, can you share that version number?

  
  
Posted one day ago