
Hi everyone! I'm trying to use task.launch_multi_node(nodes, devices=gpus, hide_children=True) in conjunction with pytorch-lightning. I am using the latest version of clearml, 1.16.5. If I specify DDPStrategy(process_group_backend="nccl") as the strategy and set nodes >= 2, the following error occurs:

[rank3]:     work = default_pg.broadcast([tensor], opts)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank3]: Last error:
[rank3]: socketStartConnect: Connect to 10.217.6.2<33411> failed : Software caused connection abort

A single node with the nccl backend works, and several nodes with the gloo backend also work. I did not get this error 5 months ago.

  
  
Posted one day ago

Answers


Hi @GrievingKoala83! Can you share the logs of all the tasks after setting NCCL_DEBUG=INFO? Also, did it work for you 5 months ago because you were on a different clearml version? If it works with another version, can you share that version number?
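
One possible way to turn on the NCCL logging mentioned above is to set the variable from inside the training script, so every process that runs the script gets it. This is only a minimal sketch: the helper name enable_nccl_debug is made up for illustration, and it assumes the variable is set before the NCCL process group is initialized.

```python
import os


def enable_nccl_debug(env=os.environ):
    """Set NCCL_DEBUG=INFO in the given environment mapping.

    Hypothetical helper for illustration only. NCCL reads this variable
    when it initializes, so it has to be set before the process group
    is created.
    """
    env["NCCL_DEBUG"] = "INFO"
    return env["NCCL_DEBUG"]


# Call this at the very top of the training script, before creating
# the trainer or any distributed process group.
enable_nccl_debug()
```

If the call sits before task.launch_multi_node(...) and before the Lightning Trainer is created, the extra NCCL output should appear in the console log of each task, assuming the child node tasks run the same script. Setting the variable in the agent or container environment should work as well.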

  
  
Posted one day ago