Answered
Hello, we use clearml with a torch.distributed (DDP, on only 1 machine but with multiple process) training, and we found that clearml intercepts and changes the exit code of our process (i.e. exit(1) does not exit 1 anymore), and torch.multiprocessing.spa

Hello, we use clearml with torch.distributed training (DDP, on a single machine but with multiple processes), and we found that clearml intercepts and changes the exit code of our processes (i.e. exit(1) no longer exits with code 1). torch.multiprocessing.spawn normally stops if any worker fails (exception, exit code != 0), but with clearml the workers fail silently and the whole training hangs indefinitely.
Has anyone encountered a problem like this before? Is it possible to disable the exit hooks/signal interception of clearml.Task?
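
To illustrate why a masked exit code can turn into a hang, here is a simplified model of a "stop on first failure" launcher. It is only a sketch built on the standard library, not the actual torch.multiprocessing.spawn implementation; the `launch` helper and its parameters are hypothetical names chosen for this example.

```python
import subprocess
import sys
import time


def launch(worker_sources, poll_interval=0.05):
    """Start one Python interpreter per source string.

    Returns the first non-zero exit code seen (killing the remaining
    workers), or 0 once every worker has exited cleanly.
    """
    procs = [subprocess.Popen([sys.executable, "-c", src]) for src in worker_sources]
    try:
        while True:
            codes = [p.poll() for p in procs]  # None means "still running"
            failed = [c for c in codes if c not in (None, 0)]
            if failed:
                return failed[0]
            if all(c == 0 for c in codes):
                return 0
            time.sleep(poll_interval)
    finally:
        # Clean up any worker that is still alive when we stop waiting.
        for p in procs:
            if p.poll() is None:
                p.kill()
            p.wait()


if __name__ == "__main__":
    # One worker fails immediately, the other would keep running for a while.
    print(launch(["import sys; sys.exit(1)", "import time; time.sleep(30)"]))
```

In this model the launcher only reacts to a non-zero exit code. If something in the failing worker turned its exit(1) into an exit code of 0, the loop would keep waiting on the surviving worker, which is consistent with the hang described above.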

  
  
Posted 3 years ago

Answers 3


VirtuousFish83 is the exit(1) called from the main process or a subprocess? Are you running it with an agent?

  
  
Posted 3 years ago

VirtuousFish83 Hi 🙂

Which versions are you running (ClearML, ClearML-Agent, Torch, Lightning)? Which OS are they running on, and with which Python version?

Do you maybe have a snippet to play around with to try and reproduce the issue?

  
  
Posted 3 years ago

I'm not using clearml-agent here, I use clearml.Task.init.
The exit(1) (or raised exception) is from a subprocess.
clearml==1.1.3
torch==1.9.0+cu111, torchvision==0.10, lightning not installed
python3.8
debian 10
I will try to reproduce it with a smaller script; it was a training run with detectron2, which uses torch.multiprocessing.spawn and torch.distributed.init_process_group:
https://github.com/facebookresearch/detectron2/blob/c47167e4ac236a36895c294735a908b75f659f96/tools/train_net.py#L163
https://github.com/facebookresearch/detectron2/blob/c47167e4ac236a36895c294735a908b75f659f96/detectron2/engine/launch.py#L27
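
A smaller reproduction could start from something like the sketch below, which checks whether a child interpreter's exit code survives. It is not the code from the training run above: `child_exit_code` is a hypothetical helper, and the project and task names in the commented-out preamble are placeholders.

```python
import subprocess
import sys


def child_exit_code(body, preamble=""):
    """Run `preamble` followed by `body` in a fresh Python interpreter
    and return that interpreter's exit code."""
    source = preamble + "\n" + body
    return subprocess.run([sys.executable, "-c", source]).returncode


if __name__ == "__main__":
    # Baseline without any hooks installed: exit(1) should surface as 1.
    print(child_exit_code("import sys; sys.exit(1)"))

    # To compare, initialize the task inside the child first (requires clearml
    # to be installed and configured; the names here are placeholders):
    # preamble = "from clearml import Task; Task.init(project_name='debug', task_name='exit-code')"
    # print(child_exit_code("import sys; sys.exit(1)", preamble))
```

If the two calls report different exit codes, that would isolate the behavior from detectron2 and torch.distributed.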

  
  
Posted 3 years ago