Hello, we use ClearML with torch.distributed training (DDP, on a single machine but with multiple processes). We found that ClearML intercepts and changes the exit code of our processes (i.e. exit(1) no longer exits with code 1). torch.multiprocessing.spawn normally stops if any worker fails (exception or non-zero exit code), but with ClearML the workers fail silently and the whole training hangs indefinitely.
Has anyone encountered a problem like this before? Is it possible to disable the exit hooks/signal interception of clearml.Task?
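
To make the expected behavior concrete, here is a minimal sketch of fail-fast supervision. It is a standard-library stand-in for the idea, not the actual implementation of torch.multiprocessing.spawn, and `run_fail_fast` is a hypothetical helper name. Each worker is a separate Python interpreter, and the parent stops everything as soon as one worker reports a non-zero exit code.

```python
import subprocess
import sys
import time


def run_fail_fast(sources, poll_interval=0.1, timeout=30.0):
    """Run each source string in its own interpreter; stop all on first failure.

    Returns the first non-zero exit code seen, or 0 if every worker succeeded.
    """
    procs = [subprocess.Popen([sys.executable, "-c", src]) for src in sources]
    deadline = time.monotonic() + timeout
    try:
        while True:
            # poll() returns None while a worker is still running.
            codes = [p.poll() for p in procs]
            failed = [c for c in codes if c not in (None, 0)]
            if failed:
                return failed[0]
            if all(c == 0 for c in codes):
                return 0
            if time.monotonic() > deadline:
                raise TimeoutError("workers did not finish in time")
            time.sleep(poll_interval)
    finally:
        # Stop any worker that is still alive, then reap all of them.
        for p in procs:
            if p.poll() is None:
                p.terminate()
            p.wait()


if __name__ == "__main__":
    workers = [
        "import time; time.sleep(20)",  # healthy worker that would keep running
        "import sys; sys.exit(1)",      # failing worker
    ]
    print("first failing exit code:", run_fail_fast(workers))
```

This kind of supervision depends entirely on the exit code the parent sees. If a failed worker's exit code were reported as 0, the loop above would keep waiting on the remaining workers, which matches the hang described here.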

Posted 2 years ago
I'm not using clearml-agent here, I use clearml.Task.init.
The exit(1) (or raised exception) is from a subprocess.
torch==1.9.0+cu111, torchvision==0.10, lightning not installed
Debian 10
I will try to reproduce it with a smaller piece of code. It was a training run with detectron2, which uses torch.multiprocessing.spawn and torch.distributed.init_process_group.
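
A small check along these lines might help narrow it down. It is only a sketch: `child_exit_code` is a hypothetical helper, and the project and task names passed to Task.init are placeholders. It runs a snippet in a fresh interpreter and returns the exit code the parent observes, so the same child can be compared with and without a ClearML task.

```python
import subprocess
import sys


def child_exit_code(source: str, timeout: float = 60.0) -> int:
    """Run `source` in a fresh Python interpreter and return its exit code."""
    result = subprocess.run([sys.executable, "-c", source], timeout=timeout)
    return result.returncode


# Baseline without ClearML: the exit code should come back unchanged.
BASELINE = "import sys; sys.exit(1)"

# Same child, but with a ClearML task created first.
# Assumption: clearml is installed and configured; the names are placeholders.
WITH_CLEARML = (
    "import sys\n"
    "from clearml import Task\n"
    "Task.init(project_name='debug', task_name='exit-code-check')\n"
    "sys.exit(1)\n"
)

if __name__ == "__main__":
    print("baseline exit code:", child_exit_code(BASELINE))
    # Uncomment on a machine where clearml is set up:
    # print("with ClearML:", child_exit_code(WITH_CLEARML))
```

If the two numbers differ, that would point at the exit handling around Task.init rather than at detectron2 or torch.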

Posted 2 years ago

VirtuousFish83 Hi 🙂

Which versions are you running (ClearML, ClearML-Agent, Torch, Lightning)? Which OS are they running on, and with which Python version?

Do you maybe have a snippet to play around with to try and reproduce the issue?

Posted 2 years ago

VirtuousFish83 is the exit(1) called from the main process or a subprocess? Are you running it with an agent?

Posted 2 years ago