VirtuousFish83 is the exit(1) called from the main process or a subprocess? Are you running it with an agent?
VirtuousFish83 Hi 🙂
Which versions of ClearML, ClearML-Agent, Torch, and Lightning are you running? Which OS and Python version are they running on?
Do you have a snippet we could play around with to try to reproduce the issue?
I'm not using clearml-agent here; I use clearml.Task.init.
The exit(1) (or raised exception) is from a subprocess.
clearml==1.1.3
torch==1.9.0+cu111, torchvision==0.10, lightning not installed
python3.8
debian 10
I will try to reproduce it with a smaller script. It was a training run with detectron2, which uses torch.multiprocessing.spawn and torch.distributed.init_process_group:
https://github.com/facebookresearch/detectron2/blob/c47167e4ac236a36895c294735a908b75f659f96/tools/train_net.py#L163
https://github.com/facebookresearch/detectron2/blob/c47167e4ac236a36895c294735a908b75f659f96/detectron2/engine/launch.py#L27
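As a starting point for a smaller repro, here is a rough sketch of just the failure pattern: a parent process launches a child that exits with code 1, and the parent reads that exit code. It is only a stand-in. It uses the standard-library `subprocess` module instead of torch.multiprocessing.spawn, it does not call torch.distributed.init_process_group, and it does not involve ClearML at all. The `run_worker` helper is a hypothetical name made up for illustration. In the real setup, clearml.Task.init would be called in the parent before the workers are launched.

```python
import subprocess
import sys


def run_worker(exit_code: int) -> int:
    """Launch a child Python process that exits with `exit_code`.

    Hypothetical helper: returns the exit code as observed by the parent.
    """
    # The child does nothing except exit with the requested code.
    child_source = f"import sys; sys.exit({exit_code})"
    completed = subprocess.run([sys.executable, "-c", child_source])
    return completed.returncode


if __name__ == "__main__":
    # Assumption: in the real script, clearml.Task.init(...) would be called
    # here, in the parent, before any worker is started.
    for requested in (0, 1):
        observed = run_worker(requested)
        print(f"child requested exit({requested}), parent observed {observed}")
```

Swapping the `subprocess` call for torch.multiprocessing.spawn and adding Task.init in the parent would bring this closer to the detectron2 launch path linked above.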