Hi, and thanks for the great system.
I've been training with https://github.com/facebookresearch/detectron2 (PyTorch framework), logging to TensorBoard, and I'm now trying to switch to ClearML. Training on a single GPU works fine with ClearML, and the reporting looks correct.
When I try to use multiple GPUs (Detectron2 uses DistributedDataParallel), I run into problems. First, as soon as training starts I get this exception:
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/site-packages/clearml/utilities/process/mp.py", line 514, in _daemon
    self.daemon()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/site-packages/clearml/backend_interface/metrics/reporter.py", line 138, in daemon
    self._res_waiting.acquire()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/site-packages/clearml/utilities/process/mp.py", line 87, in acquire
    self._create()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/site-packages/clearml/utilities/process/mp.py", line 53, in _create
    self._sync = self._functor()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/context.py", line 82, in Semaphore
    return Semaphore(value, ctx=self.get_context())
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/synchronize.py", line 126, in __init__
    SemLock.__init__(self, SEMAPHORE, value, SEM_VALUE_MAX, ctx=ctx)
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
    register(self._semlock.name)
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
    self._send('REGISTER', name)
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
    self.ensure_running()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/semaphore_tracker.py", line 46, in ensure_running
    pid, status = os.waitpid(self._pid, os.WNOHANG)
ChildProcessError: [Errno 10] No child processes
The training proceeds, but the scalars I report with Logger.current_logger().report_scalar(...) are never flushed and never reach the server/UI.
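For reference, this is roughly how I report the scalars (a simplified sketch, not my exact training code; report_rank0 is a small helper I wrote for this post, not a Detectron2 or ClearML API, that restricts reporting to the rank-0 DDP worker):

```python
import os


def report_rank0(title, series, value, iteration):
    """Report a scalar to ClearML from the rank-0 worker only."""
    # RANK is set by the torch.distributed launch utilities; assume
    # rank 0 when it is absent (single-process run).
    if int(os.environ.get("RANK", "0")) != 0:
        return False  # non-zero ranks skip reporting entirely
    try:
        from clearml import Logger
    except ImportError:
        # ClearML not installed in this interpreter; nothing to report.
        return True
    logger = Logger.current_logger()
    if logger is not None:  # None when no Task has been initialized
        logger.report_scalar(title, series, value, iteration)
    return True
```

Called from the training loop as, e.g., report_rank0("loss", "train", loss_value, iteration).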
Also, at some point all but one of the GPUs stop, while the remaining one goes to 100% utilization. I also see this message:
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
I'm using:
clearml 1.1.6
detectron2 0.6+cu102
pytorch 1.10.2 (py3.7_cuda10.2_cudnn7.6.5_0)
torchvision 0.11.3 (py37_cu102)
Any idea what could be causing this?