
Hi, and thanks for the great system.
I've been training using https://github.com/facebookresearch/detectron2 (PyTorch framework) with logging in Tensorboard, and I'm trying to switch to ClearML. Training with a single GPU works fine with ClearML and the reporting looks fine.
When I try to use multiple GPUs (Detectron2 uses DistributedDataParallel) I encounter some problems. First, when I start training I get this:
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/site-packages/clearml/utilities/process/mp.py", line 514, in _daemon
    self.daemon()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/site-packages/clearml/backend_interface/metrics/reporter.py", line 138, in daemon
    self._res_waiting.acquire()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/site-packages/clearml/utilities/process/mp.py", line 87, in acquire
    self._create()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/site-packages/clearml/utilities/process/mp.py", line 53, in _create
    self._sync = self._functor()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/context.py", line 82, in Semaphore
    return Semaphore(value, ctx=self.get_context())
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/synchronize.py", line 126, in __init__
    SemLock.__init__(self, SEMAPHORE, value, SEM_VALUE_MAX, ctx=ctx)
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
    register(self._semlock.name)
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
    self._send('REGISTER', name)
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
    self.ensure_running()
  File "/homes/pazb/.conda/envs/dtron_cud_devkit/lib/python3.7/multiprocessing/semaphore_tracker.py", line 46, in ensure_running
    pid, status = os.waitpid(self._pid, os.WNOHANG)
ChildProcessError: [Errno 10] No child processes

The training proceeds, but the scalars I write using Logger.current_logger().report_scalar(...) don't get flushed and don't get to the server/UI.
Also, at some point all but one GPU stop, and one goes to 100% utilization. I also see this print:
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start

I'm using:
clearml=1.1.6
detectron2=0.6+cu102
pytorch=1.10.2=py3.7_cuda10.2_cudnn7.6.5_0
torchvision=0.11.3=py37_cu102
Any idea what could be causing this?
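(For reference, the report_scalar call mentioned above is used roughly as in the following sketch; the project/task names, the helper function, and the rank-0 gating are illustrative placeholders, not the actual Detectron2 training code.)

# Minimal sketch of the reporting pattern referenced above (placeholder names,
# not the actual Detectron2 hook).
import torch.distributed as dist
from clearml import Task, Logger

task = Task.init(project_name="detectron2-experiments", task_name="ddp-run")  # placeholder names

def report_loss(iteration, loss_value):
    # In this sketch only rank 0 reports, so a single process pushes scalars to the server.
    if not dist.is_initialized() or dist.get_rank() == 0:
        Logger.current_logger().report_scalar(
            title="train", series="total_loss", value=loss_value, iteration=iteration
        )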

  
  
Posted 2 years ago

5 Answers


Great! btw: final v1.2.0 should be out after the weekend

  
  
Posted 2 years ago

I'm running locally, yes, and I have 8 GPUs. Here's an MWE of https://raw.githubusercontent.com/allegroai/clearml/master/examples/frameworks/pytorch/pytorch_distributed_example.py combined with https://pytorch.org/tutorials/intermediate/ddp_tutorial.html: https://gist.github.com/pazbunis/97a65adbab073dcdcb90954e4e346892
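(For readers who can't open the gist, the combination looks roughly like the sketch below; it follows the same idea but is not the gist's actual code, and the toy model, port, backend, and project/task names are placeholders.)

# Sketch: Task.init() in the launching process, one forked worker per rank,
# each wrapping a toy model in DDP and reporting a scalar through the shared task.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from clearml import Task


def worker(rank, world_size):
    # Each worker joins the process group and runs one toy forward/backward pass.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(nn.Linear(10, 1))            # CPU + gloo keeps the sketch hardware-agnostic
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()

    # Report from the forked worker process (ClearML's distributed example also
    # reports from its subprocesses).
    Task.current_task().get_logger().report_scalar(
        title="debug", series="loss_rank_%d" % rank, value=loss.item(), iteration=0
    )
    dist.destroy_process_group()


if __name__ == "__main__":
    Task.init(project_name="ddp-mwe", task_name="clearml-ddp-sketch")  # placeholder names
    world_size = 2                            # bump to 8 to match the setup described above
    procs = [mp.Process(target=worker, args=(r, world_size)) for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()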

Much appreciated!

  
  
Posted 2 years ago

AgitatedDove14 it works as expected in 1.2.0rc2. Thanks! 🙂

  
  
Posted 2 years ago

Is this all happening when you're running locally? How many GPUs do you have/try to run on? Also, can you provide an example code snippet so I can run something basic and try to reproduce a similar failure? I think I have a machine with multiple GPUs that I can try playing on 🙂

  
  
Posted 2 years ago

Hi StickyWhale51,
I think this issue is due to an internal race condition. Anyhow, I think we have an RC out that solves it. Can you try with:
pip install clearml==1.2.0rc2
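(A quick way to confirm the RC is the version actually being imported in the training environment; just a sanity check, nothing ClearML-specific beyond the package's version attribute.)

# Check that the environment picked up the RC rather than an older install.
import clearml
print(clearml.__version__)  # should print 1.2.0rc2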

  
  
Posted 2 years ago