Hey there, for a while now I've often found experiments getting stuck while training a model.
It seems to happen randomly and I could not find a reproducible scenario so far, but it happens often enough to be annoying (I'd say 1 out of 5 experiments).
The symptoms:
Hi @<1523701205467926528:profile|AgitatedDove14> , I want to circle back on this issue. It is still relevant, and I could collect the following on an EC2 instance where a clearml-agent was running a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process left behind. I guess these are zombies. Please check the htop tree (a quick psutil check for listing them is sketched after the note below).
- There is a memory leak somewhere; please see the screenshot of the Datadog memory consumption.

I don't know yet what to explore first. My first assumption is that this bug is due to a recent version of clearml-sdk/clearml-agent/Python/PyTorch (the training used to work smoothly a couple of months ago). Now I get this problem on all my experiments.
Note: I think the memory growth is a consequence of the multiprocessing zombie bug, because on some experiments the memory grows but the experiment gets stuck before reaching the memory limit (see the 2nd screenshot of Datadog memory consumption). But that's just a hypothesis.
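To make the zombie claim a bit more concrete, here is the kind of quick check I run on the instance (a minimal sketch, assuming psutil is installed; TRAIN_PID is a placeholder for the main training process PID taken from htop):

```python
import psutil

TRAIN_PID = 12345  # placeholder: PID of the main training process, taken from htop

parent = psutil.Process(TRAIN_PID)
print(f"parent {parent.pid}: status={parent.status()}, "
      f"rss={parent.memory_info().rss / 1e6:.1f} MB")

# Walk all forked descendants and flag the ones that are zombies.
for child in parent.children(recursive=True):
    try:
        status = child.status()
        if status == psutil.STATUS_ZOMBIE:
            # Zombies hold no memory themselves, but their presence means the
            # parent never reaped them (e.g. leftover DataLoader workers).
            print(f"child {child.pid}: ZOMBIE")
        else:
            print(f"child {child.pid}: status={status}, "
                  f"rss={child.memory_info().rss / 1e6:.1f} MB")
    except psutil.NoSuchProcess:
        continue  # child exited between listing and inspection
```

On the stuck runs this shows a long tail of forked children that never go away, which is what made me suspect the multiprocessing side rather than the model itself.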
Versions: Python 3.8 / PyTorch 1.11 / clearml-sdk 1.9.0 / clearml-agent 1.4.1
Right now my next steps for debugging would be:
- Train without the ClearML integration -> if it works, check which version of the clearml sdk/agent is responsible (see the sketch after this list)
- Train with an older version of the training code -> if it works, look for guilty code changes
- Train with different Python/PyTorch versions
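For the first step, what I mean by "train without the ClearML integration" is just gating Task.init behind an environment variable, so the exact same script can run with and without it (a minimal sketch; the DISABLE_CLEARML variable and the project/task names are made up for illustration):

```python
import os

task = None
if os.environ.get("DISABLE_CLEARML", "0") != "1":
    from clearml import Task
    # Same call as in the real training script; names are illustrative only.
    task = Task.init(project_name="debug-stuck-runs", task_name="repro-attempt")

# ... rest of the training script (model, DataLoader, training loop) unchanged ...
```

If the run without Task.init is clean, I would then re-enable it but try Task.init's auto_connect_frameworks / auto_resource_monitoring switches to narrow down which part of the integration is involved.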