Hey there, for a while now I've often found experiments getting stuck while training a model.
It seems to happen randomly and I could not find a reproducible scenario so far, but it happens often enough to be annoying (I'd say 1 out of 5 experiments).
The symptoms:
Hi @<1523701205467926528:profile|AgitatedDove14> , I want to circle back on this issue. It is still relevant, and I could collect the following on an EC2 instance where a clearml-agent was running a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process left behind. I guess these are zombies. Please check the htop tree (a quick psutil check for listing them is sketched after the note below).
- There is a memory leak somewhere; please see the screenshot of the Datadog memory consumption.

I don't know yet what to explore first. My first assumption is that this bug is due to a recent version of clearml-sdk/clearml-agent/Python/PyTorch (the training used to work smoothly a couple of months ago). Now I get this problem on all my experiments.
Note: I think the memory growth is a consequence of the multiprocessing zombie bug, because on some experiments the memory grows but the experiment gets stuck before reaching the memory limit (see the 2nd screenshot of Datadog memory consumption). But that's just a hypothesis.
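To make the zombie claim a bit more concrete, here is the kind of quick check I run on the instance (a minimal sketch, assuming psutil is installed; TRAIN_PID is a placeholder for the main training process PID taken from htop):

```python
import psutil

TRAIN_PID = 12345  # placeholder: PID of the main training process, taken from htop

parent = psutil.Process(TRAIN_PID)
print(f"parent {parent.pid}: status={parent.status()}, "
      f"rss={parent.memory_info().rss / 1e6:.1f} MB")

# Walk all forked descendants and flag the ones that are zombies.
for child in parent.children(recursive=True):
    try:
        status = child.status()
        if status == psutil.STATUS_ZOMBIE:
            # Zombies hold no memory themselves, but their presence means the
            # parent never reaped them (e.g. leftover DataLoader workers).
            print(f"child {child.pid}: ZOMBIE")
        else:
            print(f"child {child.pid}: status={status}, "
                  f"rss={child.memory_info().rss / 1e6:.1f} MB")
    except psutil.NoSuchProcess:
        continue  # child exited between listing and inspection
```

On the stuck runs this shows a long tail of forked children that never go away, which is what made me suspect the multiprocessing side rather than the model itself.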
Versions: Python 3.8 / PyTorch 1.11 / clearml-sdk 1.9.0 / clearml-agent 1.4.1
Right now my next steps for debugging would be:
- Train without the ClearML integration -> if it works, check which version of the clearml sdk/agent is responsible (see the sketch after this list)
- Train with an older version of the training code -> if it works, look for guilty code changes
- Train with different Python/PyTorch versions
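For the first step, what I mean by "train without the ClearML integration" is just gating Task.init behind an environment variable, so the exact same script can run with and without it (a minimal sketch; the DISABLE_CLEARML variable and the project/task names are made up for illustration):

```python
import os

task = None
if os.environ.get("DISABLE_CLEARML", "0") != "1":
    from clearml import Task
    # Same call as in the real training script; names are illustrative only.
    task = Task.init(project_name="debug-stuck-runs", task_name="repro-attempt")

# ... rest of the training script (model, DataLoader, training loop) unchanged ...
```

If the run without Task.init is clean, I would then re-enable it but try Task.init's auto_connect_frameworks / auto_resource_monitoring switches to narrow down which part of the integration is involved.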