You mean you "aborted the task" from the UI?
Yes exactly
I'm assuming from the leftover processes ?
Most likely yes, but I don't see how clearml would have an impact here. I am more inclined to think it is a pytorch dataloader issue, although I don't see why
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
yes in venv mode, I'll try with the latest version as well
Any chance this is reproducible ?
Unfortunately not at the moment, I couldn't find a reproducible scenario. If I clone a task that was stuck and start it, it might not get stuck
How many processes do you see running (i.e. ps -Af | grep python) ?
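For reference, a rough Python equivalent of that ps check (just a sketch; it assumes psutil is installed, which is not part of clearml):
```
# Sketch: list running python processes, roughly equivalent to `ps -Af | grep python`
# (assumes `pip install psutil`)
import psutil

python_procs = [
    p for p in psutil.process_iter(["pid", "name", "cmdline", "status"])
    if "python" in (p.info["name"] or "")
]
for p in python_procs:
    print(p.info["pid"], p.info["status"], " ".join(p.info["cmdline"] or []))
print("total:", len(python_procs))
```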
I will check that when the next one gets blocked 👍
What is the training framework? is it multiprocess ? how are you launching the process itself? is it Linux OS? is it running inside a specific container ?
I train with pytorch (1.11) and ignite (0.4.8), using multiprocessing (via the dataloader with num_workers=8) on Linux, not running inside a docker container
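For context, a simplified sketch of that kind of dataloader setup (the dataset and every value except num_workers=8 are illustrative, not the actual training code):
```
# Minimal illustration of a multi-process dataloader: num_workers=8 forks 8 worker
# processes on Linux, which is where the leftover processes are suspected to come from
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):          # placeholder dataset, not the real one
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 10

train_loader = DataLoader(
    DummyDataset(),
    batch_size=32,                    # illustrative value
    shuffle=True,
    num_workers=8,                    # the 8 dataloader worker processes
)
```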
Hi AgitatedDove14, sorry somehow this message got lost 😄
clearml version is the latest at the time, 1.7.1
Yes, I always see the "model uploaded completed" for such stuck tasks. I am using python 3.8.10
Hi @<1523701205467926528:profile|AgitatedDove14>, I want to circle back on this issue. It is still relevant, and I collected the following on an ec2 instance where a clearml-agent was running a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process. I guess these are zombies. Please check the htop tree (see the inspection sketch below).
- There is a memory leak somewhere, please see the screenshot of datadog memory consumption
I don't know yet what to explore first. My first assumption is that this bug is due to a recent version of clearml-sdk/clearml-agent/python/pytorch (the training used to work smoothly a couple of months ago). Now I get this problem on all my experiments.
Note: I think the memory leak is a consequence of the multiprocessing zombie bug, because in some experiments the memory grows but the experiment gets stuck before reaching max memory (see 2nd screenshot of datadog mem consumption). But that's just a hypothesis
Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1
Right now my next steps for debugging would be:
- Train without the clearml integration -> if it works, check which version of the clearml sdk/agent is responsible
- Train with an older version of the training code -> if it works, look for the guilty code changes
- Train with different python/pytorch version
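For reference, this is roughly how the leftover children of the training process can be listed (a sketch only; it assumes psutil is installed, and the PID is hypothetical, taken from htop/ps in practice):
```
# Sketch: inspect the children forked by the main training process
# (assumes `pip install psutil`; TRAIN_PID is hypothetical, read it from htop/ps)
import psutil

TRAIN_PID = 12345                      # hypothetical PID of the main training process

parent = psutil.Process(TRAIN_PID)
for child in parent.children(recursive=True):
    # zombie children show up with status == psutil.STATUS_ZOMBIE
    print(child.pid, child.status(), child.name())
```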
There seems to be a problem with multiprocessing: Although I stopped the task,
You mean you "aborted the task" from the UI?
- There is a memory leak somewhere, please see the screenshot of datadog memory consumption
I'm assuming from the leftover processes ?
Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
JitteryCoyote63 what's the clearml version ?
Are you always seeing the "model uploaded completed" message ?
What's the python version you are using?
Note: Could be related to https://github.com/allegroai/clearml/issues/790 , not sure
@<1523701205467926528:profile|AgitatedDove14> I see other rc in pypi but no corresponding tags in the clearml-agent repo? are these releases legit?
What is the latest rc of clearml-agent? 1.5.2rc0?
Most likely yes, but I don't see how clearml would have an impact here. I am more inclined to think it is a pytorch dataloader issue, although I don't see why
These are most certainly dataloader processes. But when killing the process, clearml-agent should also kill all subprocesses, and it might be that something is preventing it from killing the subprocesses ...
Is this easily reproducible ? Can you verify it is still the case with the latest RC of clearml-agent ?
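For reference, the usual way to guarantee that on Linux is to put the training job in its own process group and signal the whole group on abort. A minimal sketch of that technique (not the actual clearml-agent code; the entry point name is hypothetical):
```
# Sketch: launch the training in its own process group and kill the whole group,
# so the dataloader worker subprocesses die together with the parent
import os
import signal
import subprocess

proc = subprocess.Popen(
    ["python", "train.py"],       # hypothetical entry point
    start_new_session=True,       # child becomes leader of a new session/process group
)

# ... later, when aborting the task:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)   # signal parent + all its workers
```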
Hmm, #790 should be solved in 1.7.2
Yes, I always see the "model uploaded completed" for such stuck tasks
Any chance this is reproducible ?
How many processes do you see running (i.e. ps -Af | grep python) ?
What is the training framework? is it multiprocess ? how are you launching the process itself? is it Linux OS? is it running inside a specific container ?
Any insight will help. If you can provide the log of the Task that did get stuck, that would be a good start