GreasyPenguin14 GrittyKangaroo27 the new release contains a fix. Could you verify that it solves the issue in your scenario as well? There is now a smart timeout that detects the inconsistent state, which means that in these specific rare scenarios the close/exit procedure might be delayed (10 sec) instead of hanging.
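For readers following along, here is a minimal sketch of the general idea of a bounded wait on shutdown. It is not ClearML's actual implementation; the helper name `join_or_terminate` and the timeout values are hypothetical, and the fork start method is assumed because the reports in this thread are on Ubuntu.

```python
import multiprocessing as mp
import time


def join_or_terminate(proc, timeout):
    """Wait up to `timeout` seconds for `proc`; terminate it if it is still alive.

    Returns True if the process exited on its own, False if it had to be terminated.
    """
    proc.join(timeout)
    if proc.is_alive():
        # The child did not finish in time: stop waiting and kill it,
        # so the caller is delayed by at most `timeout` instead of hanging.
        proc.terminate()
        proc.join()
        return False
    return True


if __name__ == "__main__":
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux/macOS)
    stuck = ctx.Process(target=time.sleep, args=(60,))  # stands in for a hung child
    stuck.start()
    print(join_or_terminate(stuck, timeout=1.0))
```

The trade-off matches the one described above: exit may take up to the timeout in the rare bad case, but it always completes.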
AgitatedDove14 Also, I found out that adding "pool.join()" after pool.close() seems to solve the issue in the minimal example.
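A minimal sketch of the close-then-join pattern mentioned here, using only the standard library. The function name `factorials` is hypothetical, and the fork context is an assumption chosen to mirror the Ubuntu setups in this thread; it is not taken from the minimal example itself.

```python
import math
import multiprocessing as mp


def factorials(values):
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux/macOS)
    pool = ctx.Pool(processes=2)
    try:
        results = pool.map(math.factorial, values)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit before continuing
    return results


if __name__ == "__main__":
    print(factorials([3, 4, 5]))
```

Calling `join()` makes the parent wait until the workers have actually exited, so nothing that runs afterwards (such as closing a task) races with worker shutdown.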
Ubuntu 18.04 and Python 3.6. The subprocess is created by subclassing multiprocessing.Process and then calling the .start() method.
Hi CostlyOstrich36 , thanks for the quick reply.
I'm running that on a Nvidia DGX-A100 computer with Ubuntu 20.04.3 LTS installed, Python 3.8.10, and clearml=1.1.4.
The ClearML server is not the latest version, though; I used "docker-compose.yml" version 3.6 to launch it.
Thanks SolidSealion72 !
Also, I found out that adding "pool.join()" after pool.close() seems to solve the issue in the minimal example.
This is interesting, I'm pretty sure it has something to do with the subprocess not "closing" properly (or too fast or something)
Let me see if I can reproduce
btw, regarding "# in another process":
How do you spin up the subprocess, is it with Popen?
Also, what OS and Python version are you using?
Go to https://hub.gke2.mybinder.org/user/jupyterlab-jupyterlab-demo-0570jy0h/lab/tree/demo and run the jupyter notebook called "main.ipynb". I've run the only cell in it several times and now it's stuck.
You can see the corresponding task at https://demoapp.trains.allegro.ai/projects/98cdb5ace38946a690daec8efd668a76/experiments/db0088e1207440319d80acc3ac89aafb/execution?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=-last_update . There are now two tasks in a "Running" state, one of them is generated by another run that I've made in a terminal and terminated.
SolidSealion72 EcstaticGoat95 I'm hoping the issue is now resolved 🤞
Can you verify with pip install git+ ?
It does work about 50% of the time
EcstaticGoat95 what do you mean by "work about 50%"? Do you mean it hangs the other 50%?
EcstaticGoat95 , I couldn't reproduce the issue with 1.1.4 and the provided code. I tried with and without task.close() at the end. Can you please specify which OS / Python version you're using?
Hi GreasyPenguin14
This is what I did, but I could not reproduce the hang, how is this different from your code?
` from multiprocessing import Process

import numpy as np
from matplotlib import pyplot as plt
from clearml import Task, StorageManager


class MyProcess(Process):
    def run(self):
        # in another process
        global logger
        # Create a plot
        N = 50
        x = np.random.rand(N)
        y = np.random.rand(N)
        colors = np.random.rand(N)
        area = (30 * np.random.rand(N)) ** 2  # 0 to 15 point radii
        plt.scatter(x, y, s=area, c=colors, alpha=0.5)
        # Plot will be reported automatically
        task.logger.report_matplotlib_figure(title="My Plot Title", series="My Plot Series", iteration=10, figure=plt)


Task.set_offline(True)
task = Task.init(
    project_name='debug',
    task_name='exit subprocess',
    auto_connect_arg_parser=True,
    auto_connect_streams=True,
    auto_connect_frameworks=True,
    auto_resource_monitoring=True,
)
parameters = dict(key='value')
task.connect_configuration(parameters)
logger = task.get_logger()

p = MyProcess()
p.start()
csv_file = StorageManager.get_local_copy(" ")
logger.report_table("table", "csv", iteration=0, csv=csv_file)
p.join()
task.close() `
Yes, that's what I observe on one machine. On another machine it hangs with a lower probability (1/6), but it still happens.
Quick update, I might have been able to reproduce the issue ( GreasyPenguin14 working "offline" is a great hack to accelerate debugging this issue, thank you!)
It seems to be related to the known and very annoying Python forking issue (which is why switching to the "spawn" start method solves the issue):
https://bugs.python.org/issue6721
Long story short, in some cases when forking (e.g. with ProcessPoolExecutor), Python can copy locks in a "bad" state. This means you can end up with a lock acquired by a process that has died, and from there it is quite obvious that at some point we will hang...
I think the only way to get around it is with a few predefined timeouts, so that we do not end up hanging the main process.
I'll post here once a fix is pushed to GitHub for you guys to test.
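The fork behaviour described above can be illustrated with a small standalone sketch. It is not ClearML code: the module-level `lock` stands in for any lock a library holds internally, and the names `try_lock` and `child_can_acquire_inherited_lock` are hypothetical. It assumes the fork start method is available.

```python
import multiprocessing as mp
import threading

# A plain lock, standing in for any lock a library might hold internally.
lock = threading.Lock()


def try_lock(conn):
    # Runs in the forked child: report whether the inherited lock can be taken.
    acquired = lock.acquire(timeout=0.5)
    conn.send(acquired)
    conn.close()


def child_can_acquire_inherited_lock():
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux/macOS)
    parent_conn, child_conn = ctx.Pipe()
    with lock:
        # The fork happens inside start(), while the parent still holds the lock,
        # so the child receives a copy of the lock in the "held" state.
        child = ctx.Process(target=try_lock, args=(child_conn,))
        child.start()
    # The parent has released its own lock by now, but nothing in the child
    # will ever release the child's copy.
    result = parent_conn.recv()
    child.join()
    return result


if __name__ == "__main__":
    print(child_can_acquire_inherited_lock())
```

Without the `timeout=0.5` in the child, the `acquire()` call would block forever, which is the kind of hang discussed in this thread.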
In the process MyProcess other processes are created via a ProcessPoolExecutor. In these processes calls to logger.report_matplotlib_figure are made, but I get the same issue when I remove these calls.
It looks like I don't have hanging issues when I use mp.set_start_method('spawn') at the top of the script.
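A minimal sketch of the "spawn" workaround, assuming a ProcessPoolExecutor is what creates the worker processes. The helper names `spawn_context` and `factorials_with_spawn` are hypothetical. Passing a context through `mp_context` limits the change to one executor, whereas `mp.set_start_method('spawn')` at the top of the script changes it globally.

```python
import math
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def spawn_context():
    # A context object scoped to "spawn", leaving the global default untouched.
    return mp.get_context("spawn")


def factorials_with_spawn(values):
    # Workers are started as fresh interpreters, so no locks are inherited.
    with ProcessPoolExecutor(max_workers=2, mp_context=spawn_context()) as executor:
        return list(executor.map(math.factorial, values))


if __name__ == "__main__":
    # The guard matters with "spawn": child processes re-import the main module.
    print(factorials_with_spawn([3, 4, 5]))
```

With "spawn", anything sent to the workers must be picklable and importable, which is why the sketch uses a standard-library function as the task.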
I don't have a fully reproducible example that I can share, sorry for that.
AgitatedDove14 I managed to reproduce on Ubuntu (but not on Windows):
Not every run gets stuck, sometimes it's 1 in 10 runs that gets stuck.
https://github.com/maor121/clearml-bug-reproduction
AgitatedDove14 GreasyPenguin14 Awesome!
I face the same problem.
When running the pipeline, some tasks that use multiprocessing would never be completed.
It does work about 50% of the time, so running the script several times may reveal the problem.
I tried running it in a fresh virtualenv with only clearml installed, and I see the same issue.
GreasyPenguin14
In the process MyProcess other processes are created via a ProcessPoolExecutor.
Hmm, that is interesting. The sub-process has an additional ProcessPoolExecutor inside it?
GrittyKangaroo27 if you can help with reproducible code that will be great (or any insight on reproducing the issue)
EcstaticGoat95 I can see the experiment but I cannot access the notebook (I get "Binder inaccessible").
Is this the exact script as here? https://clearml.slack.com/archives/CTK20V944/p1636536308385700?thread_ts=1634910855.059900&cid=CTK20V944
EcstaticGoat95 any chance you have an idea on how to reproduce? (even 1 out of 6 is a good start)
SolidSealion72 I'm able to reproduce, hurrah!
(and a fix is already being tested, I will keep you guys updated)
On another machine it gets stuck once every 6 runs on average.
Multiprocessing Bug
Hi guys, I'm having a similar issue with clearml 1.1.4, and I have written a small script reproducing it.
There are two scripts attached: one invokes clearml.Task, and the other is a module containing the multiprocessing code. The issue only happens when the multiprocessing code is in another file (module).
The behavior I observe is that the local execution gets stuck with the following messages, and the task on the ClearML server stays stuck in a "Running" state even after I terminate the local execution:
2021-11-10 10:54:03,066 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2021-11-10 10:54:27,945 - clearml.Task - INFO - Finished repository detection and package analysis
I tried adding an explicit invocation of task.close() with a print after it, and the code doesn't reach this print.
EcstaticGoat95 , thanks a lot! Will take a look 🙂
Tested with clearml 1.1.3 and I could not reproduce the issue 👍