This is what I did, but I could not reproduce the hang. How is this different from your code?
```python
from multiprocessing import Process

import numpy as np
from matplotlib import pyplot as plt
from clearml import Task, StorageManager


class MyProcess(Process):
    def run(self):
        # in another process
        # Create a plot
        N = 50
        x = np.random.rand(N)
        y = np.random.rand(N)
        colors = np.random.rand(N)
        area = (30 * np.random.rand(N)) ** 2  # 0 to 15 point radii
        plt.scatter(x, y, s=area, c=colors, alpha=0.5)
        # Plot will be reported automatically
        task.logger.report_matplotlib_figure(
            title="My Plot Title", series="My Plot Series", iteration=10, figure=plt
        )


task = Task.init()  # project/task arguments omitted in the original snippet
parameters = dict(key='value')
logger = task.get_logger()
p = MyProcess()
p.start()
csv_file = StorageManager.get_local_copy("...")  # URL truncated in the original snippet
logger.report_table("table", "csv", iteration=0, csv=csv_file)
```
In the process MyProcess other processes are created via a ProcessPoolExecutor. In these processes calls to logger.report_matplotlib_figure are made, but I get the same issue when I remove these calls.
It looks like I don't have hanging issues when I use
mp.set_start_method('spawn') at the top of the script.
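A minimal sketch of that workaround (the worker function is illustrative): setting the start method to "spawn" before any process is created makes each child start from a fresh interpreter instead of a fork of the parent.

```python
import multiprocessing as mp


def worker():
    print("hello from the child process")


if __name__ == "__main__":
    # "spawn" launches a fresh interpreter instead of fork()-ing,
    # so no locks are ever copied in an acquired state
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()
```

Note that `set_start_method` may only be called once per program, which is why it goes at the top of the script.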
I don't have a fully reproducible example that I can share, sorry about that.
# in another process
How do you spin up the subprocess, is it with Popen?
also what's the OS and python version you are using?
Ubuntu 18.04 and Python 3.6. The subprocess is created by subclassing multiprocessing.Process and then calling its .start() method.
Quick update, I might have been able to reproduce the issue (GreasyPenguin14, working "offline" is a great hack to accelerate debugging this issue, thank you!)
It seems it is related to the known and very annoying Python forking issue (and this is why changing to "spawn" method solves the issue):
Long story short: in some cases when forking (e.g. via ProcessPoolExecutor), Python can copy locks in a "bad" state. This means you can end up with a lock held by a process that has already died, and from there it is quite obvious that at some point we will hang...
I think the only way to get around it is with a few predefined timeouts, so that we do not end up hanging the main process.
I'll post here once a fix is pushed to GitHub for you guys to test
GreasyPenguin14 GrittyKangaroo27 the new release contains a fix, could you verify it solves the issue in your scenario as well? There is now a smart timeout to detect the inconsistent state, which means the close/exit procedure might be delayed (10 sec) instead of hanging in these specific rare scenarios.
In the process MyProcess other processes are created via a ProcessPoolExecutor.
Hmm, that is interesting. So the sub-process has an additional ProcessPoolExecutor inside it?
GrittyKangaroo27 if you can help with reproducible code that will be great (or any insight on reproducing the issue)
I face the same problem.
When running the pipeline, some tasks that use multiprocessing would never be completed.
Hi guys, I'm having a similar issue with clearml 1.1.4, and I have written a small script reproducing it.
There are two scripts attached: one invokes clearml.Task, and the other is a module containing the multiprocessing code. The issue only happens when the multiprocessing code is in a separate file (module).
The behavior I observe is that the local execution gets stuck with the following messages, and the task on the ClearML server stays in a "Running" state even after I terminate the local execution.
2021-11-10 10:54:03,066 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2021-11-10 10:54:27,945 - clearml.Task - INFO - Finished repository detection and package analysis
I tried adding an explicit invocation of task.close() with a print after it, and the code doesn't reach this print.
Hi CostlyOstrich36 , thanks for the quick reply.
I'm running that on a Nvidia DGX-A100 computer with Ubuntu 20.04.3 LTS installed, Python 3.8.10, and clearml=1.1.4.
The ClearML server is not the latest version, though; I used "docker-compose.yml" version 3.6 to launch it.
EcstaticGoat95 , I couldn't reproduce the issue with 1.1.4 and the provided code. I tried with and without
task.close() at the end. Can you please specify which OS / Python version you're using?
EcstaticGoat95 , thanks a lot! Will take a look 🙂
AgitatedDove14 GreasyPenguin14 Awesome!
It does work about 50% of the time, so running the script several times may reveal the problem.
I tried running it in a fresh virtualenv with only clearml installed, and I see the same issue.
Yes, that's what I observe on one machine. On another machine it hangs with a lower probability (1/6), but it still happens.
It does work about 50% of the times
EcstaticGoat95 what do you mean by "work about 50%"? Do you mean it hangs the other 50%?
On another machine it gets stuck once every 6 runs on average.
Go to https://hub.gke2.mybinder.org/user/jupyterlab-jupyterlab-demo-0570jy0h/lab/tree/demo and run the jupyter notebook called "main.ipynb". I've run the only cell in it several times and now it's stuck.
You can see the corresponding task at https://demoapp.trains.allegro.ai/projects/98cdb5ace38946a690daec8efd668a76/experiments/db0088e1207440319d80acc3ac89aafb/execution?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=-last_update . There are now two tasks in a "Running" state, one of them is generated by another run that I've made in a terminal and terminated.
EcstaticGoat95 I can see the experiment but I cannot access the notebook (I get an error).
Is this the exact script as here? https://clearml.slack.com/archives/CTK20V944/p1636536308385700?thread_ts=1634910855.059900&cid=CTK20V944
EcstaticGoat95 any chance you have an idea on how to reproduce? (even 1 out of 6 is a good start)
Thank you EcstaticGoat95 !
Thanks SolidSealion72 !
Also, I found out that adding "pool.join()" after pool.close() seems to solve the issue in the minimal example.
This is interesting, I'm pretty sure it has something to do with the subprocess not "closing" properly (or too fast or something)
Let me see if I can reproduce
AgitatedDove14 I managed to reproduce on Ubuntu (but not on Windows):
Not every run gets stuck, sometimes it's 1 in 10 runs that gets stuck.
AgitatedDove14 Also, I found out that adding "pool.join()" after pool.close() seem to solve the issue in the minimal example.
SolidSealion72 EcstaticGoat95 I'm hoping the issue is now resolved 🤞
can you verify with ?
pip install git+
SolidSealion72 I'm able to reproduce, hurrah!
(and a fix is already being tested, I will keep you guys updated)
Yes, that seems to solve the issue. Thanks.
Tested with clearml 1.1.3 and I could not reproduce the issue 👍