Ok so what is the value that is set when it is run by the agent? agg?
If I manually call report_matplotlib_figure yes. If I don't (just create the figure), no mem leak
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
Hmm okay, this does point to a mem leak, any chance this is reproducible?
I think that somehow, somewhere, a reference to the figure is still alive, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
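One way to test the "something still holds a reference" theory is a weak reference: if the object is still reachable after it has been released and a collection has been forced, something else is keeping it alive. A minimal sketch using only the standard library; `survives_gc` and `referrer_types` are hypothetical helper names, not clearml or matplotlib APIs.

```python
import gc
import weakref


def survives_gc(make_obj, release=None):
    """Return the object if it is still alive after release + gc, else None.

    make_obj: zero-argument callable that creates the object under test.
    release:  optional callable that receives the object and should free it
              (for a matplotlib figure this could be plt.close).
    """
    obj = make_obj()
    ref = weakref.ref(obj)   # a weak reference does not keep obj alive
    if release is not None:
        release(obj)
    del obj                  # drop our own strong reference
    gc.collect()             # also clears unreachable reference cycles
    return ref()             # None means the object was freed


def referrer_types(obj):
    """Sorted type names of the objects that currently refer to obj."""
    return sorted({type(r).__name__ for r in gc.get_referrers(obj)})
```

With the reproduction snippet from this thread, something like `survives_gc(plot_data, plt.close)` would be the check to try, once with and once without the logging call inside `plot_data`. That usage is untested here; if it returns an object, `referrer_types` on that object gives a first hint about who is holding it.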
Disclaimer: I didn't check that this reproduces the bug, but these are all the components that should reproduce it: a for loop creating figures and clearml logging them
For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    task.get_logger().report_matplotlib_figure("debug", "debug", iteration=i, figure=fig, report_image=True)
    plt.close("all")
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seems to have been the memory leak, and it looks like there is no problem on the clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
Is it exactly agg or something different?
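The backend question can be answered empirically by having the script report the active backend itself, once locally and once when the agent runs it. A small sketch: `current_matplotlib_backend` is a made-up helper name, while `matplotlib.get_backend()` is the regular matplotlib call. No particular output is assumed here.

```python
def current_matplotlib_backend():
    """Return the active matplotlib backend name, or None if matplotlib is missing."""
    try:
        import matplotlib
    except ImportError:
        return None
    return matplotlib.get_backend()


if __name__ == "__main__":
    # Run this at the point in the training script where figures are created,
    # so it reflects any backend change made earlier in the process.
    print("matplotlib backend:", current_matplotlib_backend())
```

Comparing the printed value between a local run and an agent run shows whether the backend differs between the two.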
No worries, I'm just glad you managed to figure out the source of the issue 🙂
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
Ok, interestingly, using matplotlib.use('agg') it doesn't leak (idea from here)
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
if the agent is running it, of course it does 🙂 otherwise where is the automagic, it would break the moment you run it on a remote machine
Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
Yes, that was my assumption as well. To be honest, there could be several causes, now that I see that matplotlib itself is also leaking 😄
Update: I successfully isolated one of the reasons, a mem leak in matplotlib itself. I opened an issue on their repo here
Hey @<1523701066867150848:profile|JitteryCoyote63> , could you please open a GH issue on our repo too, so that we can track this more effectively? We are working on it now btw
Hi @<1523701066867150848:profile|JitteryCoyote63>
I found a memory leak in Logger.report_matplotlib_figure
Are you sure this is the Logger's fault and not a Matplotlib leak? I'm trying to think how we could create such a mem leak
wdyt?
Ok, no, it only helps as long as I don't log the figure.
You mean that if you create the matplotlib figure with no automagic connect, you still see the mem leak?
With a large enough number of iterations in the for loop, you should see the memory grow over time
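To put numbers on "memory grows over time" without an external profiler, the loop body can be wrapped in a small sampler. This is a rough sketch based on the standard-library `tracemalloc` module; `traced_memory_per_iteration` is a hypothetical helper, and `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries may be undercounted compared with what mprof reports.

```python
import gc
import tracemalloc


def traced_memory_per_iteration(step, iterations, sample_every=10):
    """Run step(i) repeatedly and sample the Python-level traced memory.

    Returns a list of (iteration, bytes_currently_allocated) tuples.
    """
    samples = []
    tracemalloc.start()
    try:
        for i in range(iterations):
            step(i)
            if i % sample_every == 0:
                gc.collect()  # sample only after collectable garbage is gone
                current, _peak = tracemalloc.get_traced_memory()
                samples.append((i, current))
    finally:
        tracemalloc.stop()
    return samples
```

Passing one iteration of the figure loop as `step` (create the figure, optionally log it, then close it) and comparing the first and last samples indicates whether traced memory keeps climbing. A flat curve here together with a rising mprof curve would point at memory that lives outside Python's allocator.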
Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious
Ok, no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem
Thanks that helps!
Maybe we just create a copy of the plot that gets "stuck"