Hi @<1523701066867150848:profile|JitteryCoyote63>
I found a memory leak in Logger.report_matplotlib_figure
Are you sure this is not a Matplotlib leak rather than the Logger's fault? I'm trying to think how we could create such a mem leak
wdyt?
Ok interestingly, using matplotlib.use('agg') it doesn't leak (idea from here )
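For anyone else trying this: the backend has to be selected before pyplot is imported, roughly like this (a minimal sketch, not taken from the actual codebase):

```python
import matplotlib
matplotlib.use('agg')  # select the non-interactive Agg backend BEFORE importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

# Backend names are compared case-insensitively here for robustness
backend = matplotlib.get_backend().lower()
plt.close(fig)
```

With Agg there is no GUI event loop holding on to figures, which is why it can behave differently memory-wise.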
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
if the agent is running it, of course it does 🙂 otherwise where is the automagic? It would break the moment you run it on a remote machine
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seemed to be the memory leak, and it looks like there is no problem on the clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
Disclaimer: I didn't check that this reproduces the bug, but those are all the components that should reproduce it: a for loop creating figures and clearml logging them
Ok so what is the value that is set when it is run by the agent? agg?
Ok, to be fair, I get the same curve even when I remove clearml from the snippet, not sure why
I think that somehow, somewhere, a reference to the figure is still alive, so plt.close("all") and gc cannot free it and memory ends up accumulating. I don't know where yet
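One way to probe for a stray reference is a weakref check: if the weak reference is dead after closing, deleting, and collecting, nothing else holds the figure. A minimal sketch (assumes the Agg backend so no GUI keeps figures alive):

```python
import gc
import weakref

import matplotlib
matplotlib.use('agg')  # non-interactive backend, no GUI event loop holding figures
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
probe = weakref.ref(fig)  # a weak reference does not keep the figure alive

plt.close("all")  # remove the figure from pyplot's internal registry
del fig, ax       # drop our own strong references
gc.collect()      # break any reference cycles (figure <-> canvas, etc.)

# True means the figure was actually freed; False means something still references it
collected = probe() is None
print(collected)
```

Running the same probe right after `report_matplotlib_figure` would show whether the logging path is what keeps the figure alive.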
If I manually call report_matplotlib_figure yes. If I don't (just create the figure), no mem leak
Ok no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem
Is it exactly agg or something different?
No worries, I'm just glad you managed to figure the source of the issue 🙂
Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
Yes, that was my assumption as well. There could be several causes to be honest, now that I see that matplotlib itself is also leaking 😄
Hmm okay, this does point to a mem leak, any chance this is reproducible?
Hey @<1523701066867150848:profile|JitteryCoyote63> , could you please open a GH issue on our repo too, so that we can more effectively track this issue. We are working on it now btw
Ok no, it only helps as long as I don't log the figure.
you mean if you create the matplotlib figure and there is no automagic connect, you still see the mem leak?
For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    task.get_logger().report_matplotlib_figure("debug", "debug", iteration=i, figure=fig, report_image=True)
    plt.close("all")
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
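For context, that flag is passed to Task.init to disable the matplotlib and joblib auto-logging bindings. A minimal config sketch (project/task names are placeholders, and this needs a reachable ClearML server to actually run):

```python
from clearml import Task

# Disable the matplotlib and joblib automagic bindings at init time,
# so figures are only uploaded when report_matplotlib_figure is called explicitly.
task = Task.init(
    project_name="Debug memory leak",   # placeholder
    task_name="reproduce",              # placeholder
    auto_connect_frameworks={'matplotlib': False, 'joblib': False},
)
```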
Update: I successfully isolated one of the reasons: a mem leak in matplotlib itself. I opened an issue on their repo here
clearml doesn't change the matplotlib backend under the hood, right? Just making sure 😄
With a large enough number of iterations in the for loop, you should see the memory grow over time
Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious
Thanks that helps!
Maybe we just create a copy of the plot that gets "stuck"