Update: I successfully isolated one of the reasons, a mem leak in matplotlib itself. I opened an issue on their repo here
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
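For reference, passing that flag looks roughly like this (a minimal sketch; the project/task names here are placeholders, not my actual ones):
from clearml import Task
# disable only the matplotlib and joblib auto-logging bindings;
# everything else stays automagically connected
task = Task.init(
    project_name="Debug memory leak",
    task_name="no-matplotlib-binding",
    auto_connect_frameworks={'matplotlib': False, 'joblib': False},
)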
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
If the agent is running it, of course it does 🙂 otherwise where is the automagic? It would break the moment you run it on a remote machine
Ok no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seemed to be the memory leak, and it looks like there is no problem on the clearml side.
I trained successfully without a mem leak with num_workers=0, and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
Hey @<1523701066867150848:profile|JitteryCoyote63> , could you please open a GH issue on our repo too, so that we can track this issue more effectively? We are working on it now btw
Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
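(For reference, forcing the backend looks like this - a minimal sketch; calling matplotlib.use before the first pyplot import makes sure the backend applies from the start:)
import matplotlib
matplotlib.use('agg')  # switch to the non-interactive Agg backend
import matplotlib.pyplot as plt  # imported after matplotlib.use so the backend is already set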
Ok so what is the value that is set when it is run by the agent? agg?
Ok no, it only helps as long as I don't log the figure.
You mean if you create the matplotlib figure with no automagic connect, you still see the mem leak?
With a large enough number of iterations in the for loop, you should see the memory grow over time
No worries, I'm just glad you managed to figure the source of the issue 🙂
Ok interestingly, using matplotlib.use('agg') it doesn't leak (idea from here)
Is it exactly agg or something different?
For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    task.get_logger().report_matplotlib_figure("debug", "debug", iteration=i, figure=fig, report_image=True)
    plt.close("all")
Hi @<1523701066867150848:profile|JitteryCoyote63>
I found a memory leak in Logger.report_matplotlib_figure
Are you sure this is not a Matplotlib leak but the Logger's fault? I'm trying to think how we could create such a mem leak
wdyt?
I think that somewhere a reference to the figure is still alive, so plt.close("all") and gc cannot free the figure, and it ends up accumulating. I don't know where yet
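One quick way to test that hypothesis would be to ask gc which Figure objects survive plt.close("all") and who still references them (a hedged sketch, I haven't wired this into the snippet yet):
import gc
from matplotlib.figure import Figure

gc.collect()
# figures that are still alive after plt.close("all")
leaked = [o for o in gc.get_objects() if isinstance(o, Figure)]
print(len(leaked), "figures still alive")
for f in leaked[:3]:
    # print the types of the objects still holding a reference to each figure
    print([type(r) for r in gc.get_referrers(f)])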
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
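(In case someone wants to grab the same numbers from Python instead of the mprof CLI, memory_profiler also has a memory_usage helper - a rough sketch, assuming the reporting loop above is wrapped in a run_loop() function:)
from memory_profiler import memory_usage

def run_loop():
    ...  # the figure-reporting loop from the snippet above

# sample the process memory every 0.5s while run_loop executes
samples = memory_usage((run_loop, (), {}), interval=0.5)
print("start: %.1f MiB, end: %.1f MiB" % (samples[0], samples[-1]))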
Thanks that helps!
Maybe we just create a copy of the plot that gets "stuck"
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
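(i.e. roughly this - the same snippet with the Task init and the report_matplotlib_figure call stripped out:)
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

# same loop, no clearml reporting at all - the memory still grows
for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    plt.close("all")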
Hmm okay, this does point to a mem leak, any chance this is reproducible?
clearml doesn't change the matplotlib backend under the hood, right? Just making sure 😄
If I manually call report_matplotlib_figure, yes. If I don't (just create the figure), no mem leak
Disclaimer: I didn't check that this reproduces the bug, but those are all the components that should reproduce it: a for loop creating figures and clearml logging them
Adding back clearml logging with matplotlib.use('agg') uses more ram, but nothing that suspicious
Yes, that was my assumption as well; there could be several causes, to be honest, now that I see that matplotlib itself is also leaking 😄