No worries, I'm just glad you managed to figure the source of the issue 🙂
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seemed to be the memory leak and it looks like there is no problem from clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
Hey @<1523701066867150848:profile|JitteryCoyote63> , could you please open a GH issue on our repo too, so that we can more effectively track this issue. We are working on it now btw
Update: I successfully isolated one of the causes: a memory leak in matplotlib itself. I opened an issue on their repo here
Is it exactly agg, or something different?
Ok so what is the value that is set when it is run by the agent? agg?
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
If the agent is running it, of course it does 🙂 otherwise where is the automagic? It would break the moment you run it on a remote machine
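(For reference, the active backend can be checked directly; the exact value depends on how the process was started, so this is just a quick diagnostic:)

```python
import matplotlib

# Print the currently selected backend; in a headless/agent run this is
# typically a non-interactive backend such as 'agg'.
print(matplotlib.get_backend())
```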
clearml doesn't change the matplotlib backend under the hood, right? Just making sure 😄
Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
Yes, that was my assumption as well. There could be several causes, to be honest, now that I see that matplotlib itself is also leaking 😄
Thanks that helps!
Maybe we just create a copy of the plot that gets "stuck"
Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing too suspicious
Ok, interestingly, using matplotlib.use('agg') it doesn't leak (idea from here)
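(For context, a minimal sketch of forcing the non-interactive Agg backend; the call has to happen before pyplot is imported, otherwise the default backend is already selected:)

```python
import matplotlib
matplotlib.use("agg")  # select the non-interactive Agg backend; call before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
plt.close(fig)  # with Agg there is no GUI event loop holding extra references
```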
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
With a large enough number of iterations in the for loop, you should see the memory grow over time
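(For illustration, a matplotlib-free sketch of how that kind of growth shows up: held below is a hypothetical stand-in for whatever keeps one reference per iteration, and the stdlib tracemalloc tracks the allocations.)

```python
import gc
import tracemalloc

tracemalloc.start()

held = []  # hypothetical stand-in for a component keeping one reference per iteration
baseline = tracemalloc.get_traced_memory()[0]

for i in range(5):
    held.append(bytearray(10**6))  # ~1 MB "figure" per iteration
    gc.collect()                   # gc cannot free objects that are still referenced
    current = tracemalloc.get_traced_memory()[0]
    print(f"iter {i}: +{(current - baseline) / 1e6:.1f} MB")
```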
Disclaimer: I didn't check that this snippet reproduces the bug, but it has all the components that should: a for loop creating figures and clearml logging them
For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    task.get_logger().report_matplotlib_figure("debug", "debug", iteration=i, figure=fig, report_image=True)
    plt.close("all")
Hmm okay, this does point to a mem leak, any chance this is reproducible?
If I manually call report_matplotlib_figure yes. If I don't (just create the figure), no mem leak
Ok, no, it only helps as long as I don't log the figure.
you mean if you create the matplotlib figure with no automagic connect, you still see the mem leak?
Ok, no, it only helps as long as I don't log the figures. If I log the figures, I will still run into the same problem
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
I think that somehow, somewhere, a reference to the figure is still alive, so plt.close("all") and gc cannot free the figure and memory ends up accumulating. I don't know where yet
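(A minimal sketch of that theory, with no matplotlib or clearml involved: Figure and held below are hypothetical stand-ins for a figure object and for wherever the extra reference might live, and a weakref probe shows whether gc can actually free the object.)

```python
import gc
import weakref

class Figure:  # hypothetical stand-in for a matplotlib figure
    pass

held = []  # hypothetical stand-in for a component keeping an extra reference

fig = Figure()
held.append(fig)
probe = weakref.ref(fig)

del fig          # drop the local reference, like closing the figure
gc.collect()
print(probe() is not None)  # True: the lingering reference keeps the object alive

held.clear()
gc.collect()
print(probe() is None)      # True: once the last reference is gone, gc frees it
```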
Hi @<1523701066867150848:profile|JitteryCoyote63>
I found a memory leak in Logger.report_matplotlib_figure
Are you sure this is the Logger's fault and not a matplotlib leak? I'm trying to think how we could create such a mem leak
wdyt?