Answered
Hi there,
I found a memory leak in Logger.report_matplotlib_figure. I was constantly running out of memory when training my models, so I spent some time investigating. Here is the free memory graph with the logging of the matplotlib figure through the clearml Logger commented out, and with it enabled. We can clearly see the difference in memory evolution.

Some context on my training code: this is pretty standard deep learning training code using pytorch-ignite. At the end of each validation iteration, I generate images of the model output using matplotlib and log them to clearml. So a lot of matplotlib figures can accumulate if they are not properly closed. I close them with plt.close("all").

UPD: I tried replacing report_matplotlib_figure with report_image (saving the figure as a PNG first with figure.savefig("plot.png")): same problem.
image
image

  
  
Posted 10 months ago

Answers 27


Update: I successfully isolated one of the reasons, a mem leak in matplotlib itself. I opened an issue on their repo here

  
  
Posted 10 months ago

Ok, interestingly, with matplotlib.use('agg') it doesn't leak (idea from here)
image

  
  
Posted 10 months ago

Ok no, it only helps as long as I don't log the figure.

You mean that if you create the matplotlib figure with no automagic connect, you still see the mem leak?

  
  
Posted 10 months ago

Ok, so what is the value that is set when it is run by the agent? agg?

  
  
Posted 10 months ago

With a large enough number of iterations in the for loop, you should see the memory grow over time
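A rough way to put numbers on "memory grows over time" without an external profiler is to sample tracemalloc inside the loop. This is only a sketch: measure_growth, leaky_step and clean_step are hypothetical helpers, the steps are plain-Python stand-ins for the figure creation and logging, and tracemalloc only sees allocations made through Python's allocators, so memory held by native code may not show up here the way it does in a process-level tool like mprof.

```python
import tracemalloc


def measure_growth(step, iterations=200, sample_every=50):
    """Run step(i) repeatedly and return traced memory (bytes) at each sample point."""
    tracemalloc.start()
    samples = []
    try:
        for i in range(iterations):
            step(i)
            if (i + 1) % sample_every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


# Hypothetical leaky step: every call keeps 10 kB alive in a module-level list.
_retained = []


def leaky_step(i):
    _retained.append(bytearray(10_000))


def clean_step(i):
    bytearray(10_000)  # allocated and released immediately


leaky = measure_growth(leaky_step)
clean = measure_growth(clean_step)
print("leaky samples (bytes):", leaky)
print("clean samples (bytes):", clean)
```

Swapping the stand-in step for the real plotting and logging call should show whether the samples keep climbing or flatten out.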

  
  
Posted 10 months ago

No worries, I'm just glad you managed to figure the source of the issue 🙂

  
  
Posted 10 months ago

For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share. The gist is the following:

import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init(project_name="Debug memory leak", task_name="reproduce")
logger = task.get_logger()

def plot_data():
    # Build a small figure with three curves
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

# Create, log, and close a figure on every iteration
for i in tqdm(range(1000)):
    fig = plot_data()
    logger.report_matplotlib_figure("debug", "debug", iteration=i, figure=fig, report_image=True)
    plt.close("all")
  
  
Posted 10 months ago

I think that somewhere a reference to the figure is still alive, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet

  
  
Posted 10 months ago

clearml doesn't change the matplotlib backend under the hood, right? Just making sure

if the agent is running it, of course it does 🙂 otherwise where is the automagic, it would break the moment you run it on a remote machine

  
  
Posted 10 months ago

Hmm okay, this does point to a mem leak, any chance this is reproducible?

  
  
Posted 10 months ago

Disclaimer: I didn't check that this reproduces the bug, but these are all the components that should reproduce it: a for loop creating figures and clearml logging them

  
  
Posted 10 months ago

Yes, that was my assumption as well. To be honest, there could be several causes, now that I see that matplotlib itself is also leaking 😄

  
  
Posted 10 months ago

Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seemed to be the memory leak and it looks like there is no problem from clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive 🙇

  
  
Posted 10 months ago

it is agg

  
  
Posted 10 months ago

Hey @<1523701066867150848:profile|JitteryCoyote63> , could you please open a GH issue on our repo too, so that we can track this issue more effectively? We are working on it now btw

  
  
Posted 10 months ago

Thanks that helps!
Maybe we just create a copy of the plot that gets "stuck"

  
  
Posted 10 months ago

Hi @<1523701066867150848:profile|JitteryCoyote63>

I found a memory leak in Logger.report_matplotlib_figure

Are you sure this is the Logger's fault and not a Matplotlib leak? I'm trying to think how we could create such a mem leak
wdyt?

  
  
Posted 10 months ago

clearml doesn't change the matplotlib backend under the hood, right? Just making sure 😄

  
  
Posted 10 months ago

Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact. It is running now, I will confirm in a bit
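For reference, this is roughly how I pass it, as a sketch rather than the exact code from my codebase. init_task is a hypothetical helper, the project and task names are the ones from the repro gist, and clearml is imported lazily so that nothing contacts the server until the function is called.

```python
# Passed to Task.init to switch off automatic capture for the listed frameworks.
AUTO_CONNECT = {"matplotlib": False, "joblib": False}


def init_task(project_name="Debug memory leak", task_name="reproduce"):
    """Create a ClearML task with the matplotlib and joblib hooks disabled."""
    from clearml import Task  # lazy import: requires clearml to be installed

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        auto_connect_frameworks=AUTO_CONNECT,
    )
```

Calling init_task() at the top of the training script replaces the plain Task.init call.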

  
  
Posted 10 months ago

Is it exactly agg or something different?

  
  
Posted 10 months ago

Ok no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem

  
  
Posted 10 months ago

This is what I get with mprof on this snippet above (I killed the program after the bar reaches 100%, otherwise it hangs trying to upload all the figures)
image

  
  
Posted 10 months ago

Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why

  
  
Posted 10 months ago

Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak

  
  
Posted 10 months ago

Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious
image

  
  
Posted 10 months ago

If I manually call report_matplotlib_figure yes. If I don't (just create the figure), no mem leak

  
  
Posted 10 months ago

correct (i.e. not frontend)

  
  
Posted 10 months ago
476 Views