Answered
Hi there,
I found a memory leak in Logger.report_matplotlib_figure. I was constantly running out of memory when training my models, so I decided to spend some time investigating. Below are the free-memory graphs when I comment out the logging of the matplotlib figure with the clearml Logger and when I don't; the difference in memory evolution is clear.

Some context on my training code: this is pretty standard deep learning training code using pytorch-ignite. At the end of each validation iteration, I generate images of the model output using matplotlib and log them to clearml, so a lot of matplotlib figures can accumulate if they are not properly closed. I close them with plt.close("all").

UPD: I tried replacing report_matplotlib_figure with report_image (saving the figure as a PNG first with figure.savefig("plot.png")); same problem.
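For reference, this is roughly what the report_image variant looked like (a minimal sketch; the file name, titles and task name are just placeholders, not my actual code):

import matplotlib.pyplot as plt
from clearml import Task

task = Task.init("Debug memory leak", "report_image variant")

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Save the figure to a PNG and report the file instead of the figure object
fig.savefig("plot.png")
task.get_logger().report_image("debug", "debug", iteration=0, local_path="plot.png")
plt.close(fig)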
[images: free-memory graphs with and without figure logging]

  
  
Posted 2 years ago

Answers 27


Ok no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem.

  
  
Posted 2 years ago

I think that somewhere a reference to the figure is still alive, so plt.close("all") and gc cannot free it, and the figures end up accumulating. I don't know where yet.
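One way to check that (a rough sketch on my side, not something from the actual codebase): ask the garbage collector how many Figure objects are still alive after closing them.

import gc

import matplotlib.pyplot as plt
from matplotlib.figure import Figure

for i in range(10):
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])
    plt.close("all")
    gc.collect()
    # Count Figure instances that something somewhere still references
    alive = sum(1 for obj in gc.get_objects() if isinstance(obj, Figure))
    print(f"iteration {i}: {alive} figures still alive")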

  
  
Posted 2 years ago

This is what I get with mprof on the snippet above (I killed the program after the progress bar reached 100%, otherwise it hangs trying to upload all the figures).
[image: mprof memory profile of the snippet]
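(In case it helps someone reproduce: this is just the standard memory_profiler workflow, pip install memory_profiler, then mprof run <script> and mprof plot, nothing fancy.)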

  
  
Posted 2 years ago

If I manually call report_matplotlib_figure, yes. If I just create the figure without reporting it, no mem leak.

  
  
Posted 2 years ago

Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak

  
  
Posted 2 years ago

Ok, interestingly, using matplotlib.use('agg') it doesn't leak (idea from here).
[image: mprof memory profile with the agg backend]
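For context, this is roughly how I set it in the snippet (a minimal sketch; I call matplotlib.use('agg') before importing pyplot, which is the usual recommendation):

import matplotlib
matplotlib.use('agg')  # non-interactive backend, no GUI figure manager

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
t = np.arange(0., 5., 0.2)
ax.plot(t, t ** 2)
plt.close(fig)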

  
  
Posted 2 years ago

Is it exactly agg or something different?

  
  
Posted 2 years ago

it is agg

  
  
Posted 2 years ago

Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49>, regarding the memory leak issue above, it looks like there is no problem on the clearml side.
I trained successfully without a mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
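For reference, the num_workers change is just the standard DataLoader argument, along these lines (a rough sketch with a made-up dataset, not my actual training code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset, only here to illustrate the num_workers switch
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# num_workers=0 loads batches in the main process; num_workers=8 uses worker processes
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)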

  
  
Posted 2 years ago

Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit.
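For completeness, this is where the flag goes (a minimal sketch; project and task names are placeholders):

from clearml import Task

# Disable the matplotlib and joblib automagic bindings for this task
task = Task.init(
    project_name="Debug memory leak",
    task_name="no automagic",
    auto_connect_frameworks={'matplotlib': False, 'joblib': False},
)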

  
  
Posted 2 years ago

Hmm okay, this does point to a mem leak, any chance this is reproducible?

  
  
Posted 2 years ago

Hey @<1523701066867150848:profile|JitteryCoyote63> , could you please open a GH issue on our repo too, so that we can track this issue more effectively? We are working on it now, btw.

  
  
Posted 2 years ago

Hi @<1523701066867150848:profile|JitteryCoyote63>

I found a memory leak in Logger.report_matplotlib_figure

Are you sure this is the Logger's fault and not a Matplotlib leak? I'm trying to think how we could create such a mem leak.
wdyt?

  
  
Posted 2 years ago

Disclaimer: I didn't check that this reproduces the bug, but these are all the components that should be needed to reproduce it: a for loop creating figures and clearml logging them.

  
  
Posted 2 years ago

With a large enough number of iterations in the for loop, you should see the memory grow over time

  
  
Posted 2 years ago

Ok so what is the value that is set when it is run by the agent? agg ?
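A quick way to check from inside the task (assuming nothing exotic is going on):

import matplotlib
print(matplotlib.get_backend())  # prints the active backend name, e.g. 'agg' or 'TkAgg'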

  
  
Posted 2 years ago

Ok no, it only helps as long as I don't log the figure.

You mean if you create the matplotlib figure without the automagic connect, you still see the mem leak?

  
  
Posted 2 years ago

correct (i.e. not frontend)

  
  
Posted 2 years ago

Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why

  
  
Posted 2 years ago

clearml doesn't change the matplotlib backend under the hood, right? Just making sure

If the agent is running it, of course it does 🙂 otherwise where is the automagic? It would break the moment you run it on a remote machine.

  
  
Posted 2 years ago

clearml doesn't change the matplotlib backend under the hood, right? Just making sure 😄

  
  
Posted 2 years ago

No worries, I'm just glad you managed to figure the source of the issue 🙂

  
  
Posted 2 years ago

For me it is definitely reproducible 😄 But the codebase is quite large and I cannot share it. The gist is the following:

import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    task.get_logger().report_matplotlib_figure("debug", "debug", iteration=i, figure=fig, report_image=True)
    plt.close("all")
  
  
Posted 2 years ago

Thanks, that helps!
Maybe we just create a copy of the plot that gets "stuck"

  
  
Posted 2 years ago

Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing too suspicious.
[image: memory profile with clearml logging and the agg backend]

  
  
Posted 2 years ago

Yes, that was my assumption as well. To be honest, there could be several causes, now that I see that matplotlib itself is also leaking 😄

  
  
Posted 2 years ago

Update: I successfully isolated one of the reasons, a mem leak in matplotlib itself; I opened an issue on their repo here

  
  
Posted 2 years ago