No worries, I'm just glad you managed to figure the source of the issue 🙂
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seemed to be the memory leak and it looks like there is no problem from clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
Hey @<1523701066867150848:profile|JitteryCoyote63> , could you please open a GH issue on our repo too, so that we can more effectively track this issue. We are working on it now btw
Update: I successfully isolated one of the causes: a memory leak in matplotlib itself. I opened an issue on their repo here
Is it exactly agg, or something different?
Ok so what is the value that is set when it is run by the agent? agg?
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
If the agent is running it, of course it does 🙂 otherwise where is the automagic? It would break the moment you run it on a remote machine
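(For reference, the active backend can be checked directly; the exact value depends on how the process was started, so this is just a quick diagnostic:)

```python
import matplotlib

# Print the currently selected backend; in a headless/agent run this is
# typically a non-interactive backend such as 'agg'.
print(matplotlib.get_backend())
```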
clearml doesn't change the matplotlib backend under the hood, right? Just making sure 😄
Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
Yes, that was my assumption as well. There could be several causes, to be honest, now that I see that matplotlib itself is also leaking 😄
Thanks that helps!
Maybe we just create a copy of the plot that gets "stuck"
Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing too suspicious
Ok, interestingly, using matplotlib.use('agg') it doesn't leak (idea from here)
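(For context, a minimal sketch of forcing the non-interactive Agg backend; the call has to happen before pyplot is imported, otherwise the default backend is already selected:)

```python
import matplotlib
matplotlib.use("agg")  # select the non-interactive Agg backend; call before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
plt.close(fig)  # with Agg there is no GUI event loop holding extra references
```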
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
With a large enough number of iterations in the for loop, you should see the memory grow over time
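(For illustration, a matplotlib-free sketch of how that kind of growth shows up: held below is a hypothetical stand-in for whatever keeps one reference per iteration, and the stdlib tracemalloc tracks the allocations.)

```python
import gc
import tracemalloc

tracemalloc.start()

held = []  # hypothetical stand-in for a component keeping one reference per iteration
baseline = tracemalloc.get_traced_memory()[0]

for i in range(5):
    held.append(bytearray(10**6))  # ~1 MB "figure" per iteration
    gc.collect()                   # gc cannot free objects that are still referenced
    current = tracemalloc.get_traced_memory()[0]
    print(f"iter {i}: +{(current - baseline) / 1e6:.1f} MB")
```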
Disclaimer: I didn't check that this snippet reproduces the bug, but it has all the components that should: a for loop creating figures and clearml logging them
For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    task.get_logger().report_matplotlib_figure("debug", "debug", iteration=i, figure=fig, report_image=True)
    plt.close("all")
Hmm okay, this does point to a mem leak, any chance this is reproducible?
If I manually call report_matplotlib_figure yes. If I don't (just create the figure), no mem leak
Ok, no, it only helps as long as I don't log the figure.
you mean if you create the matplotlib figure with no automagic connect, you still see the mem leak?
Ok, no, it only helps as long as I don't log the figures. If I log the figures, I will still run into the same problem
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
I think that somehow, somewhere, a reference to the figure is still alive, so plt.close("all") and gc cannot free the figure and memory ends up accumulating. I don't know where yet
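(A minimal sketch of that theory, with no matplotlib or clearml involved: Figure and held below are hypothetical stand-ins for a figure object and for wherever the extra reference might live, and a weakref probe shows whether gc can actually free the object.)

```python
import gc
import weakref

class Figure:  # hypothetical stand-in for a matplotlib figure
    pass

held = []  # hypothetical stand-in for a component keeping an extra reference

fig = Figure()
held.append(fig)
probe = weakref.ref(fig)

del fig          # drop the local reference, like closing the figure
gc.collect()
print(probe() is not None)  # True: the lingering reference keeps the object alive

held.clear()
gc.collect()
print(probe() is None)      # True: once the last reference is gone, gc frees it
```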
Hi @<1523701066867150848:profile|JitteryCoyote63>
I found a memory leak in Logger.report_matplotlib_figure
Are you sure this is the Logger's fault and not a matplotlib leak? I'm trying to think how we could create such a mem leak
wdyt?