Would you like me to open an issue for that or will you fix it?
That would be amazing!
AgitatedDove14 WOW, thanks a lot! I will dig into that 🚀
It broke holding shift to select multiple experiments btw
I am also interested in the clearml-serving part 😄
meaning the REST API returns nothing, is that correct?
Yes exactly, this is the response from the api server when I try to scroll down on the console to get more logs
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats 🚀 🙌 Was this bug fixed with this new version?
Hi TimelyPenguin76,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16
v0.17.5rc2
This is what I get with mprof on this snippet above (I killed the program after the bar reaches 100%, otherwise it hangs trying to upload all the figures)
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seems to be the memory leak and it looks like there is no problem on the ClearML side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive 🙇
Disclaimer: I didn't check that this will reproduce the bug, but these are all the components that should reproduce it: a for loop creating figures and clearml logging them
mmmh it fails, but if I connect to the instance and execute ulimit -n, I do see 65535
while the tasks I send to this agent fail with: OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
and from the task itself, I run:
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
Which gives me in the logs of the task: b'1024'
So nofiles is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work
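Before rebooting, one thing I could try from inside the task is raising the soft limit up to the hard limit at runtime. Untested sketch, assuming the hard limit the agent process inherits is already 65535:
import resource

# Read the current soft/hard limits for open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile soft={soft} hard={hard}")

# Raise the soft limit to the hard limit; this cannot exceed whatever
# hard limit the agent/container process was started with
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))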
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
I think that somewhere a reference to the figure is still alive, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
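One way I could check that (rough, untested sketch) is to ask gc directly how many Figure objects are still reachable after closing them:
import gc
from matplotlib.figure import Figure

def count_live_figures():
    # Collect first so only genuinely reachable figures are counted
    gc.collect()
    return sum(1 for obj in gc.get_objects() if isinstance(obj, Figure))

# Call this after plt.close("all") inside the loop; if the count keeps
# growing, something is still holding references to the figures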
If I manually call report_matplotlib_figure yes. If I don't (just create the figure), no mem leak
Ok no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, and scheduler every epoch
2. Have a separate thread that periodically pulls the instance metadata and checks whether the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (rough sketch below)
3. Have a service that periodically pulls failed experiments with the tag TO_RESUME from the queue, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param
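Rough, untested sketch of the thread in step 2 - the metadata endpoint here is the EC2 spot interruption notice, and the function name and polling interval are just illustrative; adapt it to your cloud:
import threading
import time

import requests
from clearml import Task

# EC2 spot interruption endpoint; returns 404 until the instance is
# actually marked for stop
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(poll_seconds=30):
    task = Task.current_task()
    while True:
        try:
            if requests.get(SPOT_ACTION_URL, timeout=2).status_code == 200:
                task.add_tags(["TO_RESUME"])
                return
        except requests.RequestException:
            pass
        time.sleep(poll_seconds)

threading.Thread(target=watch_for_interruption, daemon=True).start()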
Yes that was my assumption as well, there could be several causes to be honest, now that I see that matplotlib itself is also leaking 😄
RuntimeError: CUDA error: no kernel image is available for execution on the device
For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm
task = Task.init("Debug memory leak", "reproduce")
def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
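The "..." is where the logging happens in my real code; manually it would be roughly the report_matplotlib_figure call I mentioned above, something like this (my guess at the elided part, not the exact code):
from clearml import Logger

# Roughly what the "..." stands for inside the loop body:
Logger.current_logger().report_matplotlib_figure(
    title="debug", series="figure", figure=fig, iteration=i
)
plt.close("all")  # closing the figure does not stop the leak for me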
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
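i.e. the Task.init call in the snippet above becomes something like this (what I'm testing now):
from clearml import Task

# Disable the automatic matplotlib/joblib bindings at init time
task = Task.init(
    "Debug memory leak",
    "reproduce",
    auto_connect_frameworks={"matplotlib": False, "joblib": False},
)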
DeterminedCrab71 This is the behaviour of holding shift while selecting in Gmail, if ClearML could reproduce this, that would be perfect!