AgitatedDove14 WOW, thanks a lot! I will dig into that!
It broke holding shift to select multiple experiments, btw
I am also interested in the clearml-serving part
meaning the REST API returns nothing, is that correct?
Yes exactly, this is the response from the api server when I try to scroll down on the console to get more logs
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats! Was this bug fixed in this new version?
Hi TimelyPenguin76,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16
v0.17.5rc2
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49>, the issue above seems to have been the memory leak, and it looks like there is no problem on the clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
Disclaimer: I didn't check that this reproduces the bug, but these are all the components that should trigger it: a for loop creating figures and clearml logging them
mmmh it fails, but if I connect to the instance and execute ulimit -n, I do see 65535,
while the tasks I send to this agent fail with:
OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
and from the task itself, I run:
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
which gives me in the logs of the task:
b'1024'
So nofiles is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work
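In case someone hits the same thing: rather than shelling out to ulimit, the nofile limit can be checked and bumped from inside the Python process itself (just a sketch; only the soft limit can be raised, and only up to the hard limit):
```
import resource

# current soft/hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# raise the soft limit up to the hard limit, for this process only
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE))
```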
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
I think that somehow, somewhere, a reference to the figure is still alive, so plt.close("all") and gc cannot free it and the figures end up accumulating. I don't know where yet
If I manually call report_matplotlib_figure, yes. If I don't (just create the figure), there is no mem leak
Ok no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem
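To test the "a reference is still alive" theory, this is the kind of quick check I mean (only a sketch, not taken from my codebase):
```
import gc

import matplotlib.pyplot as plt
from matplotlib.figure import Figure

plt.close("all")
gc.collect()

# count Figure objects that are still reachable after closing everything
live_figs = [o for o in gc.get_objects() if isinstance(o, Figure)]
print(f"live Figure objects: {len(live_figs)}")

if live_figs:
    # show what is still holding a reference to the first surviving figure
    print(gc.get_referrers(live_figs[0])[:3])
```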
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
Yes, that was my assumption as well; there could be several causes to be honest, now that I see that matplotlib itself is also leaking
RuntimeError: CUDA error: no kernel image is available for execution on the device
For me it is definitely reproducible. But the codebase is quite large, I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    # (the original snippet elided this part; per the discussion above,
    # the figure is logged to clearml and then closed)
    task.get_logger().report_matplotlib_figure("debug", "fig", iteration=i, figure=fig)
    plt.close(fig)
DeterminedCrab71 This is the behaviour of holding shift while selecting in Gmail; if ClearML could reproduce this, that would be perfect!
oh, seems like it is not synced, thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cuda117 for x86, they use the default pip one
Yes, so indeed they don't provide support for earlier cuda versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117; that is what the clearml-agent was doing before.
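For reference, the kind of setup I mean (just a sketch; I am assuming the agent.package_manager.extra_index_url option in the agent's clearml.conf, and the cu115 URL is the standard PyTorch wheel index):
```
# agent's clearml.conf (sketch, not my actual config)
agent.package_manager.extra_index_url: ["https://download.pytorch.org/whl/cu115"]
```
and then pinning torch==1.11.0+cu115 in the task requirements should resolve against that index.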
With a large enough number of iterations in the for loop, you should see the memory grow over time
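If it helps, a quick way to watch the growth without mprof (a sketch; psutil is my assumption here, it is not part of the original snippet):
```
import psutil
from tqdm import tqdm

proc = psutil.Process()
for i in tqdm(range(1000), total=1000):
    fig = plot_data()  # same helper as in the snippet above
    if i % 100 == 0:
        # resident set size of this process, in MB
        print(f"iter={i} rss={proc.memory_info().rss / 1e6:.1f} MB")
```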
Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious
automatically promote models to be served from within clearml
Yes!
Hi AgitatedDove14, I investigated further and got rid of a separate bug. I was able to get ignite's events fired, but still no scalars logged
There is definitely something wrong going on with the reporting of scalars when using multiple processes, because if my ignite callback is the following:
def log_loss(engine):
    idist.barrier()  # sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
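For context, a minimal version of the kind of callback I am describing, but guarded so that only rank 0 reports (just a sketch; I am assuming the loss value is available as engine.state.output):
```
import ignite.distributed as idist
from clearml import Task

def log_loss(engine):
    idist.barrier()  # sync all processes before reporting
    if idist.get_rank() == 0:
        Task.current_task().get_logger().report_scalar(
            title="loss",
            series="train",
            value=engine.state.output,
            iteration=engine.state.iteration,
        )
```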