Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the memory leak
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer and scheduler every epoch.
2. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (see the sketch below).
3. Have a service that periodically pulls failed experiments with the TO_RESUME tag from the queue, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra parameter.
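A rough sketch of the interruption-watcher thread from point 2, assuming AWS spot instances (hence the 169.254.169.254 metadata endpoint) and a hypothetical TO_RESUME tag that the rescheduling service looks for - adapt both to your setup:

import threading
import time

import requests
from clearml import Task

# AWS-specific endpoint; returns 404 until the instance is marked for interruption
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(task, poll_seconds=30):
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                # Tag the task so the rescheduling service knows to resume it
                task.add_tags(["TO_RESUME"])
                return
        except requests.RequestException:
            pass  # metadata endpoint unreachable, retry on the next poll
        time.sleep(poll_seconds)

threading.Thread(
    target=watch_for_interruption, args=(Task.current_task(),), daemon=True
).start()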
Yes, that was my assumption as well. There could be several causes to be honest, now that I see that matplotlib itself is also leaking
RuntimeError: CUDA error: no kernel image is available for execution on the device
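That error usually means the installed torch wheel was not built for the GPU's compute architecture. A quick diagnostic I find useful - just a sketch:

import torch

# Architectures the installed torch build was compiled for, e.g. ['sm_37', 'sm_50', ...]
print("Built for:", torch.cuda.get_arch_list())
# Compute capability of the local GPU, e.g. (8, 6) for sm_86
print("Device capability:", torch.cuda.get_device_capability(0))

If the device capability is not covered by the built arch list, the wheel does not match the GPU.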
For me it is definitely reproducible. But the codebase is quite large, so I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()  # a new figure is created on every iteration and never closed
    ...
DeterminedCrab71 This is the behaviour of holding shift while selecting in Gmail; if ClearML could reproduce this, that would be perfect!
oh, seems like it is not synced - thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cuda117 for x86, so they use the default pip one
Yes, so indeed they don't provide support for earlier CUDA versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117. Before, that is what the clearml-agent was doing
With a large enough number of iterations in the for loop, you should see the memory grow over time
Adding back ClearML logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious
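For reference, a standalone variant of the repro loop that closes each figure explicitly - pyplot keeps every figure alive in its internal registry until plt.close() is called, so this rules out that particular accumulation (a sketch, without any ClearML reporting):

import matplotlib
matplotlib.use("agg")  # headless backend, as in the tests above
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    plt.close(fig)  # release the figure so pyplot does not hold a reference to it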
automatically promote models to be served from within clearml
Yes!
So the wheel that was working for me was this one: [torch-1.11.0+cu115-cp38-cp38-linux_x86_64.whl](https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl)
Hi AgitatedDove14, I investigated further and got rid of a separate bug. I was able to get ignite's events to fire, but still no scalars are logged
There is definitely something wrong going on with the reporting of scalars when using multiple processes, because if my ignite callback is the following:
def log_loss(engine):
    idist.barrier()  # sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
I fixed it, and will push a fix to pytorch-ignite
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
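What I ended up doing, in case it is useful - a sketch, assuming Task.add_requirements is the right hook for this: pin the exact build, including the +cu115 local version tag, before Task.init so the agent records that requirement instead of resolving a wheel itself.

from clearml import Task

# Pin the exact torch build before Task.init; project/task names below are placeholders
Task.add_requirements("torch", "==1.11.0+cu115")
task = Task.init(project_name="examples", task_name="pin-torch-wheel")

The alternative would be pointing the agent's pip at the cu115 extra index in clearml.conf, but I have not verified that path.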
That said, you might have accessed the artifacts before any of them were registered
I called task.wait_for_status() to make sure the task is done
Thanks AgitatedDove14 !
Could we add this task.refresh() to the docs? It might be helpful for other users as well.
OK! Maybe there is a middle ground: for artifacts already registered, simply return the entry; for artifacts not yet registered, contact the server to retrieve them
This is the issue. I will make sure wait_for_status() calls reload at the end, so that when the function returns you have the updated object
That sounds awesome! It will definitely fix my problem
In the meantime, I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output' ... Is that normal? Will it work if I replace the second line with task.refresh() ?
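For anyone hitting the same thing, the pattern I converged on in the meantime (a sketch - task.reload() is, as far as I understand, what refreshes the local object from the server):

from clearml import Task

task = Task.get_task(task_id="<id of the task producing the artifact>")
task.wait_for_status()  # block until the remote task is done
task.reload()           # refresh the local object so newly registered artifacts show up
artifact = task.artifacts["output"].get()  # "output" is just the artifact name used above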
I have CUDA 11.0 installed, but on another machine with 11.0 installed as well, trains downloads torch for CUDA 10.1. I guess this is because no wheel exists for torch==1.3.1 and CUDA 11.0
mmmh good point actually, I didn't think about it
Because it lives behind a VPN and GitHub workers don't have access to it
I'd like to move to a setup where I don't need these tricks
is it different from Task.set_offline(True)?
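To clarify what I meant by that - my understanding of offline mode, so treat this as a sketch:

from clearml import Task

# Run the whole test without talking to any ClearML server
Task.set_offline(offline_mode=True)
task = Task.init(project_name="ci-tests", task_name="integration-test")  # placeholder names
# ... run the same code as in production ...
task.close()

# The recorded session zip can later be imported into a server if needed:
# Task.import_offline_session(session_folder_zip="<path to the offline session zip>")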
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
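What I am leaning towards for now - a sketch assuming the handler lives under ignite.contrib.handlers.clearml_logger (it may have moved in newer ignite versions): only rank 0 attaches the ClearMLLogger, so a single process reports scalars:

import ignite.distributed as idist
from ignite.contrib.handlers.clearml_logger import ClearMLLogger, OutputHandler
from ignite.engine import Events

def setup_clearml_logging(trainer):
    # Only the main process reports to ClearML; other ranks skip the logger entirely
    if idist.get_rank() == 0:
        logger = ClearMLLogger(project_name="my-project", task_name="ddp-run")  # placeholder names
        logger.attach(
            trainer,
            log_handler=OutputHandler(
                tag="training",
                output_transform=lambda loss: {"loss": loss},
            ),
            event_name=Events.ITERATION_COMPLETED(every=100),
        )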
and just run the same code I run in production
Even if I moved the GitHub workers internally, where they could have access to the prod server, I am not sure I would like that, because it would pile up unnecessary test data in the prod server
Hi SuccessfulKoala55, how can I know if I am logged in via this free access mode? I assume I am, since on the login page I only see a login field, not a password field