Ok no, it only helps as long as I don't log the figures. If I log the figures, I will still run into the same problem.
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the memory leak.
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, and scheduler every epoch.
2. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (a rough sketch of this watcher follows below).
3. Have a service that periodically pulls failed experiments from the queue with the TO_RESUME tag, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param.
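A rough sketch of the watcher thread in step 2, assuming an AWS EC2 spot instance (the metadata URL is the standard spot interruption endpoint, TO_RESUME is the tag from the list above, everything else is illustrative):

import threading
import time

import requests
from clearml import Task

# AWS EC2 exposes a spot interruption notice here; it returns 404 until the instance is marked for stop.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(poll_seconds=30):
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                # Tag the running experiment so the external "resume" service can find it later
                Task.current_task().add_tags(["TO_RESUME"])
                return
        except requests.RequestException:
            pass  # metadata endpoint not reachable, retry on the next poll
        time.sleep(poll_seconds)

threading.Thread(target=watch_for_interruption, daemon=True).start()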
Yes, that was my assumption as well. There could be several causes to be honest, now that I see that matplotlib itself is also leaking.
RuntimeError: CUDA error: no kernel image is available for execution on the device
For me it is definitely reproducible, but the codebase is quite large and I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
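The elided loop body is where each figure gets reported, which is the step where the problem shows up according to the earlier message. A hypothetical version of that loop (title and series names are made up, not from the original code) would look something like:

# Hypothetical variant of the loop above with the figure actually reported to ClearML
logger = task.get_logger()
for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    logger.report_matplotlib_figure(title="debug", series="leak-repro", figure=fig, iteration=i)
    plt.close(fig)  # close the figure so matplotlib itself does not accumulate open figures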
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit.
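For context, that argument goes into Task.init; a minimal sketch of the setup being tested (project and task names here are illustrative):

from clearml import Task

# Disable the matplotlib and joblib bindings while chasing the leak
task = Task.init(
    project_name="Debug memory leak",       # same project as the repro above
    task_name="no-framework-autoconnect",   # illustrative name
    auto_connect_frameworks={"matplotlib": False, "joblib": False},
)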
DeterminedCrab71 This is the behaviour of holding shift while selecting in Gmail, if ClearML could reproduce this, that would be perfect!
oh, seems like it is not synced, thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cuda117 on x86, so they use the default pip one
Yes, so indeed they don't provide support for earlier CUDA versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117 - that is what the clearml-agent was doing before.
With a large enough number of iterations in the for loop, you should see the memory grow over time
Adding back ClearML logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious.
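For reference, this is the backend switch meant above - a minimal sketch; the backend is conventionally selected before pyplot is imported:

import matplotlib
matplotlib.use("agg")  # non-interactive backend, no GUI event loop

import matplotlib.pyplot as plt  # imported after the backend is set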
automatically promote models to be served from within clearml
Yes!
So the wheel that was working for me was this one: [torch-1.11.0+cu115-cp38-cp38-linux_x86_64.whl](https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl)
Hi AgitatedDove14, I investigated further and got rid of a separate bug. I was able to get ignite's events fired, but still no scalars logged.
There is definitely something wrong going on with the reporting of scalars when using multiple processes, because if my ignite callback is the following:
def log_loss(engine):
    idist.barrier()  # sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
I fixed it, will push a fix to pytorch-ignite.
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
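For context, the only hook I'm aware of would be pinning the requirement before Task.init - a sketch assuming Task.add_requirements is the right mechanism here (not confirmed), with the version spec mirroring the cu115 wheel above:

from clearml import Task

# Assumption: forcing the exact version spec makes the agent resolve the +cu115 wheel.
# Must be called before Task.init so it ends up in the recorded requirements.
Task.add_requirements("torch", "==1.11.0+cu115")

task = Task.init(project_name="my-project", task_name="pin-torch-wheel")  # illustrative names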
That said, you might have accessed the artifacts before any of them were registered
I called task.wait_for_status() to make sure the task is done
Thanks AgitatedDove14 !
Could we add this task.refresh() to the docs? Might be helpful for other users as well.
OK! Maybe there is a middle ground: for artifacts already registered, simply return the entry; for artifacts that don't exist yet, contact the server to retrieve them.
This is the issue, I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object.
That sounds awesome! It will definitely fix my problem.
In the meantime, I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output'
... Was that normal? Will it work if I replace the second line with task.refresh()?
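For completeness, the variant I have in mind is below - a sketch that assumes task.reload() is the call meant by "refresh" above and that it picks up artifacts registered server-side, which is exactly the open question:

task.wait_for_status()   # block until the task reaches a final state
task.reload()            # refresh the cached task data from the server
artifact = task.artifacts["output"]
local_path = artifact.get_local_copy()  # or artifact.get() to deserialize the object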
I have CUDA 11.0 installed, but on another machine with 11.0 installed as well, trains downloads torch for CUDA 10.1. I guess this is because no wheel exists for torch==1.3.1 and CUDA 11.0.
mmmh good point actually, I didn't think about it
Because it lives behind a VPN and GitHub workers don't have access to it
I'd like to move to a setup where I don't need these tricks