Hi SuccessfulKoala55 , super, that’s what I was looking for
SuccessfulKoala55 I found the issue thanks to you: I changed the domain a bit but didn’t update the apiserver.auth.cookies.domain setting - I updated it, restarted, and now it works 🙂 Thanks!
The task is created using Task.clone() yes
AgitatedDove14 , my “uncommitted changes” ends with...

if __name__ == "__main__":
    task = clearml.Task.get_task(clearml.config.get_remote_task_id())
    task.connect(config)
    run()
from clearml import Task
Task.init()
because I cannot locate libcudart or because cudnn_version = 0?
Thanks! Unfortunately still not working, here is the log file:
SuccessfulKoala55 For the past 2 hours I have been getting 504 errors and I cannot ssh into the machine. AWS reports that the instance health checks fail. Is it safe to restart the instance?
(Btw the instance listed in the console has no name, is that normal?)
Still getting the same error, it is not taken into account 🤔
and this works. However, without the trick from UnevenDolphin73 , the following won’t work (returns None):

if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()
from clearml import Task
Task.init()
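For reference, a minimal sketch of how the two approaches could be combined into one helper (just a sketch; config and run() are stand-ins, not the real code):

from clearml import Task
import clearml.config

config = {"lr": 0.001}  # stand-in for the real config dict

def run():
    pass  # stand-in for the real entry point

def get_task_handle():
    # Locally, Task.current_task() returns the task created by Task.init().
    # Under the agent it can still be None at this point, so fall back to
    # fetching the task by the remote task id the agent sets for the process.
    task = Task.current_task()
    if task is None:
        remote_id = clearml.config.get_remote_task_id()
        if remote_id:
            task = Task.get_task(task_id=remote_id)
    return task

if __name__ == "__main__":
    task = get_task_handle()
    if task is not None:
        task.connect(config)
    run()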
CostlyOstrich36 , this also happens with clearml-agent 1.1.1 on an AWS instance…
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
edited the aws_auto_scaler.py - actually I think it’s just a typo, I just need to double the brackets
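(For context, “doubling the brackets” presumably means escaping literal braces in the str.format() template used for the extra bash script - a generic illustration, not the actual autoscaler code:)

# Literal braces in a str.format() template must be doubled ({{ }}),
# otherwise format() treats them as placeholders
template = "queues = {queues}\nbash: echo {{HOME}}"
print(template.format(queues="default"))
# queues = default
# bash: echo {HOME}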
For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm
task = Task.init("Debug memory leak", "reproduce")
def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
SuccessfulKoala55 They do have the right filepath, e.g.: https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png
For some reason the configuration object gets updated at runtime to

resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
I will let the team answer you on that one 🙂
ProxyDictPostWrite._to_dict() will recursively convert to a plain dict, and then OmegaConf will not complain
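Something along these lines (just a sketch; the project/task names and parameter values are made up):

from clearml import Task
from omegaconf import OmegaConf

task = Task.init(project_name="Debug", task_name="omegaconf connect")
params = {"lr": 0.001, "optimizer": {"name": "adam", "eps": 1e-8}}
connected = task.connect(params)  # comes back as a ProxyDictPostWrite wrapper

# OmegaConf.create() rejects the proxy type, so convert it back to a
# plain nested dict first
cfg = OmegaConf.create(connected._to_dict())
print(OmegaConf.to_yaml(cfg))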
I think that somehow, somewhere, a reference to the figure is still alive, so plt.close("all") and gc cannot free it and it ends up accumulating. I don't know where yet
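One way to poke at that hypothesis (a standalone sketch; the Agg backend and the toy plot are just for the test, not the real codebase):

import gc
import matplotlib
matplotlib.use("Agg")  # headless backend, only for the standalone test
import matplotlib.pyplot as plt

def plot_data():
    fig, ax = plt.subplots(1, 1)
    ax.plot([0, 1], [0, 1])
    return fig

fig = plot_data()
plt.close(fig)  # same effect as plt.close("all") when only one figure is open
# anything listed here besides this frame's locals is what keeps the figure
# alive after close(); running this inside the reporting loop shows whether
# the number of referrers grows over iterations
print([type(r) for r in gc.get_referrers(fig)])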
/data/shared/miniconda3/bin/python /data/shared/miniconda3/bin/clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
So I created a symlink in /opt/train/data -> /data