JitteryCoyote63

214 Questions, 1021 Answers

Active since 10 January 2023

Last activity 7 months ago

Reputation

Badges 1

979 × Eureka!

0 Votes

30 Answers

1K Views

Hi Again, My Clearml Api-Server Is Having A Memory Leak. Each Time I Restart It, Its Ram Consumption Grows Until Getting Oom, Is Not Killed And Make The Ec2 Instance Crash

Hi again, my clearml api-server is having a memory leak. Each time I restart it, its ram consumption grows until getting OOM, is not killed and make the ec2 ...

clearml

3 years ago

0 Votes

23 Answers

941 Views

Hi, I Would Like To Bring Awareness

Hi, I would like to bring awareness to this issue, which impacts my work as I cannot install the older version of torch (1.11.0)

clearml

one year ago

0 Votes

17 Answers

1K Views

Hi, I Have Another Bug To Report For Clearml-Server 1.2 (Self Hosted) In The Console Logs Of An Experiments, I Cannot See The Latest Logs. Eg My Experiment Is Done, But I Can Only See The Logs Of To The Installation Of The Packages. If I Download The Log

Hi, I have another bug to report for clearml-server 1.2 (self hosted) In the console logs of an experiments, I cannot see the latest logs. Eg my experiment i...

clearml

2 years ago

0 Votes

27 Answers

1K Views

Hi There,

Hi there, I found a memory leak in Logger.report_matplotlib_figure . I was constantly running out of memory when training my models so I decided to spend som...

clearml

one year ago

0 Hi, I Face A Strange Behavior From The Clearml-Agent: It’s Running In Services Mode, Not In Docker Mode, Cpu Only. I Want To Execute Two Tasks On This Service Agent. One Works, The Other Always Fails After Being Enqueued And Picked By The Agent With The E

I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird

3 years ago

interestingly, it works on one machine, but not on another one

3 years ago

Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))

3 years ago

The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist

3 years ago

0 Hi, I Have An Agent That Is Running Two Experiments At The Same Time: One That Was Running For A Long Time (11H) And One That The Agent Picked Up Afterwards, While The First One Was Still Running. Context: I Have 3 Agents Up (Not In Docker Mode) And All O

yes

4 years ago

0 Hi, I Have Another Bug To Report For Clearml-Server 1.2 (Self Hosted) In The Console Logs Of An Experiments, I Cannot See The Latest Logs. Eg My Experiment Is Done, But I Can Only See The Logs Of To The Installation Of The Packages. If I Download The Log

I think it comes from the web UI of the version 1.2.0 of clearml-server, because I didn’t change anything else

2 years ago

So there are two possible cases for trains-agent-1:
- It picks a new experiment -> the "workers" tab randomly shows one of the two experiments
- There is no new experiment in the default queue to start -> the "workers" tab randomly shows either no experiment or the one that it is running

4 years ago

0 Hey There, I Moved The Clearml S3 Bucket Where I Stored All My Clearml Data From One S3 Bucket To Another And Now I Realized That All The Models/Experiments Logged In The Clearml-Server Still Refer To The Old S3 Bucket. Is There A Way To Update All The Re

Thanks a lot for the solution SuccessfulKoala55 ! I’ll try that if the solution “delete old bucket, wait for its name to be available, recreate it with the other aws account, transfer the data back” fails

3 years ago

0 Hi, I Attached An Iam Role To An Ec2 Instance To Grant Access To An S3 Bucket. The Ec2 Instance Is Running A Clearml-Agent (V1.1.0). I Didn’t Specify Any Key/Secret For Clearml. The Tasks Fail With The Following Error:

SuccessfulKoala55 I was able to make it work with use_credentials_chain: true in the clearml.conf and the following patch: https://github.com/allegroai/clearml/pull/478

3 years ago

same as the first one described

3 years ago

So it looks like the agent, from time to time, thinks it is not running an experiment

4 years ago

When an experiment on trains-agent-1 is finished, I randomly see either no experiment or the long experiment, and when two experiments are running, I randomly see one of the two experiments

4 years ago

by mistake I have two agents started on one machine

4 years ago

the latest version, but I think it's normal: I set the TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID, where DYNAMIC_INSTANCE_ID is the ID of the machine

4 years ago

yes SparklingHedgehong28 🙂

2 years ago

0 Hi There,

Hi AgitatedDove14 EnthusiasticShrimp49 , the issue above seemed to be the memory leak and it looks like there is no problem on the clearml side.
I trained successfully without a memory leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive 🙇

one year ago

0 Hi There,

I think that somehow somewhere a reference to the figure is still living, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
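The effect described here can be illustrated without matplotlib or clearml. The following is a minimal sketch using only the standard library; the `Figure` class and the `registry` list are hypothetical stand-ins for a real figure and for whatever hidden container might still hold it, not anything taken from the clearml or matplotlib code.

```python
import gc
import weakref


class Figure:
    """Hypothetical stand-in for a heavy object such as a matplotlib figure."""

    def __init__(self, size):
        self.data = bytearray(size)


def is_freed(keep_reference):
    """Create a Figure, drop our own reference, and report whether it was freed."""
    registry = []  # hypothetical hidden container (e.g. a cache) that may hold the figure
    fig = Figure(1024)
    probe = weakref.ref(fig)  # a weak reference does not keep the object alive
    if keep_reference:
        registry.append(fig)  # a second strong reference survives the `del` below
    del fig  # drop the only reference we control
    gc.collect()  # the collector cannot free an object that is still reachable
    return probe() is None


leaked = not is_freed(keep_reference=True)   # object survives: something still holds it
freed = is_freed(keep_reference=False)       # object is gone once no reference remains
print(f"leaked with hidden reference: {leaked}, freed without it: {freed}")
```

A `weakref.ref` probe like this can help narrow down where a lingering reference lives: if the probe still returns the object after closing figures and running `gc.collect()`, `gc.get_referrers` on that object lists the containers that still point to it.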

one year ago

0 Hi There,

Is it exactly agg or something different?

one year ago

0 Hi There,

Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit

one year ago

0 Hi There,

clearml doesn't change the matplotlib backend under the hood, right? Just making sure 😄

one year ago

0 Hi, How Does

There was no possible cache, the agent was running on a new ec2 instance

one year ago

0 Hi, Although

SuccessfulKoala55 I can try to make one, let’s see 🙂

3 years ago

0 Hi, Although

Does that mean that agents do not read this parameter?

3 years ago

0 Hi, Although

What will this parameter do?

3 years ago

0 Hi, Although

so the task they execute must have clearml installed?

3 years ago

0 Hi There,

Disclaimer: I didn't check this will reproduce the bug, but that's all the components that should reproduce it: a for loop creating figures and clearml logging them
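One way to check whether such a loop actually accumulates memory is to measure it with the standard library's `tracemalloc`. This is only a sketch under stated assumptions: `make_figure` and `_cache` are hypothetical stand-ins for "create a figure and log it" and for a hidden reference that keeps objects alive; neither matplotlib nor clearml is involved here.

```python
import tracemalloc

_cache = []  # hypothetical hidden reference that keeps objects alive


def make_figure(leak):
    """Hypothetical stand-in for creating and logging one figure."""
    fig = bytearray(100_000)
    if leak:
        _cache.append(fig)  # simulates a reference that is never released
    return fig


def memory_growth(iterations, leak):
    """Return how many traced bytes were gained over the whole loop."""
    _cache.clear()
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _step in range(iterations):
        make_figure(leak)  # the return value is discarded, as in a logging loop
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


leaky = memory_growth(50, leak=True)
clean = memory_growth(50, leak=False)
print(f"growth with hidden reference: {leaky} bytes, without: {clean} bytes")
```

If the growth scales with the number of iterations, something is holding on to the per-iteration objects; if it stays roughly flat, the loop itself is probably not the source of the leak.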

one year ago

Yes, I would like to update all references to the old bucket, unfortunately… I think I’ll simply delete the old s3 bucket, wait for its name to be available again, recreate it on the other aws account and move the data there. This way I don’t have to mess with clearml data - I am afraid of doing something wrong and losing data

3 years ago

0 Hi There,

For me it is definitely reproducible 😄 But the codebase is quite large, so I cannot share it. The gist is the following:

import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
  ...

one year ago

Hi SuccessfulKoala55 , will I be able to update all references to the old s3 bucket using this command?

3 years ago

0 Hi, Together With

Alright, I will try with that one

4 years ago
