I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
interestingly, it works on one machine, but not on another one
Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist
I think it comes from the web UI of the version 1.2.0 of clearml-server, because I didn’t change anything else
So there are two possible cases for trains-agent-1:
- It picks a new experiment -> the "workers" tab randomly shows one of the two experiments
- There is no new experiment in the default queue to start -> the "workers" tab randomly shows either no experiment or the one that is running
Thanks a lot for the solution SuccessfulKoala55 ! I’ll try that if the solution “delete old bucket, wait for its name to be available, recreate it with the other aws account, transfer the data back” fails
SuccessfulKoala55 I was able to make it work with use_credentials_chain: true in the clearml.conf and the following patch: https://github.com/allegroai/clearml/pull/478
same as the first one described
So it looks like the agent, from time to time, thinks it is not running an experiment
When an experiment on trains-agent-1 is finished, I randomly see either no experiment or the long experiment, and when two experiments are running, I randomly see one of the two experiments
by mistake I have two agents started on one machine
the latest version, but I think it's normal: I set the TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID, where DYNAMIC_INSTANCE_ID is the ID of the machine
Hi AgitatedDove14 EnthusiasticShrimp49 , the issue above seems to have been a memory leak in my own code, and it looks like there is no problem on the clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
I think that somehow somewhere a reference to the figure is still living, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
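One way to check the "a reference is still living" theory is to keep only weak references to each figure and count how many survive after plt.close and a gc pass. A minimal sketch, not something from my real code: LiveTracker and demo_with_matplotlib are hypothetical names, and the Agg backend is an assumption so that it runs headless.

```python
import gc
import weakref


class LiveTracker:
    """Remember weak references to objects and report how many are still alive."""

    def __init__(self):
        self._refs = []

    def track(self, obj):
        # A weak reference does not keep obj alive by itself.
        self._refs.append(weakref.ref(obj))
        return obj

    def count_alive(self):
        gc.collect()  # break reference cycles before counting
        return sum(1 for ref in self._refs if ref() is not None)


def demo_with_matplotlib(n=5):
    """Hypothetical demo: one figure is deliberately kept in `stray`."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, an assumption for this sketch
    import matplotlib.pyplot as plt

    tracker, stray = LiveTracker(), []
    for i in range(n):
        fig, ax = plt.subplots(1, 1)
        tracker.track(fig)
        if i == 0:
            stray.append(fig)  # simulates the reference that is "still living"
        plt.close(fig)
    del fig, ax
    # First value: figures that survived; second: figures pyplot still manages.
    return tracker.count_alive(), plt.get_fignums()
```

If count_alive() keeps growing across iterations while plt.get_fignums() stays empty, something other than pyplot is holding on to the figures, and gc.get_referrers on one of the survivors can help find the holder.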
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
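For reference, this is roughly how that setting is passed. A sketch only: the project and task names are reused from the repro snippet in this thread, and the wrapper function name is made up for illustration.

```python
# Frameworks to exclude from ClearML's automatic logging hooks.
AUTO_CONNECT = {"matplotlib": False, "joblib": False}


def init_task_without_matplotlib_hooks():
    """Hypothetical helper: start a task with matplotlib/joblib auto-logging off."""
    from clearml import Task  # imported lazily so the module loads without clearml

    return Task.init(
        project_name="Debug memory leak",  # names taken from the repro snippet
        task_name="reproduce",
        auto_connect_frameworks=AUTO_CONNECT,
    )
```

With the matplotlib hook disabled, figures are no longer captured automatically, so any figure that should still appear in the UI would have to be reported explicitly through the task's logger.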
Disclaimer: I didn't check this will reproduce the bug, but that's all the components that should reproduce it: a for loop creating figures and clearml logging them
Yes, I would like to update all references to the old bucket, unfortunately… I think I’ll simply delete the old s3 bucket, wait for its name to be available again, recreate it on the other aws account and move the data there. This way I don’t have to mess with clearml data - I am afraid of doing something wrong and losing data
For me it is definitely reproducible 😄 But the codebase is quite large, I cannot share. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
Hi SuccessfulKoala55 , will I be able to update all references to the old s3 bucket using this command?