Hoo, I found:
user@trains-agent-1: ps -ax
5199 ?  Sl  29:25  python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ?  Sl  30:04  python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
I should also rename the /opt/trains/data/elastic_migrated_2020-08-11_15-27-05 folder to /opt/trains/data/elastic before running the migration tool, right?
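To spell out what I mean by the rename (just a sketch, assuming the default /opt/trains/data layout, that the target folder doesn't already exist, and that whoever runs it has write permissions on /opt/trains/data):

from pathlib import Path

data_dir = Path("/opt/trains/data")
# Rename the migrated folder to the name I believe the migration tool expects
(data_dir / "elastic_migrated_2020-08-11_15-27-05").rename(data_dir / "elastic")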
AgitatedDove14 Same problem with clearml==1.1.5rc2, I also tried with backend==gloo, still the same problem
I created a snapshot of both disks
Hi AgitatedDove14 , Here is the full log.
Both python versions (local and remote) are python 3.6.
Locally (macOS), I get pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it's not really related to clearml-agent, rather pip cannot find the proper wheel for Ubuntu for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
AgitatedDove14 So what you are saying is that since I have trains-server 0.16.1, I should use trains>=0.16.1? And what about trains-agent? Only version 0.16 is released atm, and that is the one I use
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent I mentioned used the output from nvcc (2) before checking the nvidia driver version (1)
Also, this may be a separate issue but it could be linked: if I add Task.current_task().get_logger().flush(wait=True) like this:
import ignite.distributed as idist  # assuming PyTorch-Ignite's distributed helper is what provides idist here

def log_loss(engine):
    idist.barrier()
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.itera...
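Roughly, the flush call mentioned above sits at the end of that handler — a sketch only (the reporting arguments are copied from the snippet above, the rest is paraphrased):

from clearml import Task

def log_loss(engine):
    logger = Task.current_task().get_logger()
    logger.report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    logger.flush(wait=True)  # block until the reported values are actually sent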
And now that I restarted the server and went back into the project where I initially deleted the archived experiments, some of them are still there - I will leave them alone, too scared to do anything now
Thanks! Unfortunately still not working, here is the log file:
SuccessfulKoala55 For the past 2 hours I have been getting 504 errors and I cannot ssh into the machine. AWS reports that the instance health checks fail. Is it safe to restart the instance?
(Btw the instance listed in the console has no name, is that normal?)
Still getting the same error, it is not taken into account
and this works. However, without the trick from UnevenDolphin73, the following won't work (returns None):
if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()
from clearml import Task
Task.init()
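For reference, the behaviour I'm relying on here, as I understand it: Task.current_task() only returns a task once one exists in the current process (e.g. created by Task.init() or by the agent running the script remotely); otherwise it returns None. A minimal sketch, with placeholder project/task names:

from clearml import Task

# No task has been created in this process yet, so current_task() is None
assert Task.current_task() is None

# Placeholder names, for illustration only
Task.init(project_name="Debug", task_name="current_task demo")

# Now current_task() resolves to the task created by Task.init()
assert Task.current_task() is not None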
Thanks SuccessfulKoala55
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
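In case it helps anyone else hitting the same thing: doubling the brackets is just the standard str.format() escaping rule, literal braces in the template have to be written as {{ and }}. A tiny made-up example (not the actual auto-scaler template):

# "{queue}" is a placeholder to be substituted; "${{HOME}}" renders as the literal "${HOME}"
template = "docker run -e QUEUE={queue} bash -c 'echo ${{HOME}}'"
print(template.format(queue="default"))
# -> docker run -e QUEUE=default bash -c 'echo ${HOME}'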
For me it is definitely reproducible. But the codebase is quite large, I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm
task = Task.init("Debug memory leak", "reproduce")
def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
ok, now I actually remember why I used _update_requirements instead of add_requirements: the former overwrites everything, while the latter only adds to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment
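Roughly, the difference as I understand it (a sketch only; _update_requirements is a private method, so treat the exact signature with caution, and the package names below are placeholders):

from clearml import Task

# add_requirements() must be called before Task.init(); it only appends to
# whatever clearml auto-detects from the current environment / imports.
Task.add_requirements("some-package", "1.2.3")  # placeholder package

task = Task.init(project_name="Debug", task_name="requirements demo")  # placeholder names

# The private _update_requirements() replaces the requirements list entirely,
# which is what I want since setup.py already declares my dependencies.
task._update_requirements(["some-package==1.2.3"])  # placeholder list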