SuccessfulKoala55 They do have the right file path, e.g.: https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png
What I put in the clearml.conf is the following:
agent.package_manager.pip_version = "==20.2.3"
agent.package_manager.extra_index_url: [""]
agent.python_binary = python3.8
yes, exactly: I run python my_script.py, the script executes, creates the task, calls task.execute_remotely(exit_process=True), and returns to bash. Then, in the bash console, after some time, I see some messages being logged from clearml.
Nevertheless there might still be some value in that, because it would allow reducing the startup time by removing the initial setup of the agent and the downloading of the data to the instance. But not as much as I described initially, if stopped instances are bound to the same capacity limitations as newly launched instances.
ok, what is the 3.8 release? a server release? how does this number relate to the numbers above?
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
The rest of the configuration is set with env variables
When installed with http://get.docker.com, it works (docker was installed with sudo snap install docker).
might be worth documenting
Just tested locally, in the terminal it's the same: with the hack it works, without the hack it doesn't show the logger messages.
I tried removing type=str but I got the same problem
From my experience, I only installed the cuda drivers on my machines. I didn't use conda to install torch or cudatoolkit, I just let clearml-agent download the torch wheel file and install it.
AgitatedDove14 After investigation, another program on the machine consumed all the available memory, most likely causing the OS to kill the agent/task.
Thanks for the hack! The use case is the following: I have a controller that creates training/validation/testing tasks by cloning (so that the parent task id is properly set to the controller). Otherwise I could simply create these tasks with Task.init, but then I would need to set the parent task manually for each one of these tasks, probably with a similar hack, right?
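For reference, a rough sketch of that controller pattern (the template task id and the queue name are placeholders):

```python
from clearml import Task

# The controller task; cloned children will have it as their parent
controller = Task.init(project_name="my-project-name", task_name="controller")

# A pre-existing task used as a template (placeholder id)
template = Task.get_task(task_id="<template-task-id>")

for split in ("training", "validation", "testing"):
    # Clone the template and set the controller as the parent task
    child = Task.clone(source_task=template, name=f"{split} task", parent=controller.id)
    Task.enqueue(child, queue_name="default")
```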
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming.
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but only deleted ~20M documents out of the 320M documents in events-training_debug_image-xyz. I would now like to understand which experiments contain most of the documents, so I can delete them; i.e., I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
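Something like this is what I have in mind, a sketch using a terms aggregation (the ES endpoint, and the assumption that each event document stores the experiment id in a "task" field, would need checking against the index mapping):

```python
import requests

ES_HOST = "http://localhost:9200"          # assumption: adjust to your ES endpoint
INDEX = "events-training_debug_image-xyz"  # the index mentioned above

query = {
    "size": 0,  # no hits, only the aggregation
    "aggs": {
        "docs_per_experiment": {
            # assumption: each event document carries the experiment id in a "task" field
            "terms": {"field": "task", "size": 100}
        }
    }
}

resp = requests.post(f"{ES_HOST}/{INDEX}/_search", json=query)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["docs_per_experiment"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```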
For me it is definitely reproducible. But the codebase is quite large, I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
It indeed has the old commit, so they match, no problem actually
TimelyPenguin76 That sounds amazing! Will there be a fallback mechanism as well? p3.2xlarge instances are often in shortage, so it would be nice to define one resource requirement as first choice (e.g. p3.2xlarge) and, if not available, fall back to another resource requirement (e.g. g4dn).
So I installed docker, added the user to the group allowed to run docker (so as not to have to run with sudo, otherwise it fails), then ran these two commands and it worked.
As a quick fix, can you test with auto refresh (see the top right button with the pause sign you have in the video)?
That doesn't work unfortunately
I checked the commit date on the branch, went to all experiments, and scrolled until I found the experiment.
But clearml does read from env vars as well, right? It's not just delegating resolution to the aws cli, so it should be possible to specify the region to use for the logger, right?
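E.g., a sketch of what I'd expect to work, assuming clearml's S3 storage goes through boto3 (which reads AWS_DEFAULT_REGION from the environment):

```python
import os

# Assumption: clearml's S3 driver uses boto3, and boto3 resolves the region
# from AWS_DEFAULT_REGION when it is not set in clearml.conf
os.environ["AWS_DEFAULT_REGION"] = "eu-central-1"

from clearml import Task

task = Task.init(project_name="my-project-name", task_name="experiment_name")
```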
So the migration from one server to another + adding new accounts with password worked, thanks for your help!
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '...
To help you debug this: in the /dashboard endpoint, all projects were still there, but empty (no experiments inside). No experiments were archived either.
How about the overhead of running the training in docker on a VM?