Here is the console with some errors
yes, in the code, I do:
task._wait_for_repo_detection()
REQS_TASK = ["torch==1.3.1", "pytorch-ignite @ git+ ", "."]
task._update_requirements(REQS_TASK)
task.execute_remotely(queue_name=args.queue, clone=False, exit_process=True)
Notice the last line should not have
--docker
Did you mean --detached?
I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)
That's what I thought, yeah. No problem, it was rather a question; if I encounter the need for that, I will adapt and open a PR 🙂
AgitatedDove14 The first time it installs and creates the cache for the env, the second time it fails with:
Applying uncommitted changes
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
clearml_agent: ERROR: Command '['/home/user/.clearml/venvs-builds.1/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsmncaxx45.txt']' returned non-zero exit status 1.
There is an example in the https://github.com/allegroai/clearml/blob/master/docs/datasets.md#workflow section of the link I shared above
That gave me
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda running python3
Building Task 94jfk2479851047c18f1fa60c1364b871 inside docker: ubuntu:18.04
Starting docker build
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
Now it starts, I'll see if this solves the issue
Also, what is the benefit of having index.number_of_shards = 1 by default for the metrics and the logs indices? Having more shards allows scaling and later moving them to separate nodes if needed - with the default heap size being 2 GB, it should be possible, no?
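For illustration, a minimal sketch of how one could raise the shard count for new indices via an index template. The template name, index pattern, and client setup here are assumptions, not what trains-server actually ships:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# Legacy index-template API: new indices matching the pattern get 3 primary shards
es.indices.put_template(
    name="events_metrics_template",        # hypothetical template name
    body={
        "index_patterns": ["events-*"],    # hypothetical pattern for the metrics indices
        "settings": {"index": {"number_of_shards": 3}},
    },
)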
since we removed "." from the requirements?
AgitatedDove14 Didn't work 🙁
Basically what I did is:
` if task_name is not None:
    project_name = parent_task.get_project_name()
    task = Task.get_task(project_name=project_name, task_name=task_name)
    if task is not None:
        return task `
Otherwise here I create the Task
This is the issue. I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object
That sounds awesome! It will definitely fix my problem 🙂
In the meantime, I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output' ... Is that normal? Will it work if I replace the second line with task.refresh()?
Looking at the source code, it seems like I should do:
data_processing_task._artifact_manager.flush()
to make sure I have the latest version of the artifacts in the task, right?
Thanks AgitatedDove14 !
Could we add this task.refresh() to the docs? Might be helpful for other users as well 🙂 OK! Maybe there is a middle ground: for artifacts already registered, simply return the entry, and for artifacts that do not exist yet, contact the server to retrieve them
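For reference, a minimal sketch of the workaround discussed above, using task.reload() to pull the latest state from the server before reading the artifact. The task id and the "output" artifact name are placeholders:
from clearml import Task

data_processing_task = Task.get_task(task_id="<task-id>")   # placeholder id
data_processing_task.wait_for_status()    # block until the task finishes
data_processing_task.reload()             # refresh the local object from the server
output = data_processing_task.artifacts["output"].get()   # the artifact entry should now be visible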
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs: since there is no limit by default, their size will grow forever, which doesn't sound ideal https://docs.docker.com/compose/compose-file/#logging
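As an illustration, this is roughly what such a limit could look like for one service in the compose file (the service name and the size values are just placeholders):
services:
  fileserver:              # placeholder service name, repeat for each service you want to cap
    logging:
      driver: "json-file"
      options:
        max-size: "10m"    # rotate the log file once it reaches 10 MB
        max-file: "3"      # keep at most 3 rotated files per container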
That said, you might have accessed the artifacts before any of them were registered
I called task.wait_for_status() to make sure the task is done
Not really: I just need to find the one that is compatible with torch==1.3.1
/opt/clearml/data/fileserver does not appear anywhere, sorry for the confusion - It's the actual location where the files are stored
Ok, by setting PyJWT==1.7.1 in the setup.py of the experiment, pip did not enforce the update
I checked the commit date and branch, went to all experiments, and scrolled until I found the experiment
AgitatedDove14 awesome! By "include it all" do you mean a wizard for azure and gcp?
MagnificentSeaurchin79 You could also just fork the tensorflow repo, make changes in a specific branch and specify your forked repo with your custom branch in the install_requires of your setup.py
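For illustration, a minimal setup.py sketch of that approach; the package metadata, fork URL, and branch name are placeholders:
from setuptools import setup, find_packages

setup(
    name="my_experiment",                  # placeholder package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # PEP 508 direct reference to your fork, pinned to your custom branch
        "tensorflow @ git+https://github.com/<your-user>/tensorflow.git@my-custom-branch",
    ],
)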
Bottom line is: trains-server uses the elasticsearch image http://docker.elastic.co/elasticsearch/elasticsearch:5.6.16 which does not have an unlimited license (only a free license that expires after some time). From version 6.3, elasticsearch provides an unlimited free license. Trains should use >=6.3, WDYT?
For me it is definitely reproducible 🙂 But the codebase is quite large, so I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...



