since we removed "." from the requirements?
AgitatedDove14 Didn’t work 😞
Basically what I did is:
`if task_name is not None:
    project_name = parent_task.get_project_name()
    task = Task.get_task(project_name=project_name, task_name=task_name)
    if task is not None:
        return task
# otherwise, create the Task here`
This is the issue. I will make sure wait_for_status() calls reload() at the end, so when the function returns you have the updated object
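As a caller-side sketch of the intended behavior once wait_for_status() reloads at the end (names assumed from the discussion above, not from the clearml source):
`from clearml import Task

task = Task.get_task(task_id="...")  # placeholder id
task.wait_for_status()               # after the fix, this reloads internally,
artifacts = task.artifacts           # so the artifact registry is already up to date`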
That sounds awesome! It will definitely fix my problem 🙂
In the meantime, I now do:
`task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()`
But I still get KeyError: 'output' ... Is that normal? Will it work if I replace the second line with task.refresh() ?
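For reference, the variant being asked about would look like this (a sketch; it uses reload(), which re-fetches the task from the server, since whether refresh() behaves the same is exactly the open question here):
`task.wait_for_status()
task.reload()  # or task.refresh(), as asked above
print(task.artifacts["output"].get())`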
Thanks AgitatedDove14 !
Could we add this task.refresh() to the docs? Might be helpful for other users as well 🙂 OK! Maybe there is a middle ground: for artifacts already registered, simply return the entry; for artifacts that are not there yet, contact the server to retrieve them
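The proposed middle ground could look roughly like this (a hypothetical helper, not existing clearml API):
`def get_artifact(task, name):
    entry = task.artifacts.get(name)  # already registered: return the entry directly
    if entry is not None:
        return entry
    task.reload()                     # not found locally: contact the server
    return task.artifacts[name]       # still raises KeyError if it really doesn't exist`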
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs; since there is no limit by default, their size will grow forever, which doesn't sound ideal https://docs.docker.com/compose/compose-file/#logging
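Something like this per service (the service name here is just an example):
`services:
  fileserver:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"   # rotate each log file at 10MB
        max-file: "3"     # keep at most 3 rotated files`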
That said, you might have accessed the artifacts before any of them were registered
I called task.wait_for_status() to make sure the task is done
Not really: I just need to find the one that is compatible with torch==1.3.1
/opt/clearml/data/fileserver does not appear anywhere, sorry for the confusion - It’s the actual location where the files are stored
Ok, by setting PyJWT==1.7.1 in the setup.py of the experiment, pip did not enforce the update
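i.e. something along these lines (everything except the PyJWT pin is a placeholder):
`from setuptools import setup

setup(
    name="my_experiment",   # placeholder
    install_requires=[
        "PyJWT==1.7.1",     # hard pin, so pip will not pull a newer version
    ],
)`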
I checked the commit date and branch, went to all experiments, and scrolled until I found the experiment
AgitatedDove14 awesome! By "include it all" do you mean a wizard for Azure and GCP?
MagnificentSeaurchin79 You could also just fork the tensorflow repo, make your changes in a specific branch, and specify your forked repo with your custom branch in the install_requires of your setup.py
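For example, using a pip direct reference (fork URL and branch name are placeholders):
`from setuptools import setup

setup(
    name="my_project",  # placeholder
    install_requires=[
        # PEP 508 direct reference to a fork, pinned to a custom branch
        "tensorflow @ git+https://github.com/<your-user>/tensorflow.git@my-custom-branch",
    ],
)`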
Bottom line is: trains-server uses the elasticsearch image docker.elastic.co/elasticsearch/elasticsearch:5.6.16, which does not have an unlimited license (only a free license that expires after some time). From version 6.3, elasticsearch provides an unlimited free license. Trains should use >=6.3, WDYT?
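Concretely, that would just mean bumping the image tag in the docker-compose file, e.g. (the exact tag here is an example):
`elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:6.3.2`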
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the CUDA/cuDNN libraries required by torch
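A quick way to verify (sketch): since the wheel bundles its own CUDA runtime, torch reports a CUDA version even with no system-wide cudatoolkit installed:
`import torch

print(torch.__version__)          # e.g. 1.3.1
print(torch.version.cuda)         # CUDA version bundled in the wheel
print(torch.cuda.is_available())  # True as long as a compatible driver is present`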
That's why I suspected trains was installing a different version than the one I expected
I was rather wondering why clearml was taking space while I configured it to use the /data volume. But as you described AgitatedDove14 it looks like an edge case, so I don’t mind 🙂
I asked this question some time ago, I think this is just not implemented but it shouldn’t be difficult to add? I am also interested in such a feature!
Mmmh unfortunately not easily… I will try to debug deeper today, is there a way to resume a task from code to debug locally?
Something like replacing Task.init with Task.get_task so that Task.current_task is the same task as the output of Task.get_task
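i.e. roughly this behavior (a sketch of what is being requested, not how clearml currently works):
`from clearml import Task

task = Task.get_task(task_id="...")  # placeholder id of the task to resume
# desired: the fetched task also becomes the current task of this process
assert Task.current_task() is task   # this is the feature being asked for`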
Ok, deleting the installed packages list worked for the first task
Hi CostlyOstrich36 , I mean insert temporary access keys
Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the memory leak
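For reference, what was tried, plus explicit figure closing, which is a common complementary step when figures accumulate (an assumption, not something confirmed to help here):
`import matplotlib
matplotlib.use("agg")  # must be set before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
plt.close(fig)  # releases the figure; pyplot otherwise keeps a global reference`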
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
As for disk space: I have 21GB available (8GB used); the /opt/trains/data folder is about 600MB
I execute the clearml-agent this way:
`/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached`
Hi CumbersomeCormorant74 yes, this is almost the scenario: I have a dozen projects. In one of them, I have ~20 archived experiments in different states (draft, failed, aborted, completed). I went to this archive, selected all of them, and deleted them using the bulk delete operation. I got several failed-delete popups. So I tried again with smaller batches (like 5 experiments at a time) to isolate the experiments causing the error. I could delete most of them. At some point, all ...