Yes, that's correct. I don't want to re-download datasets because of their large size.
From an efficiency perspective, we should be pulling data as we feed it into training. That said, it's always a good idea to uncompress large zip files and store them as smaller ones, so you can batch-pull them for training.
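For example, here's a rough sketch of that idea, assuming the extracted archive has already been split into smaller chunk directories (the chunks/ layout, dataset names, and project name below are all hypothetical) - each chunk is registered as its own ClearML dataset so training can pull them one at a time:
from pathlib import Path
from clearml import Dataset

# Hypothetical layout: ./chunks/chunk_00, ./chunks/chunk_01, ... each holding a slice of the data
for chunk_dir in sorted(Path("chunks").iterdir()):
    ds = Dataset.create(
        dataset_name=f"my_data_{chunk_dir.name}",  # hypothetical naming scheme
        dataset_project="my_datasets",             # hypothetical project name
    )
    ds.add_files(str(chunk_dir))  # register only this chunk's files
    ds.upload()                   # upload to the configured storage
    ds.finalize()                 # freeze the version so it can be pulled later
At training time, each chunk can then be fetched on demand with Dataset.get(...).get_local_copy() instead of pulling one huge archive up front.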
Hi ExcitedSeaurchin87, I think the files are being downloaded to the cache, and the cache simply overwrites older files. How are you running the agent exactly?
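(For reference, a quick way to see where a local copy actually lands is to print the path returned by get_local_copy() - on a default setup it's somewhere under the ClearML cache directory, usually ~/.clearml/cache. The project and dataset names below are placeholders:)
from clearml import Dataset

# Placeholder project/dataset names - substitute your own
ds = Dataset.get(dataset_project="my_datasets", dataset_name="my_data")
local_path = ds.get_local_copy()  # downloads a read-only copy into the ClearML cache
print(local_path)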
ExcitedSeaurchin87, Hi 🙂
I think it's the correct behavior - you wouldn't want leftover files flooding your computer.
Regarding preserving the datasets - I'm guessing you're doing the pre-processing & training in the same task, so if the training fails you don't want to re-download the data?
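If that's the case, one possible workaround (just a sketch, not necessarily the recommended pattern) is to copy each dataset out of the cache into a persistent folder with get_mutable_local_copy(), so a failed training run doesn't cost you the download - the target directory below is hypothetical:
from clearml import Dataset

# Hypothetical persistent location outside the ClearML cache
PERSISTENT_DIR = "/data/clearml_datasets"

ds = Dataset.get(dataset_project="my_datasets", dataset_name="my_data")
# Copies the dataset files from the cache into the target folder
path = ds.get_mutable_local_copy(f"{PERSISTENT_DIR}/my_data")
print(path)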
Sorry, I probably misunderstood you. I just downloaded the clearml-agent package to my machine and ran the agent with the following command:
python -m clearml_agent daemon --queue default dinara --docker --detached
ClearML agent will delete all datasets
I'm not sure I understood how you ran the agent...
SuccessfulKoala55
I initialized the task with Python:
from clearml import Task

task = Task.init(project_name=args.project_name, task_name=args.task_name)
and downloaded a set of datasets later in the code:
import clearml

for dataset_name in datasets_list:
    clearml_dataset = clearml.Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name)
    clearml_dataset_path = clearml_dataset.get_local_copy()
Then I go through the resulting directories in search of the files I need and pass their paths to a PyTorch Dataset object. If the run fails somewhere later, I want the downloaded datasets to be preserved.
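Roughly, the rest of the flow looks like this (reusing datasets_list and dataset_project from above; the *.npz pattern and the MyDataset class are simplified placeholders for what I actually do):
from pathlib import Path

import clearml
from torch.utils.data import Dataset as TorchDataset

class MyDataset(TorchDataset):  # placeholder PyTorch dataset
    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        return self.file_paths[idx]  # the real code loads and transforms the file here

file_paths = []
for dataset_name in datasets_list:
    clearml_dataset = clearml.Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name)
    clearml_dataset_path = clearml_dataset.get_local_copy()
    # collect the files I actually need from the downloaded copy (placeholder pattern)
    file_paths.extend(str(p) for p in Path(clearml_dataset_path).rglob("*.npz"))

train_dataset = MyDataset(file_paths)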
So when you say the files are deleted, how can you tell? Where did you look for them?