Yes, that's correct. I don't want to re-download datasets because of their large size.
From an efficiency perspective, we should be pulling data as we feed it into training. That said, it's always a good idea to uncompress large zip files and store them as smaller ones, so you can batch-pull just what you need for training.
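Something along these lines, for example (just a sketch; the folder, project/dataset names and chunk size are placeholders):

from pathlib import Path
from clearml import Dataset

# placeholder: folder containing the already-uncompressed files
files = sorted(Path("/data/uncompressed").rglob("*.jpg"))
chunk_size = 1000  # arbitrary example size

for i in range(0, len(files), chunk_size):
    ds = Dataset.create(
        dataset_name=f"my_dataset_part_{i // chunk_size}",  # placeholder naming scheme
        dataset_project="my_project",
    )
    for f in files[i:i + chunk_size]:
        ds.add_files(path=str(f))
    ds.upload()
    ds.finalize()

Then during training you only Dataset.get() the parts you actually need.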
So when you say the files are deleted, how can you tell? Where did you look for them?
The ClearML agent will delete all datasets.
I'm not sure I understood how you ran the agent...
Sorry, I probably misunderstood you. I just installed the clearml-agent package on my machine and ran the agent with the following command:
python -m clearml_agent daemon --queue default dinara --docker --detached
ExcitedSeaurchin87, Hi 🙂
I think it's correct behavior: you wouldn't want leftover files flooding your computer.
Regarding preserving the datasets: I'm guessing you're doing the pre-processing and training in the same task, so if the training fails you don't want to re-download the data?
Hi ExcitedSeaurchin87, I think the files are being downloaded to the cache, and the cache simply overwrites older files. How are you running the agent exactly?
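One quick way to check is to print where the local copy actually ends up (the project/dataset names below are placeholders); by default it should be somewhere under ~/.clearml/cache:

from clearml import Dataset

# placeholder names - use one of the datasets you pulled
ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
print(ds.get_local_copy())  # prints the cached local folder the dataset copy was extracted to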
SuccessfulKoala55
I initialized the task with Python:
from clearml import Task

task = Task.init(project_name=args.project_name, task_name=args.task_name)
and downloaded a set of datasets later in the code:
import clearml

for dataset_name in datasets_list:
    clearml_dataset = clearml.Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name)
    clearml_dataset_path = clearml_dataset.get_local_copy()
Then I go through the resulting directories looking for the files I need and pass their paths to a PyTorch Dataset object. If the run fails somewhere later, I want to keep these downloaded datasets.
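Roughly like this (simplified sketch; the file pattern and the dataset class are placeholders for what I actually do):

from pathlib import Path
from torch.utils.data import Dataset as TorchDataset

class FileListDataset(TorchDataset):
    # placeholder PyTorch dataset that just wraps the collected file paths
    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # the real code loads and returns the sample stored at this path
        return self.file_paths[idx]

file_paths = []
for dataset_name in datasets_list:  # same variables as in the snippet above
    local_path = clearml.Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name).get_local_copy()
    file_paths.extend(str(p) for p in Path(local_path).rglob("*.npz"))  # example file pattern

train_dataset = FileListDataset(file_paths)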