Answered

Hi All! Currently I Am Trying To Create A Tool That Can Perform Certain Operations On Dataset Ids, This Is A Skeleton Of What I Have In Mind (Based On The Examples):

Hi all! Currently I am trying to create a tool that can perform certain operations on dataset ids. This is a skeleton of what I have in mind (based on the examples):

```
from argparse import ArgumentParser
from clearml import Dataset

# adding command line interface, so it is easy to use
parser = ArgumentParser()
parser.add_argument('--dataset', default='aayyzz', type=str, help='Dataset ID to train on')
parser.add_argument('--height', default=1.6, type=float, help='Minimum height')
args = parser.parse_args()

# getting a local copy of the dataset
parent_dataset = Dataset.get(dataset_id=args.dataset)
dataset_folder = parent_dataset.get_mutable_local_copy()

# Here I filter files on dataset_folder based on the height.
...

# Create a new dataset and upload files (?)
child_dataset = Dataset.create(..., parent_datasets=[parent_dataset])
child_dataset.add_files(dataset_folder)
child_dataset.upload()
```

I just wanted to know if this is the best approach or if there are other methods on Dataset that can help. Some questions regarding the approach:
- Will it generate two copies of the dataset even if the operation only removes some files from `dataset_folder`? If some files are changed, will there be any difference?
- If I have the files on aws, gcp, etc., does `get_mutable_local_copy()` download the files every time, or does it work like the artifacts, where there is caching? Assume I run two different operations from the same parent dataset, i.e. filter by height and filter by age.

  				
Posted 4 years ago by GrievingTurkey78


Answers 2

Hi GrievingTurkey78
First, I would look at the `clearml-data` CLI as a baseline for implementing such a tool:
Docs:
https://github.com/allegroai/clearml/blob/master/docs/datasets.md
Implementation:
https://github.com/allegroai/clearml/blob/master/clearml/cli/data/main.py
Regarding your questions:
(1) No, a new dataset version only stores the diff from its parent (if files are removed, it stores metadata recording that the files were removed).
(2) Yes, any get operation will download, unzip, and merge the files into local storage for easier access. The "mutable" copy creates a copy of the files, whereas the "regular" get creates soft links to the locally cached copy of the unzipped files.
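
To make the points above concrete, here is a rough sketch of how the filtering tool could be put together. It is only one possible approach, not the reference implementation: `remove_files_below_height`, `create_filtered_version`, and the `read_height` callback are hypothetical names made up for illustration (the thread never says how a file's height is determined), and it assumes `sync_folder()` picks up the removed files so that the child version records only the difference from its parent.

```python
from pathlib import Path


def remove_files_below_height(folder, min_height, read_height):
    """Delete files whose height is below min_height.

    read_height is a hypothetical callback (Path -> float) supplied by the caller.
    Returns the removed paths, relative to folder, in sorted order.
    """
    root = Path(folder)
    removed = []
    # sorted() materializes the listing before anything is deleted
    for path in sorted(root.rglob("*")):
        if path.is_file() and read_height(path) < min_height:
            path.unlink()
            removed.append(path.relative_to(root).as_posix())
    return removed


def create_filtered_version(parent_id, min_height, read_height,
                            target_folder, name, project):
    """Sketch: build a child dataset version holding only the filtered files."""
    # Imported lazily so the helper above can be used without ClearML installed
    from clearml import Dataset

    parent = Dataset.get(dataset_id=parent_id)
    # Mutable copy: real files (not soft links), so deleting them is safe
    folder = parent.get_mutable_local_copy(target_folder=target_folder)
    remove_files_below_height(folder, min_height, read_height)

    child = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent.id],
    )
    # Assumption: syncing against the edited folder registers removals and changes
    child.sync_folder(local_path=folder)
    child.upload()
    child.finalize()
    return child.id
```

If the files are only read and never modified, `get_local_copy()` may be the lighter option, since per the answer above it links to the cached copy instead of duplicating it.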

  				
Posted 4 years ago by AgitatedDove14

Thanks AgitatedDove14

  				
Posted 4 years ago by GrievingTurkey78
