Hello! Is There Any Way To Download A Part Of Dataset? For Instance, I Have A Large Dataset Which I Periodically Update By Adding A New Batch Of Data And Creating A New Dataset. Once, I Found Out Mistakes In Data, And I Want To Download An Exact Folder/Ba

Answered

Hello!
Is there any way to download only part of a dataset? For instance, I have a large dataset which I periodically update by adding a new batch of data and creating a new dataset. At one point I found mistakes in the data, and I want to download one specific folder/batch of the dataset to my local machine to check the data without downloading the whole dataset.

  				
Posted 2 years ago by TeenyBeetle18

Answers 5

Thank you, that is a good way to handle it. Of course, it would be great to have such functionality in ClearML. This is the only thing stopping me from deploying it.

  				
Posted 2 years ago by TeenyBeetle18

If the data is updated in the same local / network folder structure, which serves as the dataset's single point of truth, you can schedule a script that uses the dataset sync functionality to update the dataset based on the modifications made to the folder.

You can then modify precisely what you need in that structure and get a new, updated dataset version.
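A minimal sketch of such a scheduled sync script might look like the following. It assumes the ClearML SDK calls Dataset.get, Dataset.create, sync_folder, upload and finalize behave as documented; the project name, dataset name and folder path are hypothetical placeholders, and the helper names are made up for illustration.

```python
from typing import Any


def sync_dataset_from_folder(dataset: Any, local_folder: str) -> Any:
    """Mirror local_folder into a not-yet-finalized dataset version and publish it.

    `dataset` is expected to behave like a clearml.Dataset that was just
    created on top of the previous version.
    """
    # Compare the folder with the dataset content and record the
    # added, modified and removed files.
    dataset.sync_folder(local_path=local_folder)
    # Package and upload the pending changes.
    dataset.upload()
    # Close the version so it can serve as a parent for the next sync.
    dataset.finalize()
    return dataset


def main() -> None:
    # Imported lazily so the helper above can be tried without a ClearML server.
    from clearml import Dataset

    # Hypothetical names: replace with your own project, dataset and folder.
    previous = Dataset.get(dataset_project="my_project", dataset_name="train_data")
    new_version = Dataset.create(
        dataset_project="my_project",
        dataset_name="train_data",
        parent_datasets=[previous.id],
    )
    sync_dataset_from_folder(new_version, "/data/train_data")
```

Calling main() from a cron job or another scheduler would then produce one new dataset version per run.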

  				
Posted 2 years ago by SweetBadger76

Hi TeenyBeetle18
If the dataset can be built from a local machine, you can use sync_folder (SDK: https://clear.ml/docs/latest/docs/references/sdk/dataset#sync_folder or CLI: https://clear.ml/docs/latest/docs/clearml_data/data_management_examples/data_man_folder_sync#syncing-a-folder ). You would then be able to modify any part of the dataset and create a new version containing only the items that changed.

There is also an option to download only parts of the dataset: have a look at the parameters part and num_parts of https://clear.ml/docs/latest/docs/references/sdk/dataset#get_mutable_local_copy .
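As a rough sketch of how those two parameters could be used, the helper below requests a single slice of a dataset as an editable copy. The helper name, the target folder and the dataset names in the usage note are hypothetical; only get_mutable_local_copy with its target_folder, part and num_parts arguments comes from the SDK.

```python
from typing import Any, Optional


def fetch_partial_copy(
    dataset: Any,
    target_folder: str,
    part: int,
    num_parts: Optional[int] = None,
) -> Optional[str]:
    """Copy only slice `part` of the dataset into target_folder.

    Parts are zero-based. When num_parts is omitted, ClearML is expected to
    treat every stored chunk as one part.
    """
    if part < 0 or (num_parts is not None and part >= num_parts):
        raise ValueError("part must satisfy 0 <= part < num_parts")
    return dataset.get_mutable_local_copy(
        target_folder=target_folder,
        part=part,
        num_parts=num_parts,
    )
```

With a real dataset this could be called as fetch_partial_copy(Dataset.get(dataset_project="my_project", dataset_name="train_data"), "./partial_copy", part=0, num_parts=4). Note that the split follows the stored chunks, not the folder layout, so a given folder is not guaranteed to land in one part.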

If you need more detail, could you please give us some more information about what you need to achieve?

  				
Posted 2 years ago by SweetBadger76

Let’s say I have a dataset from source A. The dataset is finalised, uploaded, and looks like this:
train_data/data_from_source_A
Each month I receive a new batch of data, create a new dataset and upload it. After a few months my dataset looks like this:
train_data/data_from_source_A
train_data/data_from_source_B
train_data/data_from_source_C
train_data/data_from_source_D
train_data/data_from_source_E
Each batch of data was added by creating a new dataset and adding files. Now I have a large dataset. I can download all of the data to a local server and start training. Let’s say I find out that the data in data_from_source_C has some issue. I want a data engineer on my team to download exactly this folder and fix the issue (it can be anything). How can this be done without downloading the whole dataset?

  				
Posted 2 years ago by TeenyBeetle18

I want to download an exact folder/batch of the dataset to my local machine to check data out without downloading whole dataset.

TeenyBeetle18 the closest you can get is to download only one part of the dataset, if it is a multi-part dataset (i.e. the dataset version is larger than the default 500MB, so it is stored as multiple zip files, and you want to download just one of them rather than all of them).
This can be achieved with:
Dataset.get_local_copy(..., part=0)
https://github.com/allegroai/clearml/blob/717edba8c2b39fb7486bd2aba9ca0294f309b4c3/clearml/datasets/dataset.py#L683
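Building on that call, one possible way to look for a specific folder is to fetch the chunks one at a time and stop at the first one that contains it. This is only a sketch: the helper name is made up, it assumes get_num_chunks() reports the number of stored chunks, and it does not handle a folder whose files are spread over several chunks. In the worst case it still downloads every chunk.

```python
import os
from typing import Any, Optional


def find_part_with_folder(dataset: Any, folder: str) -> Optional[str]:
    """Return the local root of the first chunk that contains `folder`.

    `folder` is a path relative to the dataset root, for example
    "train_data/data_from_source_C". Returns None if no chunk has it.
    """
    for part in range(dataset.get_num_chunks()):
        # Download (or reuse from the cache) only this chunk.
        local_root = dataset.get_local_copy(part=part)
        if local_root and os.path.isdir(os.path.join(local_root, folder)):
            return local_root
    return None
```

The returned path points at a cached, read-only style copy, so any fixes would still need to be made in a separate working folder and uploaded as a new version.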

  				
Posted 2 years ago by AgitatedDove14

2K Views
