Answered
Hello! Is There Any Way To Download A Part Of Dataset? For Instance, I Have A Large Dataset Which I Periodically Update By Adding A New Batch Of Data And Creating A New Dataset. Once, I Found Out Mistakes In Data, And I Want To Download An Exact Folder/Ba

Hello!
Is there any way to download only part of a dataset? For instance, I have a large dataset which I periodically update by adding a new batch of data and creating a new dataset. At one point I found mistakes in the data, and I want to download one specific folder/batch of the dataset to my local machine to inspect it, without downloading the whole dataset.

  
  
Posted 2 years ago

Answers 5


Let’s say I have a dataset from source A. The dataset is finalised, uploaded, and looks like this:
train_data/data_from_source_A
Each month I receive a new batch of data, create a new dataset and upload it. After a few months my dataset looks like this:
train_data/data_from_source_A
train_data/data_from_source_B
train_data/data_from_source_C
train_data/data_from_source_D
train_data/data_from_source_E
Each batch of data was added by creating a new dataset and adding files. Now I have a large dataset. I can download the whole dataset to a local server and start training. Let’s say I found out that the data in data_from_source_C has some issue. I want to let a data engineer from my team download exactly this folder and fix the issue (it can be anything). How can this be done without downloading the whole dataset?

  
  
Posted 2 years ago

If the data is always updated in the same local or network folder structure, and that folder serves as the dataset's single source of truth, you can schedule a script that uses the dataset sync functionality to update the dataset based on the modifications made to the folder.

You can then modify precisely what you need in that structure and get a new, updated dataset version.
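A minimal sketch of what such a scheduled sync script could look like, for illustration only. It assumes the `clearml` package is installed and configured against your server. `sync_new_version` and its arguments are hypothetical names; only `Dataset.create`, `sync_folder`, `upload` and `finalize` come from the ClearML SDK.

```python
from pathlib import Path


def sync_new_version(folder, project, name, parent_id=None):
    """Create a new dataset version mirroring `folder` (hypothetical helper)."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(str(root))

    # Imported lazily so this module loads even where clearml is not installed.
    from clearml import Dataset

    ds = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        # Building on the previous version keeps unchanged files out of the upload.
        parent_datasets=[parent_id] if parent_id else None,
    )
    # Registers added/modified files and drops files no longer present locally.
    ds.sync_folder(local_path=str(root))
    ds.upload()
    ds.finalize()
    return ds.id
```

Run from cron or any other scheduler, each call would produce one new version. The id it returns can be passed as `parent_id` on the next run.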

  
  
Posted 2 years ago

Thank you, it's a good way to handle it. Of course, it would be great to have such functionality in ClearML itself. This is the only thing stopping me from deployment.

  
  
Posted 2 years ago

Hi TeenyBeetle18
If the dataset can basically be built from a local machine, you could use sync_folder (SDK: https://clear.ml/docs/latest/docs/references/sdk/dataset#sync_folder or CLI: https://clear.ml/docs/latest/docs/clearml_data/data_management_examples/data_man_folder_sync#syncing-a-folder ). Then you would be able to modify any part of the dataset and create a new version containing only the items that changed.

There is also an option to download only parts of the dataset: have a look at the part and num_parts parameters of https://clear.ml/docs/latest/docs/references/sdk/dataset#get_mutable_local_copy .

If you need more precise guidance, could you please give us some more details on what you need to achieve?
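As a rough sketch of that second option, the call could be wrapped like this. `mutable_part_copy` and its argument checks are hypothetical additions. `Dataset.get` and `get_mutable_local_copy` with `part` / `num_parts` are the SDK calls referenced above, and `clearml` is assumed to be installed and configured.

```python
def mutable_part_copy(dataset_id, target_folder, part, num_parts=None):
    """Copy a single part of a dataset into an editable folder (hypothetical helper)."""
    # Sanity checks added by this sketch, not by the SDK.
    if part < 0:
        raise ValueError("part must be a non-negative index")
    if num_parts is not None and part >= num_parts:
        raise ValueError("part must be smaller than num_parts")

    # Imported lazily so this module loads even where clearml is not installed.
    from clearml import Dataset

    ds = Dataset.get(dataset_id=dataset_id)
    # Only the requested part is fetched into target_folder.
    return ds.get_mutable_local_copy(
        target_folder=target_folder, part=part, num_parts=num_parts
    )
```

The returned folder is a plain copy, so files in it can be edited freely before creating a new version.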

  
  
Posted 2 years ago

I want to download an exact folder/batch of the dataset to my local machine to check data out without downloading whole dataset.

TeenyBeetle18, the closest you can get is to download only one part of the dataset, if this is a multi-part dataset (i.e. the dataset version is larger than the default 500MB, so you have multiple zip files and want to download just one of them rather than all of them).
This can actually be achieved with:
Dataset.get_local_copy(..., part=0)
https://github.com/allegroai/clearml/blob/717edba8c2b39fb7486bd2aba9ca0294f309b4c3/clearml/datasets/dataset.py#L683
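Spelled out a little more, as a sketch under assumptions: `get_local_copy` is called on a dataset object returned by `Dataset.get`, and `clearml` is assumed to be installed and configured. The helper names and the dataset id are placeholders. Which part holds a given folder is not stated in this thread, so the sketch simply filters whatever the downloaded part contains.

```python
from pathlib import Path


def files_under(paths, folder):
    """Keep only the relative paths that sit inside `folder`."""
    prefix = folder.strip("/") + "/"
    return [p for p in paths if p.replace("\\", "/").startswith(prefix)]


def fetch_part(dataset_id, part):
    """Download one part of a dataset and list the files it contains."""
    # Imported lazily so this module loads even where clearml is not installed.
    from clearml import Dataset

    ds = Dataset.get(dataset_id=dataset_id)
    root = Path(ds.get_local_copy(part=part))
    # Paths relative to the download root, written with forward slashes.
    files = [f.relative_to(root).as_posix() for f in root.rglob("*") if f.is_file()]
    return root, files
```

For example, `root, files = fetch_part("<dataset id>", part=0)` followed by `files_under(files, "train_data/data_from_source_C")` would show whether that part includes anything from the problematic batch.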

  
  
Posted 2 years ago