Answered
Hello Community! Is There an Option to Only Download a Part of a Dataset with .get_local_copy()? I Imagine Something Like This, but I Can't Find the Right Way to Do It.

Hello community! Is there an option to only download a part of a Dataset with .get_local_copy()? I imagine something like this, but I can't find the right way to do it.
` ds = Dataset.get(dataset_ID)
filelist = ds.list_files()
ds.get_local_copy(filelist[:5]) `
My datasets are large, and for testing code locally I would like to be able to download only parts of a Dataset. Thank you!

  
  
Posted 2 years ago

Answers 26


shared "warm" folder without having to download the dataset locally.

This is already supported 🙂
Configure the sdk.storage.cache.default_base_dir in your clearml.conf to point to a shared (mounted) folder
https://github.com/allegroai/clearml-agent/blob/21c4857795e6392a848b296ceb5480aca5f98e4b/docs/clearml.conf#L205
That's it 🙂
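
For reference, the relevant part of clearml.conf might look roughly like the fragment below. The mount path is only a placeholder, so point it at whatever shared folder is actually mounted on your machines.

```
sdk {
    storage {
        cache {
            # placeholder path: a folder mounted on every training machine
            default_base_dir: "/mnt/shared/clearml-cache"
        }
    }
}
```

With this in place, get_local_copy() on any machine that mounts the same folder should find the cached copy there instead of downloading it again.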

  
  
Posted 2 years ago

Let's say that this small dataset has an ID and I can use the get_local_copy() method to cache it locally, and then I can use the remote servers to train on it. But I would like to have the same flow without downloading the full dataset, which is stored remotely.

  
  
Posted 2 years ago

This get_local_copy() method is only useful for applications where the datasets are under roughly 10 GB and the training machine is the same as the dev machine. For most of us (researchers) that is not the case: we share GPU time, and this is where ClearML comes in.
Requirements: there should be only a single copy of the large dataset, preserving the original folder structure, which is presumed to be available remotely, and read-only (non-mutable) access should be provided via dataset_id. This solves everything, or at least most of it.

  
  
Posted 2 years ago

and this path should follow the Linux folder structure, not be a single file like the current .zip.

I like where this is going 🙂
So are we thinking of something like a "shared" folder where the data is kept "warm", plus a single source of truth where the packaged zip file is stored (like object storage, e.g. S3)?

  
  
Posted 2 years ago

So for this...

Sorry, what exactly is "this"?

  
  
Posted 2 years ago

🤞

  
  
Posted 2 years ago

TimelyPenguin76 Could you please clarify the process? I cannot find this in the docs. How do I create a parent-child Dataset with the same dataset_id and only access the child?

  
  
Posted 2 years ago

Thank you for clarifying the parent-child thing. When I say accessing, I mean I want to use the data for training (without actually getting a local copy of it). The whole dataset (both large and small) could be created and uploaded by an admin. As a researcher, I normally work with a smaller dataset, similar to what SucculentBeetle7 has stated. You should also note that this whole training happens on a remote server, so this situation applies: https://clear.ml/docs/latest/docs/getting_started/ds/best_practices#train-remotely .

  
  
Posted 2 years ago

I see...
Currently (and this will change soon) the entire delta is stored in a single file, so there is no real way to download a "subset" of the data, only a parent version 😞

Let's say that this small dataset has an ID ....

Yes, this would be exactly the way to do it:

` param = {'dataset': small_train_dataset_id_here}
task.connect(param)

dataset_folder = Dataset.get(param['dataset']).get_local_copy()
... `
Locally it will use the small_train_dataset_id_here. Then, when launched remotely, you can change the new parameter "dataset" to the full dataset ID. The code will not change, as task.connect is a two-way function: when running locally it stores the content in the UI, and when running remotely it takes the parameters from the UI and puts them back into the dict 🙂
wdyt ?
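
To make the two-way behaviour a bit more concrete, here is a minimal sketch of the same idea wrapped in a helper. The helper name, the project/task names and the placeholder ID are made up for illustration; only Task.init, task.connect, Dataset.get and get_local_copy are actual ClearML calls.

```python
def get_dataset_folder(task, dataset_cls, default_dataset_id):
    """Return the local folder of the dataset selected via a connected parameter."""
    params = {"dataset": default_dataset_id}
    # task.connect() registers the dict with the task. On a remote run, the
    # value edited in the UI is written back into this same dict.
    task.connect(params)
    return dataset_cls.get(dataset_id=params["dataset"]).get_local_copy()


def main():
    # Imported here so the helper above does not need ClearML at import time.
    from clearml import Dataset, Task

    # Hypothetical project/task names and a placeholder dataset ID.
    task = Task.init(project_name="examples", task_name="train on dataset")
    folder = get_dataset_folder(task, Dataset, "small_train_dataset_id_here")
    print(folder)
```

Calling main() locally would use the small dataset. After cloning the task and editing the "dataset" parameter in the UI, the remote run would resolve the full dataset with the same code.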

  
  
Posted 2 years ago

There should be a method called read_remote_copy(str:dataset_id, str:dataset_tag, bool:mutable), and it should return the path of the remote data.

  
  
Posted 2 years ago

Anyone who is using a small dataset can afford to go with get_local_copy().

  
  
Posted 2 years ago

So for this, should I create a proper issue on GitHub? Or is this being picked up internally, AgitatedDove14?

  
  
Posted 2 years ago

shared "warm" folder without having to download the dataset locally.

  
  
Posted 2 years ago

BitterLeopard33

How to create a parent-child Dataset with a same dataset_id and only access the child?

The Dataset ID is unique, so the child will have a different UID. The name of the Dataset can be the same, though.
Specifically, to create a child Dataset:
https://clear.ml/docs/latest/docs/clearml_data#datasetcreate
child = Dataset.create(..., parent_datasets=['parent_dataset_id'])

Are there any ways to access the parent dataset (assuming it's large and I don't want to download it)

What do you mean by accessing it without actually downloading the files? Do you mean listing them?
https://clear.ml/docs/latest/docs/references/sdk/dataset#list_files
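
If listing is what you are after, a rough sketch could look like this. preview_files is a made-up helper and the dataset ID and "train/" prefix are placeholders; the only ClearML calls are Dataset.get and list_files, which returns the relative paths registered in the dataset.

```python
def preview_files(dataset, prefix="", limit=5):
    """Return up to `limit` relative file paths registered in the dataset."""
    # list_files() reads the dataset's file index; the data files themselves
    # are not downloaded by this call.
    paths = [p for p in dataset.list_files() if p.startswith(prefix)]
    return sorted(paths)[:limit]


def main():
    # Imported lazily; needs a configured ClearML setup to actually run.
    from clearml import Dataset

    ds = Dataset.get(dataset_id="parent_dataset_id")  # placeholder ID
    print(preview_files(ds, prefix="train/"))
```

Note that this only tells you what is in the dataset. It does not give you a way to fetch just those files.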

  
  
Posted 2 years ago

Yes, a structure similar to a shared folder should be the optimal solution. But I don't understand what you mean by "warm"!!

  
  
Posted 2 years ago

When I say accessing, I mean I want to use the data for training (without actually getting a local copy of it).

How can you "access" it without downloading it?
Do you mean train locally on a subset, then on the full dataset remotely?

  
  
Posted 2 years ago

Kinda yes.

  
  
Posted 2 years ago

Thank you! Yes that might be the best option. I'll have to divide it already when I create the datasets then, right?

  
  
Posted 2 years ago

and this path should follow the Linux folder structure, not be a single file like the current .zip.

  
  
Posted 2 years ago

Hi SucculentBeetle7 ,

get_local_copy() will return the entire dataset (the zip file), but you can divide the dataset into several smaller datasets and have the same parent for all of them. What do you think?
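
One way such a split might be set up is sketched below, here as a small version plus a full version that inherits from it. This is only an assumption about how you could arrange it: the helper name, the "-small"/"-full" suffixes and the two source folders are made up, while Dataset.create, add_files, upload and finalize are the regular ClearML calls.

```python
def create_small_and_full(dataset_cls, project, name, small_dir, rest_dir):
    """Create a small dataset, then a full dataset that inherits from it.

    Returns (small_id, full_id).
    """
    small = dataset_cls.create(dataset_name=name + "-small", dataset_project=project)
    small.add_files(path=small_dir)
    small.upload()
    small.finalize()  # the parent is finalized before a child inherits from it

    full = dataset_cls.create(
        dataset_name=name + "-full",
        dataset_project=project,
        parent_datasets=[small.id],
    )
    full.add_files(path=rest_dir)  # only the files missing from the small version
    full.upload()
    full.finalize()
    return small.id, full.id
```

With `from clearml import Dataset`, you would pass Dataset as dataset_cls. Locally, Dataset.get(dataset_id=small_id).get_local_copy() then fetches just the small part, while the full ID gives you the parent content plus the rest.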

  
  
Posted 2 years ago

Ok, thank you!

  
  
Posted 2 years ago

yes 🙂

  
  
Posted 2 years ago

"Warm" in the sense that you do not need to sync it with the dataset yourself: every time you access the dataset, clearml will make sure it is there in the cache, and when you switch to a new dataset, the new dataset will be cached. Makes sense?

  
  
Posted 2 years ago

Thanks. Let me try it and get back to you.

  
  
Posted 2 years ago

Because this would again cause the problems I asked about yesterday. Is there any way to access the parent dataset (assuming it's large and I don't want to download it) without using get_local_copy()? That would solve a lot of problems. If so, where can I find it in the docs?

  
  
Posted 2 years ago