There should be a method called read_remote_copy(dataset_id: str, dataset_tag: str, mutable: bool)
and it should return the path of the remote data,
and this path should follow a Linux folder structure, not a single file like the current .zip.
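Something along these lines (only the signature proposed in this thread, not an existing clearml API; the name and parameters are the ones suggested above):
```python
# Hypothetical API as proposed in this thread -- NOT part of the clearml SDK today
def read_remote_copy(dataset_id: str, dataset_tag: str, mutable: bool = False) -> str:
    """Return a path to the remotely stored dataset, preserving the original
    folder structure, without downloading the full .zip locally."""
    raise NotImplementedError("feature request only")
```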
Thanks. Let me try it and get back to you.
shared "warm" folder without having to download the dataset locally.
This is already supported 🙂
Configure the sdk.storage.cache.default_base_dir
in your clearml.conf to point to a shared (mounted) folder
https://github.com/allegroai/clearml-agent/blob/21c4857795e6392a848b296ceb5480aca5f98e4b/docs/clearml.conf#L205
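For example, in clearml.conf (a minimal excerpt; /mnt/shared/clearml_cache is just a placeholder for whatever shared mount you use):
```
sdk {
    storage {
        cache {
            # cached dataset copies go under this shared (mounted) folder
            default_base_dir: "/mnt/shared/clearml_cache"
        }
    }
}
```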
That's it 🙂
TimelyPenguin76 Could you please give more clarification about the process? I cannot find this in the docs. How do I create a parent-child Dataset with the same dataset_id and only access the child?
Thank you for clarifying the parent-child thing. When I say accessing, I mean I want to use the data for training (without actually getting a local copy of it). The whole dataset (both large and small) could be created and uploaded by an admin. As a researcher, I normally work with a smaller dataset, similar to what SucculentBeetle7 has stated. Note also that the training happens on a remote server, so this situation applies: https://clear.ml/docs/latest/docs/getting_started/ds/best_practices#train-remotely .
Yes, a structure similar to a shared folder would be the optimal solution. But I don't understand what you mean by "warm"!
Thank you! Yes that might be the best option. I'll have to divide it already when I create the datasets then, right?
shared "warm" folder without having to download the dataset locally.
Let's say that this small dataset has an ID and I can use the get_local_copy()
method to cache it locally, and then I can use the remote servers to train on it. But I would like to have the same flow without downloading the full dataset, which is stored remotely.
and this path should follow a Linux folder structure, not a single file like the current .zip.
I like where this is going 🙂
So are we thinking like a "shared" folder where the data is kept "warm", and a single source of truth where the packaged zip file is stored (like object storage, e.g. S3)?
Anyone who is using a small dataset can afford to go with get_local_copy().
But this would again cause the problems I asked about yesterday. Are there any ways to access the parent dataset (assuming it's large and I don't want to download it) without using get_local_copy(),
as that would solve a lot of problems? If so, where can I find them in the docs?
BitterLeopard33
How do I create a parent-child Dataset with the same dataset_id and only access the child?
Dataset ID is unique; the child will have a different UID. The name of the Dataset can be the same though.
Specifically, to create a child Dataset:
https://clear.ml/docs/latest/docs/clearml_data#datasetcreate
child = Dataset.create(..., parent_datasets=['parent_dataset_id'])
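For example (a minimal sketch; the project/name strings and 'parent_dataset_id' are placeholders, and add_files() only needs to cover the delta on top of the parent):
```python
from clearml import Dataset

# create a child dataset that inherits everything from the (large) parent
child = Dataset.create(
    dataset_name="my_dataset",              # the name may match the parent's name
    dataset_project="data",
    parent_datasets=["parent_dataset_id"],  # placeholder: the parent's unique ID
)
child.add_files("/path/to/new_or_changed_files")  # only the delta is stored
child.upload()
child.finalize()
print(child.id)  # the child gets its own, new dataset ID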
Are there any ways to access the parent dataset (assuming it's large and I don't want to download it)
What do you mean by accessing it without actually downloading the files? Is it listing?
https://clear.ml/docs/latest/docs/references/sdk/dataset#list_files
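For example (a short sketch; "parent_dataset_id" is a placeholder, and list_files() only reads metadata, so nothing is downloaded):
```python
from clearml import Dataset

ds = Dataset.get(dataset_id="parent_dataset_id")  # placeholder ID
print(ds.list_files())  # list every file registered in the dataset, no download
```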
Hi SucculentBeetle7,
get_local_copy()
will return the entire dataset (the zip file), but you can divide the dataset into smaller datasets that all share the same parent. What do you think?
When I say accessing, I mean I want to use the data for training (without actually getting a local copy of it).
How can you "access" it without downloading it ?
Do you mean train locally on a subset, then on the full dataset remotely ?
I see...
Currently (and this will change soon) the entire delta is stored in a single file, so there is no real way to download a "subset" of the data, only a parent version 😞
Let's say that this small dataset has an ID ...
Yes, this would be exactly the way to do so:
```python
# assumes `task = Task.init(...)` was called earlier, and `small_train_dataset_id_here`
# holds the small dataset's ID
param = {'dataset': small_train_dataset_id_here}
task.connect(param)  # exposes 'dataset' as an editable parameter in the UI
dataset_folder = Dataset.get(param['dataset']).get_local_copy()
```
Locally it will use the small_train_dataset_id_here, then when launched remotely you can change the new parameter "dataset" to the full dataset ID. The code will not change, as task.connect() is a two-way function: when running locally it stores the content in the UI, and when running remotely it takes the parameters from the UI and puts them back into the dict 🙂
wdyt?
This get_local_copy()
method is only useful for applications whose datasets are in the range of < 10 GB and where the training machine is the same as the dev machine. For most of us (researchers) that's not the case; we share GPU time, and this is where clearml comes in.
Requirements: there should be only a single copy of the large dataset, preserving the original folder structure, which is presumed to be available remotely, and non-mutable access should be provided via dataset_id. This solves everything, or at least most of it.
Feature request for this: https://clearml.slack.com/archives/CTK20V944/p1629407988075800?thread_ts=1629373886.064600&cid=CTK20V944
So for this...
Sorry, what exactly is "this"?
"warm" as you do not need to sync it with the dataset, every time you access the dataset, clearml
will make sure it is there in the cache, when you switch to a new dataset the new dataset will be cached. make sense?
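In practice (a minimal sketch; "some_dataset_id" is a placeholder, and the folder lives under whatever default_base_dir points to):
```python
from clearml import Dataset

# with sdk.storage.cache.default_base_dir pointing at a shared mount,
# this resolves to a folder under that mount; repeated calls reuse the cached copy
folder = Dataset.get(dataset_id="some_dataset_id").get_local_copy()
print(folder)
```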
So for this, should I create a proper issue on GitHub? Or is this being picked up internally, AgitatedDove14?