There should be a method called read_remote_copy(dataset_id: str, dataset_tag: str, mutable: bool)
and it should return the path of the remote data,
and this path should follow a Linux folder structure, not a single file like the current .zip.
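Something along these lines (only the signature proposed in this thread, not an existing clearml API; the name and parameters are the ones suggested above):
```python
# Hypothetical API as proposed in this thread -- NOT part of the clearml SDK today
def read_remote_copy(dataset_id: str, dataset_tag: str, mutable: bool = False) -> str:
    """Return a path to the remotely stored dataset, preserving the original
    folder structure, without downloading the full .zip locally."""
    raise NotImplementedError("feature request only")
```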
Thanks. Let me try it and get back to you.
shared "warm" folder without having to download the dataset locally.
This is already supported 🙂
Configure the sdk.storage.cache.default_base_dir
in your clearml.conf to point to a shared (mounted) folder
https://github.com/allegroai/clearml-agent/blob/21c4857795e6392a848b296ceb5480aca5f98e4b/docs/clearml.conf#L205
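For example, in clearml.conf (a minimal excerpt; /mnt/shared/clearml_cache is just a placeholder for whatever shared mount you use):
```
sdk {
    storage {
        cache {
            # cached dataset copies go under this shared (mounted) folder
            default_base_dir: "/mnt/shared/clearml_cache"
        }
    }
}
```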
That's it 🙂
TimelyPenguin76 Could you please give more clarification about the process? I cannot find this in the docs. How do I create a parent-child Dataset with the same dataset_id and only access the child?
Thank you for clarifying the parent-child thing. When I say accessing, I mean I want to use the data for training (without actually getting a local copy of it). The whole dataset (both large and small) could be created and uploaded by an admin. As a researcher, I normally work with a smaller dataset, similar to what SucculentBeetle7 has stated. Note also that the training happens on a remote server, so this situation applies: https://clear.ml/docs/latest/docs/getting_started/ds/best_practices#train-remotely .
Yes, a structure similar to a shared folder would be the optimal solution. But I don't understand what you mean by "warm"!
Thank you! Yes that might be the best option. I'll have to divide it already when I create the datasets then, right?
shared "warm" folder without having to download the dataset locally.
Let's say that this small dataset has an ID and I can use the get_local_copy()
method to cache it locally, and then I can use the remote servers to train on it. But I would like to have the same flow without downloading the full dataset, which is stored remotely.
and this path should follow a Linux folder structure, not a single file like the current .zip.
I like where this is going 🙂
So are we thinking like a "shared" folder where the data is kept "warm", and a single source of truth where the packaged zip file is stored (like object storage, e.g. S3)?
Anyone who is using a small dataset can afford to go with get_local_copy().
But this would again cause the problems I asked about yesterday. Are there any ways to access the parent dataset (assuming it's large and I don't want to download it) without using get_local_copy(),
as that would solve a lot of problems? If so, where can I find them in the docs?
BitterLeopard33
How do I create a parent-child Dataset with the same dataset_id and only access the child?
Dataset ID is unique; the child will have a different UID. The name of the Dataset can be the same though.
Specifically, to create a child Dataset:
https://clear.ml/docs/latest/docs/clearml_data#datasetcreate
child = Dataset.create(..., parent_datasets=['parent_dataset_id'])
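For example (a minimal sketch; the project/name strings and 'parent_dataset_id' are placeholders, and add_files() only needs to cover the delta on top of the parent):
```python
from clearml import Dataset

# create a child dataset that inherits everything from the (large) parent
child = Dataset.create(
    dataset_name="my_dataset",              # the name may match the parent's name
    dataset_project="data",
    parent_datasets=["parent_dataset_id"],  # placeholder: the parent's unique ID
)
child.add_files("/path/to/new_or_changed_files")  # only the delta is stored
child.upload()
child.finalize()
print(child.id)  # the child gets its own, new dataset ID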
Are there any ways to access the parent dataset (assuming it's large and I don't want to download it)
What do you mean by accessing it without actually downloading the files? Is it listing?
https://clear.ml/docs/latest/docs/references/sdk/dataset#list_files
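For example (a short sketch; "parent_dataset_id" is a placeholder, and list_files() only reads metadata, so nothing is downloaded):
```python
from clearml import Dataset

ds = Dataset.get(dataset_id="parent_dataset_id")  # placeholder ID
print(ds.list_files())  # list every file registered in the dataset, no download
```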
Hi SucculentBeetle7,
get_local_copy()
will return the entire dataset (the zip file), but you can divide the dataset into smaller datasets that all share the same parent. What do you think?
When I say accessing, I mean I want to use the data for training (without actually getting a local copy of it).
How can you "access" it without downloading it ?
Do you mean train locally on a subset, then on the full dataset remotely ?
I see...
Currently (and this will change soon) the entire delta is stored in a single file, so there is no real way to download a "subset" of the data, only a parent version 😞
Let's say that this small dataset has an ID ...
Yes, this would be exactly the way to do so:
```python
# assumes `task = Task.init(...)` was called earlier, and `small_train_dataset_id_here`
# holds the small dataset's ID
param = {'dataset': small_train_dataset_id_here}
task.connect(param)  # exposes 'dataset' as an editable parameter in the UI
dataset_folder = Dataset.get(param['dataset']).get_local_copy()
```
Locally it will use the small_train_dataset_id_here, then when launched remotely you can change the new parameter "dataset" to the full dataset ID. The code will not change, as task.connect() is a two-way function: when running locally it stores the content in the UI, and when running remotely it takes the parameters from the UI and puts them back into the dict 🙂
wdyt?
This get_local_copy()
method is only useful for applications whose datasets are in the range of < 10 GB and where the training machine is the same as the dev machine. For most of us (researchers) that's not the case; we share GPU time, and this is where clearml comes in.
Requirements: there should be only a single copy of the large dataset, preserving the original folder structure, which is presumed to be available remotely, and non-mutable access should be provided via dataset_id. This solves everything, or at least most of it.
Feature request for this: https://clearml.slack.com/archives/CTK20V944/p1629407988075800?thread_ts=1629373886.064600&cid=CTK20V944
So for this...
Sorry, what exactly is "this"?
"warm" as you do not need to sync it with the dataset, every time you access the dataset, clearml
will make sure it is there in the cache, when you switch to a new dataset the new dataset will be cached. make sense?
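In practice (a minimal sketch; "some_dataset_id" is a placeholder, and the folder lives under whatever default_base_dir points to):
```python
from clearml import Dataset

# with sdk.storage.cache.default_base_dir pointing at a shared mount,
# this resolves to a folder under that mount; repeated calls reuse the cached copy
folder = Dataset.get(dataset_id="some_dataset_id").get_local_copy()
print(folder)
```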
So for this, should I create a proper issue on GitHub? Or is this being picked up internally, AgitatedDove14?