So for this, should I create a proper issue on GitHub? Or is this being picked up internally, AgitatedDove14?
Yes, a structure similar to a shared folder should be the optimal solution. But I don't understand what you mean by "warm"!
Because this would again cause the problems I asked about yesterday. Are there any ways to access the parent dataset (assuming it's large and I don't want to download it) without using get_local_copy(),
as that would solve a lot of problems? If so, where can I find them in the docs?
Anyone who is using a small dataset can afford to go with get_local_copy().
So for this...
Sorry, what exactly is "this"?
There should be a method called read_remote_copy(dataset_id: str, dataset_tag: str, mutable: bool),
and this should return the path of the remote data.
This get_local_copy() method is only useful for applications that have datasets smaller than about 10 GB and where the training machine is the same as the dev machine. For most of us (researchers) that is not the case: we share GPU time, and this is where ClearML comes in.
Requirements: the large dataset should exist as a single copy that preserves the original folder structure and is presumed to be available remotely, and non-mutable access should be provided via dataset_id. This solves everything, or at least most things.
shared "warm" folder without having to download the dataset locally.
and this path should follow a Linux folder structure, not a single file like the current .zip.
I like where this is going 🙂
So are we thinking of something like a "shared" folder where the data is kept "warm", plus a single source of truth where the packaged zip file is stored (like object storage, e.g. S3)?
Let's say that this small dataset has an ID and I can use the get_local_copy() method to cache it locally, and then I can use the remote servers to train on it. But I would like to have the same flow without downloading the full dataset, which is stored remotely.
I see...
Currently (and this will change soon) the entire delta is stored in a single file, so there is no real way to download a "subset" of the data, only a parent version 😞
Let's say that this small dataset has an ID ....
Yes this would be exactly the way to do so:
` param = {'dataset': small_train_dataset_id_here}
task.connect(param)
dataset_folder = Dataset.get(param['dataset']).get_local_copy()
... `
Locally it will use the small_train_dataset_id_here. Then, when launched remotely, you can change the parameter "dataset" to the full dataset ID. The code will not change, as task.connect is a two-way function: when running locally it stores the content in the UI, and when running remotely it takes the parameters from the UI and puts them back into the dict 🙂
wdyt ?
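For reference, the task.connect flow described above could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the project and task names, and the placeholder dataset ID are hypothetical, and clearml is imported lazily so the helper module can be loaded on a machine without a configured ClearML server.

```python
from typing import Dict

# Placeholder taken from the thread; replace with a real small-dataset ID.
SMALL_TRAIN_DATASET_ID = "small_train_dataset_id_here"


def default_params() -> Dict[str, str]:
    """Parameters used for a local run (hypothetical helper)."""
    return {"dataset": SMALL_TRAIN_DATASET_ID}


def get_dataset_folder(project_name: str, task_name: str) -> str:
    """Connect the parameters to the task and fetch the selected dataset."""
    # Lazy import: requires the clearml package and a configured server.
    from clearml import Dataset, Task

    task = Task.init(project_name=project_name, task_name=task_name)
    # Locally this logs the dict to the UI; when executed remotely by an
    # agent, the values edited in the UI are written back into the dict.
    params = task.connect(default_params())
    # Still downloads (and caches) whichever dataset ID is selected.
    return Dataset.get(dataset_id=params["dataset"]).get_local_copy()
```

With this shape, switching from the small dataset to the full one should only need the "dataset" parameter changed in the UI before the task is enqueued; the training code itself stays the same.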
Thanks. Let me try it and get back to you.
and this path should follow a Linux folder structure, not a single file like the current .zip.
Hi SucculentBeetle7 ,
get_local_copy() will return the entire dataset (the zip file), but you can divide the dataset and have the same parent for all of the parts. What do you think?
When I say accessing, I mean I want to use the data for training (without actually getting a local copy of it).
How can you "access" it without downloading it ?
Do you mean train locally on a subset, then on the full dataset remotely ?
Thank you for clarifying the parent-child thing. When I say accessing, I mean I want to use the data for training (without actually getting a local copy of it). The whole dataset (both large and small) could be created and uploaded by an admin. As a researcher, I normally work with a smaller dataset, similar to what SucculentBeetle7 has stated. You should also note that this whole training happens on a remote server. So this situation applies: https://clear.ml/docs/latest/docs/getting_started/ds/best_practices#train-remotely .
Thank you! Yes, that might be the best option. I'll have to divide it already when I create the datasets then, right?
"Warm" as in you do not need to sync it with the dataset: every time you access the dataset, clearml will make sure it is there in the cache, and when you switch to a new dataset, the new dataset will be cached. Make sense?
TimelyPenguin76 Could you please clarify the process? I cannot find this in the docs. How to create a parent-child Dataset with a same dataset_id and only access the child?
BitterLeopard33
How to create a parent-child Dataset with a same dataset_id and only access the child?
The Dataset ID is unique; the child will have a different UID. The name of the Dataset can be the same, though.
Specifically to create a child Dataset:
https://clear.ml/docs/latest/docs/clearml_data#datasetcreate
child = Dataset.create(..., parent_datasets=['parent_dataset_id'])
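A slightly fuller sketch of creating a child dataset might look like this. The function and argument names are hypothetical, the parent ID is whatever your existing dataset reports, and the clearml import is deferred so the snippet can be loaded without a server connection.

```python
def create_child_dataset(parent_id: str, name: str, project: str, files_dir: str) -> str:
    """Create a child dataset that stores only the delta on top of its parent."""
    # Lazy import: requires the clearml package and a configured server.
    from clearml import Dataset

    child = Dataset.create(
        dataset_name=name,            # may match the parent's name
        dataset_project=project,
        parent_datasets=[parent_id],  # inherit the parent's content
    )
    child.add_files(path=files_dir)   # register new or changed files
    child.upload()                    # push the delta to storage
    child.finalize()                  # close this version
    return child.id                   # unique ID, different from the parent's
```

The returned ID is what you would pass to Dataset.get() later; the parent keeps its own ID and stays untouched.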
Are there any ways to access the parent dataset (assuming it's large and I don't want to download it)
What do you mean by accessing it without actually downloading the files? Is it listing?
https://clear.ml/docs/latest/docs/references/sdk/dataset#list_files
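If listing is what is needed, a minimal sketch could look like the following. The helper names are hypothetical; as far as I understand, list_files() reads the file paths from the dataset's metadata rather than fetching the files themselves.

```python
from typing import Iterable, List


def filter_by_suffix(paths: Iterable[str], suffix: str) -> List[str]:
    """Keep only the paths ending with the given suffix (case-insensitive), sorted."""
    return sorted(p for p in paths if p.lower().endswith(suffix.lower()))


def list_dataset_files(dataset_id: str, suffix: str) -> List[str]:
    """List the relative file paths registered in a dataset, filtered by suffix."""
    # Lazy import: requires the clearml package and a configured server.
    from clearml import Dataset

    return filter_by_suffix(Dataset.get(dataset_id=dataset_id).list_files(), suffix)
```

This only tells you what is in the dataset; reading the actual file contents still goes through get_local_copy() and the cache.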
Feature request for this: https://clearml.slack.com/archives/CTK20V944/p1629407988075800?thread_ts=1629373886.064600&cid=CTK20V944
shared "warm" folder without having to download the dataset locally.
This is already supported 🙂
Configure sdk.storage.cache.default_base_dir in your clearml.conf to point to a shared (mounted) folder:
https://github.com/allegroai/clearml-agent/blob/21c4857795e6392a848b296ceb5480aca5f98e4b/docs/clearml.conf#L205
That's it 🙂
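As a rough illustration, the relevant part of clearml.conf might look like this. The mount path is a hypothetical example; use whatever shared folder is mounted on your machines.

```
sdk {
    storage {
        cache {
            # Hypothetical shared mount, visible to every training machine
            default_base_dir: "/mnt/shared/clearml_cache"
        }
    }
}
```

With the cache on a shared mount, get_local_copy() on one machine should populate a folder the other machines can reuse instead of downloading the dataset again.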