Hi TeenyBeetle18
If the dataset can essentially be built from a local machine, you could use sync_folder (SDK: https://clear.ml/docs/latest/docs/references/sdk/dataset#sync_folder or CLI: https://clear.ml/docs/latest/docs/clearml_data/data_management_examples/data_man_folder_sync#syncing-a-folder ). You would then be able to modify any part of the dataset and create a new version containing only the items that changed.
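For example, a minimal sketch of the SDK flow (the project name, dataset name, and folder path are placeholders):
```
from clearml import Dataset

# One-off example: build/refresh a dataset from a local folder
ds = Dataset.create(dataset_name="train_data", dataset_project="my_project")
ds.sync_folder(local_path="./train_data")  # record the files found in the folder
ds.upload()    # upload the file contents
ds.finalize()  # lock this dataset version
```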
There is also an option to download only parts of the dataset; have a look at the part and num_parts parameters of https://clear.ml/docs/latest/docs/references/sdk/dataset#get_mutable_local_copy .
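A rough sketch of a partial download (project/dataset names and the target folder are placeholders; note the split is by archive chunks, not by folder):
```
from clearml import Dataset

# Get a handle to an existing dataset version
ds = Dataset.get(dataset_project="my_project", dataset_name="train_data")

# Download only the first of, say, 4 parts of the dataset archives
local_path = ds.get_mutable_local_copy(
    target_folder="./partial_copy",
    part=0,
    num_parts=4,
)
```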
If you need anything more specific, could you please share some more details on what you are trying to achieve?
Let’s say I have a dataset from source A; the dataset is finalized, uploaded, and looks like this:
train_data/data_from_source_A
Each month I receive a new batch of data, create a new dataset, and upload it. After a few months my dataset looks like this:
train_data/data_from_source_A
train_data/data_from_source_B
train_data/data_from_source_C
train_data/data_from_source_D
train_data/data_from_source_E
Each batch of data was added by creating a new dataset and adding files. Now I have a large dataset. I can download the whole dataset to a local server and start training. Let’s say I find out that the data in data_from_source_C
has some issue. I want a data engineer from my team to download exactly this folder and fix the issue (it can be anything). How can this be done without downloading the whole dataset?
I want to download an exact folder/batch of the dataset to my local machine to check the data out, without downloading the whole dataset.
TeenyBeetle18 the closest you can get is to download only one part of the dataset, if this is a multi-part dataset (i.e. the dataset version is larger than the default 500MB, so you have multiple zip files, and you just want to download one of them, not all of them).
This can actually be achieved with: Dataset.get_local_copy(..., part=0)
https://github.com/allegroai/clearml/blob/717edba8c2b39fb7486bd2aba9ca0294f309b4c3/clearml/datasets/dataset.py#L683
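For instance, a minimal sketch (project/dataset names are placeholders; keep in mind that parts correspond to the uploaded zip chunks, not to logical folders like data_from_source_C):
```
from clearml import Dataset

ds = Dataset.get(dataset_project="my_project", dataset_name="train_data")

# Download only chunk 0 of the dataset's zip archives (read-only cached copy)
path = ds.get_local_copy(part=0)
print(path)
```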
If the data is updated in the same local/network folder structure, which serves as the dataset's single source of truth, you can schedule a script that uses the dataset sync functionality to update the dataset based on the modifications made to the folder.
You can then modify exactly what you need in that structure and get a new, updated dataset version, as in the sketch below.
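A rough sketch of such a scheduled sync script (the project name, dataset name, and source-of-truth path are assumptions):
```
from clearml import Dataset

# Run on a schedule (e.g. cron): pick up whatever changed in the
# source-of-truth folder and publish it as a new dataset version.
def sync_dataset_version():
    latest = Dataset.get(dataset_project="my_project", dataset_name="train_data")
    new_version = Dataset.create(
        dataset_name="train_data",
        dataset_project="my_project",
        parent_datasets=[latest.id],  # new version on top of the latest one
    )
    # Records additions, modifications and removals relative to the parent
    new_version.sync_folder(local_path="/mnt/train_data")
    new_version.upload()    # only the changed files are uploaded
    new_version.finalize()

if __name__ == "__main__":
    sync_dataset_version()
```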
Thank you, that's a good way to handle it. Of course, it would be great to have such functionality built into ClearML; this is the only thing stopping me from deploying.