Hm OK 🤔
I am not sure whether it's heresy to say that here, but why wouldn't you use a mechanism comparable to what DVC does in the backend?
When you create a dataset, you could hash the individual files and upload them to a cache. Datasets are then groupings of file hashes. When you want to download a dataset, all you have to do is reproduce the folder structure with the files identified by hashes.
This way, it does not matter whether you recreate a dataset with the same files: they would not be reuploaded/downloaded if the hash is the same. And partial/full overlaps would not even have to be defined explicitly.
I know clearml-data has the paradigm "Data is Not Code" and that is fine. You don't need to adopt the check-in workflow of DVC, but the caching architecture of DVC seems pretty cool to me.
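To make the idea above concrete, here is a minimal sketch of a hash-keyed file cache with datasets as path-to-hash groupings. It is only an illustration of the proposal, not how clearml-data or DVC is implemented: the names `file_hash`, `add_dataset`, `checkout` and `demo` are made up, SHA-256 is an assumed hash choice, and a local folder stands in for the remote cache.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def add_dataset(source: Path, cache: Path) -> tuple[dict[str, str], int]:
    """Store every file under `source` in `cache`, keyed by its hash.

    Returns the manifest (relative path -> hash) and the number of files
    that actually had to be copied into the cache.
    """
    cache.mkdir(parents=True, exist_ok=True)
    manifest, copied = {}, 0
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        digest = file_hash(path)
        manifest[path.relative_to(source).as_posix()] = digest
        if not (cache / digest).exists():  # already cached: skip the "upload"
            shutil.copyfile(path, cache / digest)
            copied += 1
    return manifest, copied


def checkout(manifest: dict[str, str], cache: Path, target: Path) -> None:
    """Recreate the folder structure described by `manifest` under `target`."""
    for relative, digest in manifest.items():
        destination = target / relative
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(cache / digest, destination)


def demo() -> tuple[int, int, bool]:
    """Add the same two files twice in different layouts; return copy counts."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        first, second = root / "a", root / "b" / "nested"
        first.mkdir(parents=True)
        second.mkdir(parents=True)
        for folder in (first, second):
            (folder / "x.csv").write_text("1,2,3\n")
            (folder / "y.csv").write_text("4,5,6\n")
        _, copied_first = add_dataset(first, root / "cache")
        manifest, copied_second = add_dataset(root / "b", root / "cache")
        checkout(manifest, root / "cache", root / "out")
        restored = (root / "out" / "nested" / "x.csv").read_text() == "1,2,3\n"
        return copied_first, copied_second, restored
```

In `demo()`, the second dataset holds the same files in a different folder structure, so nothing new is copied into the cache and the checkout still reproduces the new layout.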
If I checkout/download dataset D on a new machine, it will have to download/extract 15GB worth of data instead of 3GB, right? At least I cannot imagine how you would extract the 3GB of individual files out of zip archives on S3.
Yes, I'm not sure there is an interface to extract only partial files from the zip (although worth checking).
I also remember there is a GitHub issue about uploading a 50GB dataset, and the bottom line is, we should support setting the chunk size, so that we can upload/download smaller chunks of the entire dataset. wdyt?
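On the question of partial extraction: the zip format keeps an index of its members, and Python's standard `zipfile` module can read a single member from any seekable file object without decompressing the others. The sketch below shows this on an in-memory archive only. Doing the same against an archive on S3 would additionally need a seekable wrapper that issues ranged reads, which is an assumption and not shown here; the helper names `build_archive` and `read_members` are made up for illustration and say nothing about what clearml-data does internally.

```python
import io
import zipfile


def build_archive(files: dict[str, bytes]) -> bytes:
    """Pack the given name -> payload mapping into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def read_members(archive_bytes: bytes, wanted: list[str]) -> dict[str, bytes]:
    """Read only the requested members; other members are not decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {name: archive.read(name) for name in wanted}
```

For example, `read_members(build_archive({"a.txt": b"alpha", "b.txt": b"beta"}), ["b.txt"])` touches only `b.txt`.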
Thank you for the hint with Dataset.sync and the explanation AgitatedDove14 🙂
The interfaces look alright. I think we are rather concerned about the performance of a backend implementation detail - but maybe I misunderstood?
When I create a dataset with, say, 5GB of images, it will be uploaded to the server/cloud as one .zip archive. Let's say I now create several 5GB datasets A, B, C and then want to create a new dataset D that inherits 1GB each of A, B, C. If I checkout/download dataset D on a new machine, it will have to download/extract 15GB worth of data instead of 3GB, right? At least I cannot imagine how you would extract the 3GB of individual files out of zip archives on S3.
Hi EagerOtter28
Let's say we query another time and get 60k images. Now it is not trivial to create a new dataset B but only upload the diff: ...
Use Dataset.sync (or clearml-data sync) to check which files were changed/added.
All files are already hashed, right? I wonder why clearml-data does not keep files in a semi-flat hierarchy and groups them together to datasets?
It kind of does: it has a full listing of all the files with their hash (SHA2) values, for all the files in a version (including a reference to the owner version), so it can immediately know which dataset versions it needs to download, and how to link to them.
I think we are missing some interface for you to fully implement your use case, check here:
https://github.com/allegroai/clearml/blob/6a91374c2dd177b7bdf4c43efca8e6fb0d432648/clearml/datasets/dataset.py#L47
and let me know what you think is missing
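A rough sketch of how such a per-version file listing could answer "which dataset versions do I need to download": each file entry carries its hash and the version that owns it, and the set of owners is the download list. The `FileEntry` structure and its field names are hypothetical and chosen for illustration; they are not the actual clearml-data data model.

```python
from typing import NamedTuple


class FileEntry(NamedTuple):
    sha256: str
    owner_version: str  # hypothetical: id of the version that stores the file


def versions_to_download(listing: dict[str, FileEntry]) -> set[str]:
    """Return the dataset versions whose archives hold the listed files."""
    return {entry.owner_version for entry in listing.values()}


# Dataset D inherits one file each from versions A, B and C.
listing_d = {
    "cats/1.png": FileEntry("h1", "A"),
    "dogs/2.png": FileEntry("h2", "B"),
    "birds/3.png": FileEntry("h3", "C"),
}
needed = versions_to_download(listing_d)
```

Here `needed` contains all three parent versions, which is exactly the situation where whole parent archives have to be fetched to obtain a few files.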
Hey Alon, thank you for the quick response! 🙂 This clarifies some points, we also experimented a little more now with it.
Our use-cases are unfortunately not completely covered I guess.
Let's say we have a pool of >300k images and growing. With queries in a database, we identify 80k that should form a dataset. We can create a dataset A and have it stored in the cloud, managed by clearml-data. Let's say we query another time and get 60k images. Now it is not trivial to create a new dataset B but only upload the diff: What we would need to do would be to declare the first dataset as parent, remove all images in A that are not in B and add the new B images. Even if we went through this procedure, the complete dataset A would need to be downloaded (since it is a compressed .zip) to reuse only a fraction of it. This would not scale well I guess.
All files are already hashed, right? I wonder why clearml-data does not keep files in a semi-flat hierarchy and groups them together to datasets? This way, the same file would only be up/downloaded once if the hash checks out, even if the datasets are in no relationship.
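The parent/child procedure described above (drop what is in A but not in B, add what is new in B) boils down to comparing two path-to-hash listings. A minimal sketch, assuming both listings are available as plain dictionaries; `plan_child_dataset` is a made-up helper and does not call any clearml-data API.

```python
def plan_child_dataset(
    parent: dict[str, str], wanted: dict[str, str]
) -> dict[str, list[str]]:
    """Compare two listings that map relative path -> content hash.

    "remove": parent paths that the new dataset does not contain.
    "upload": new paths whose content the parent does not hold yet.
    "reuse":  new paths whose content already exists in the parent.
    """
    parent_hashes = set(parent.values())
    return {
        "remove": sorted(p for p in parent if p not in wanted),
        "upload": sorted(p for p, h in wanted.items() if h not in parent_hashes),
        "reuse": sorted(p for p, h in wanted.items() if h in parent_hashes),
    }


dataset_a = {"img/1.png": "h1", "img/2.png": "h2", "img/3.png": "h3"}
dataset_b = {"img/2.png": "h2", "img/4.png": "h4", "moved/3.png": "h3"}
plan = plan_child_dataset(dataset_a, dataset_b)
```

Because the comparison is done on hashes, a file that only moved to another folder (`moved/3.png`) lands in "reuse" rather than "upload".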
Hi EagerOtter28 ,
The integration with cloud backing worked out of the box so that was a smooth experience so far
Great to read 🙂
When I create a dataset with 10 files and have it uploaded to e.g. S3 and then create a new dataset with the same files in a different folder structure, all files are reuploaded. For a few .csv files, it does not matter, but we have datasets in the 100GB-2TB range.
Any specific reason for uploading the same dataset twice? clearml-data will create a different task with a different zip file for each dataset instance.
If I make a dataset a child of another dataset, will this avoid reuploading?
Yes it should only add the diff files.
Will clearml-data understand that it already holds a local copy of a file if the same file (with the same hash) is part of two datasets?
If it's from two different datasets, clearml-data will download each of them.