Hm OK 🤔
I am not sure whether it's heresy to say that here, but why wouldn't you use a mechanism comparable to what DVC does in the backend?
When you create a dataset, you could hash the individual files and upload them to a cache. Datasets are then groupings of file hashes. When you want to download a dataset, all you have to do is reproduce the folder structure with the files identified by hashes.
This way, it would not matter whether you recreate a dataset with the same files: they would not be re-uploaded or re-downloaded as long as the hashes match. And partial/full overlaps between datasets would not even have to be defined explicitly.
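Just to illustrate what I mean, here is a rough sketch of that kind of content-addressable cache (this is not ClearML's or DVC's actual API; the function names, cache layout, and manifest format are all made up for the example):

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash of a single file (SHA-256 of its bytes)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def create_dataset(src_dir: Path, cache_dir: Path) -> dict:
    """A dataset is just a manifest mapping relative path -> content hash.
    Each file is stored in the cache under its hash, so identical files
    across dataset versions are stored (and uploaded) only once."""
    manifest = {}
    for path in sorted(src_dir.rglob("*")):
        if not path.is_file():
            continue
        digest = file_hash(path)
        manifest[str(path.relative_to(src_dir))] = digest
        cached = cache_dir / digest
        if not cached.exists():  # dedup: skip files already in the cache
            shutil.copy2(path, cached)
    return manifest


def materialize_dataset(manifest: dict, cache_dir: Path, dst_dir: Path) -> None:
    """Reproduce the original folder structure from the cached blobs."""
    for rel_path, digest in manifest.items():
        target = dst_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cache_dir / digest, target)  # could also hard-link to save space


if __name__ == "__main__":
    cache = Path("cache")
    cache.mkdir(exist_ok=True)
    manifest = create_dataset(Path("my_dataset_v1"), cache)
    Path("dataset_v1.json").write_text(json.dumps(manifest, indent=2))
    materialize_dataset(manifest, cache, Path("restored_v1"))
```

Since a dataset version is just a manifest of hashes, deduplication between overlapping versions falls out of the storage layout for free.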
I know clearml-data has the paradigm "Data is Not Code", and that is fine. You don't need to adopt DVC's check-in workflow and so on, but its caching architecture seems pretty cool to me.