Hi EagerOtter28
Let's say we query another time and get 60k images. Now it is not trivial to create a new dataset B but only upload the diff: ...
Use Dataset.sync_folder (or clearml-data sync) to check which files were changed or added; only those need to be uploaded to the new version.
All files are already hashed, right? I wonder why clearml-data does not keep files in a semi-flat hierarchy and groups them together to datasets?
It kind of does. Each version keeps a full listing of all its files with their hash (SHA2) values, including a reference to the owner version, so it immediately knows which dataset versions it needs to download and how to link to them.
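To make the idea concrete, here is a minimal sketch of a hash-based diff between two file listings. It is not the clearml-data implementation, just an illustration using the standard library; build_manifest and diff_manifests are hypothetical helper names, and SHA-256 is assumed as the SHA2 variant.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's relative POSIX path under root to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB chunks so large images are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def diff_manifests(old, new):
    """Return (added, removed, modified) relative paths between two manifests."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    return added, removed, modified
```

With a manifest stored for the first query result, running build_manifest on the folder after the second query and passing both to diff_manifests gives the files that would need uploading (added plus modified); everything else can stay a reference to the parent version.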
I think we are missing some interface for you to fully implement your use case. Check here:
https://github.com/allegroai/clearml/blob/6a91374c2dd177b7bdf4c43efca8e6fb0d432648/clearml/datasets/dataset.py#L47
and let me know what you think is missing.