Hi SubstantialElk6
but in terms of data provenance, it's not clear how I can associate the data versions with the processes that created them.
I think DeliciousBluewhale87's approach is what we are aiming for, but with code.
So using clearml-data from the CLI is basically storing/versioning files (with differential-based storage etc., but still).
What you are after (I think) is to use the programmatic Dataset class in your preprocessing code and create the Dataset from code. This gives you the storage and versioning capabilities, and also couples the data with the preprocessing code for provenance and automation.
The base assumption is that a Dataset is always a Task (with artifacts and a fancy interface), but a Task nonetheless. That gives you all the capabilities of a Task, such as adding metrics/stats on the data and automation with pipelines, plus the ability to later retrieve the data with a simple CLI command or from code.
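Something along these lines is what I have in mind. It is a rough sketch, not tested against your setup: the `preprocess` step, the directory arguments, and the project/dataset names are all made up for illustration, and the `dataset_cls` parameter exists only so the function can be tried without a ClearML server. The `Dataset.create` / `add_files` / `upload` / `finalize` calls are the regular `clearml` ones.

```python
from pathlib import Path


def preprocess(raw_dir: Path, out_dir: Path) -> int:
    """Hypothetical preprocessing step: drop blank lines from every .txt file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    kept = 0
    for src in sorted(raw_dir.glob("*.txt")):
        lines = [ln for ln in src.read_text().splitlines() if ln.strip()]
        (out_dir / src.name).write_text("\n".join(lines) + "\n")
        kept += len(lines)
    return kept


def create_dataset_version(raw_dir, out_dir, dataset_cls=None):
    """Run the preprocessing and register its output as a new dataset version."""
    if dataset_cls is None:
        # Imported lazily so the sketch can be exercised without clearml installed.
        from clearml import Dataset as dataset_cls

    kept = preprocess(Path(raw_dir), Path(out_dir))

    # Names below are placeholders, pick whatever fits your project layout.
    ds = dataset_cls.create(
        dataset_name="preprocessed-text",
        dataset_project="examples/provenance",
    )
    ds.add_files(path=str(out_dir))  # register the processed files
    ds.upload()                      # push the files to storage
    ds.finalize()                    # close this version so it can be consumed
    return ds.id, kept
```

Because the dataset is created from the same script that produced the files, the dataset entry and the code that generated it stay together. Later on, `Dataset.get(dataset_project=..., dataset_name=...).get_local_copy()` or `clearml-data get --id <dataset_id>` should bring the files back.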
wdyt?