Hi SubstantialElk6
but in terms of data provenance, its not clear how i can associate the data versions with the processes that created it.
I think DeliciousBluewhale87 ’s approach is what we are aiming for, but with code.
So using clearml-data
from CLI is basically storing/versioning of files (with differentiable based storage etc, but still).
What ou are after (I think) is in your preprocessing code using the programtic Dataset class, to create the Dataset from code, this allows you to both have the storage capabilities and versioning, but also to couple it with the preprocessing code for provenance and automation.
The base assumption is that Dataset is always a Task (with artifacts and fancy interface), but a Task nonetheless, and this gives you all the capabilities of a Task, such as adding metrics/stats on the Data, automation with pipeline etc, but also the ability to later retrieve the data with simple CLI or code.
wdyt?
Hi erez, i think i would want to reference the code that transformed the data. Take for example, i received 10k images, i performed some transformation and save it as a next version before i split it up for my ML training. Some time later, i receive a new set of 10k images and wants to apply the same transformation and then append it to the previous 10k as another version. Clearml-data does well for the data-versioning part, but in terms of data provenance, its not clear how i can associate the data versions with the processes that created it.
Hi Jax, I'm working on a few more examples of how to use clearml-data. should be released in a few weeks (with some other documentation updates). These however don't include the use case you're talking about. Would you care to elaborate more on that? Are you looking to store the code that created the data, in the execution part of the task that saves the data itself?
Hi, Some walk around I thought of.. Btw, I havent tried . AnxiousSeal95 , your comments
1 ) Attach a clearml-task id to each new dataset-id
So in the future, when new data comes in, get the last data commit from the project(Dataset) and get the clearml-task for it. Then clone the clearml-task, and pass in the new data. The only downside, is the need to clone the cleaml-task.
Or alternatively
2) Attach a gitsha-id of the processing code to each new dataset-id.
This can't give the exact code but least , which snapshot of the code base was used.