Hi @<1543766544847212544:profile|SorePelican79> , ClearML can certainly do that. For this you have the Datasets feature.
This will allow you to version and track your data super easily 🙂
Hi John, thank you. However, I could not find a hint there on how to version tabular data. Our data is essentially a huge data frame where each ground truth data point is a row with a unique id. How can I track in ClearML that this and that row was part of experiment x because it belonged to test/training data set y?
Hi @<1543766544847212544:profile|SorePelican79> , I don't think you can track the data inside the dataset. Maybe @<1523701087100473344:profile|SuccessfulKoala55> might have an idea
How can I track in ClearML that this and that row was part of experiment x because it belonged to test/training data set y?
Hi @<1543766544847212544:profile|SorePelican79>
The experiments themselves will have a link to the Dataset they were using. From a dataset perspective, the idea is not to limit you, so essentially it will package all your files and retrieve them when you fetch the dataset. As for specifying a row / sample, my suggestion is to mark those rows when training, and while training create a new version with those marked rows (or maybe just the rows that you used). This new dataset version will also be linked to the creating Task, so you end up with full provenance and lineage of models/datasets, wdyt?
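A rough sketch of that suggestion, in case it helps. The helper names (`write_split_manifest`, `publish_split_version`), the `split_manifest.csv` file name, and the `row_id`/`split` columns are made up for illustration and are not part of ClearML; only `Dataset.create`, `add_files`, `upload`, and `finalize` are ClearML calls, and they assume a configured ClearML server. The idea is to write down which row id landed in which split, then register that file as a child version of the dataset you trained from.

```python
import csv
from pathlib import Path


def write_split_manifest(splits, out_dir):
    """Write one CSV recording which row id went into which split.

    `splits` maps a split name (e.g. "train") to an iterable of row ids.
    Hypothetical helper, not a ClearML function.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = out_dir / "split_manifest.csv"
    with manifest.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["row_id", "split"])
        for split_name, row_ids in splits.items():
            for row_id in row_ids:
                writer.writerow([row_id, split_name])
    return manifest


def publish_split_version(parent_dataset_id, manifest_dir, project, name):
    """Register the manifest folder as a child version of the parent dataset.

    Hypothetical helper; requires the clearml package and a reachable server.
    """
    from clearml import Dataset  # imported lazily so the manifest step works offline

    child = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_dataset_id],
    )
    child.add_files(path=str(manifest_dir))
    child.upload()
    child.finalize()
    return child.id
```

If you call `publish_split_version` from inside the training script (after `Task.init`), the new version should end up linked to that Task, so you can go from the experiment to the manifest and look up any row id there.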