Hi PanickyMoth78
There is indeed a versioning mechanism available for the open source version 🎉
The datasets keep track of their "genealogy" so you can easily access the version that you need through its ID
To create a child dataset, pass the "parent_datasets" parameter when you create your dataset. Have a look at
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetcreate
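For reference, here is a minimal sketch of what that could look like. The helper name `create_child_dataset` and the `dataset_cls` argument are made up for illustration (the latter only lets you try the function without a ClearML server); the `Dataset` calls are the SDK ones from the linked page. It assumes the parent dataset is already finalized.

```python
def create_child_dataset(name, project, parent_ids, folder, dataset_cls=None):
    """Create a dataset version that inherits from parent_ids and adds the files in folder."""
    if dataset_cls is None:
        # Imported lazily so the helper can be exercised without ClearML installed.
        from clearml import Dataset as dataset_cls

    child = dataset_cls.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=parent_ids,  # list of parent dataset IDs
    )
    child.add_files(path=str(folder))  # register the new files
    child.upload()                     # push them to storage
    child.finalize()                   # close this version
    return child.id
```

You would call it with something like `create_child_dataset("pets", "demo", ["<parent dataset id>"], "pets_001")`, where the names and the ID are placeholders.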
You can also squash datasets together to create a child version
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetsquash
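A rough sketch of the squash route, assuming you already have the IDs of the datasets you want to merge. `squash_versions` and `dataset_cls` are hypothetical names used only for this example; `Dataset.squash` is the SDK call from the link above.

```python
def squash_versions(name, dataset_ids, dataset_cls=None):
    """Merge several dataset versions into one new dataset and return its ID."""
    if dataset_cls is None:
        # Imported lazily so the helper can be exercised without ClearML installed.
        from clearml import Dataset as dataset_cls

    squashed = dataset_cls.squash(dataset_name=name, dataset_ids=dataset_ids)
    return squashed.id
```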
We are currently creating detailed examples on the open source datasets. They should be available soon 🙂
Thanks. Seems like I was on the right path. Do datasets specified as parents need to be finalized (https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset)?
PanickyMoth78 , if I'm not mistaken that should be the mechanism. I'll look into that 🙂
Oops, I deleted two messages here because I had a bug in a test I ran.
I'm retesting now
This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" the steps, I create a new dataset with the old one as parent.
To do so, I split a set of image files into separate folders (pets_000, pets_001, ... pets_015), each with 500 image files
I then run the code here to make the datasets.
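The original snippet isn't included in this thread, so the following is only a sketch of how such a chain could be built, not the code that was actually run. The function name, the `dataset_cls` argument and the project/dataset names are placeholders; it assumes one finalized dataset per pets_* folder, each using the previous one as its parent.

```python
from pathlib import Path


def build_incremental_versions(root, project, name, dataset_cls=None):
    """Create one dataset per pets_* folder, each a child of the previous one."""
    if dataset_cls is None:
        # Imported lazily so the helper can be exercised without ClearML installed.
        from clearml import Dataset as dataset_cls

    ids = []
    parent_ids = []  # empty for the first version
    for folder in sorted(Path(root).glob("pets_*")):
        if not folder.is_dir():
            continue
        ds = dataset_cls.create(
            dataset_name=name,
            dataset_project=project,
            parent_datasets=parent_ids or None,
        )
        ds.add_files(path=str(folder))  # only the new folder's files
        ds.upload()
        ds.finalize()  # finalize before using it as the next parent
        ids.append(ds.id)
        parent_ids = [ds.id]
    return ids
```

The returned list holds the dataset IDs in creation order, so the last entry is the version that inherits from all the earlier ones.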
Console output shows uploads of 500 files for every new dataset. The lineage is as expected, each additional upload is the same size as the previous ones (~50 MB), and Dataset.get on the last dataset's ID retrieves all the files from the separate parts into one local folder.
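For anyone following along, fetching the merged files could look roughly like this. It is a sketch that assumes the download goes through `get_local_copy()` on the object returned by `Dataset.get`; `fetch_all_files` and `dataset_cls` are hypothetical names for this example.

```python
def fetch_all_files(dataset_id, dataset_cls=None):
    """Return a local folder containing the dataset's files, including inherited ones."""
    if dataset_cls is None:
        # Imported lazily so the helper can be exercised without ClearML installed.
        from clearml import Dataset as dataset_cls

    ds = dataset_cls.get(dataset_id=dataset_id)
    return ds.get_local_copy()  # path to a read-only cached copy
```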
Checking the remote storage location (gs://) shows artifact zip files, each with 500 files
Uploads are a bit slow though (~4 minutes for 50 MB)