uploads are a bit slow though (~4 minutes for ~50 MB)
console output shows an upload of 500 files for every new dataset. The lineage is as expected: each additional upload is the same size as the previous ones (~50 MB), and Dataset.get
on the last dataset's ID retrieves all the files from the separate parts into one local folder.
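Roughly, the retrieval step looks like this (the dataset ID and paths below are placeholders, not the exact values from my run):
```python
from clearml import Dataset

# placeholder: ID of the newest dataset in the lineage
last_dataset_id = "<LAST_DATASET_ID>"

# Dataset.get follows the parent chain, so the local copy ends up
# containing the files from all earlier versions plus the new ones
dataset = Dataset.get(dataset_id=last_dataset_id)
local_folder = dataset.get_local_copy()
print(f"all files retrieved to: {local_folder}")
```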
Checking the remote storage location (gs://) shows artifact zip files, each containing 500 files.
This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" each step, a new dataset is created with the old one as its parent:
To do so, I split a set of image files into separate folders (pets_000, pets_001, ... pets_015), each containing 500 image files.
I then ran the code here to create the datasets.
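For context, the loop is roughly along these lines (project and dataset names are placeholders, not the exact script I ran):
```python
from clearml import Dataset

project = "pets-versioning"   # placeholder project name
parent_id = None              # the first dataset has no parent

# each folder pets_000 .. pets_015 becomes a new dataset version
# whose parent is the previous version
for i in range(16):
    folder = f"pets_{i:03d}"
    ds = Dataset.create(
        dataset_name=f"pets_{i:03d}",
        dataset_project=project,
        parent_datasets=[parent_id] if parent_id else None,
    )
    ds.add_files(path=folder)  # only this folder's 500 new files
    ds.upload()                # pushes the new files to the storage target
    ds.finalize()              # close the version so it can serve as a parent
    parent_id = ds.id
```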
oops, I deleted two messages here because I had a bug in a test I'd done.
I'm retesting now
PanickyMoth78, if I'm not mistaken that should be the mechanism. I'll look into that 🙂
thanks. Seems like I was on the right path. Do datasets specified as parents need to be finalized ( https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset )?
Hi PanickyMoth78
There is indeed a versioning mechanism available for the open source version 🎉
The datasets keep track of their "genealogy" so you can easily access the version that you need through its ID
In order to create a child dataset, you simply have to use the "parent_datasets" parameter when you create your dataset; have a look at
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetcreate
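For example, something along these lines (IDs and names below are placeholders):
```python
from clearml import Dataset

# create a child version that inherits the parent's content;
# only the newly added files need to be uploaded
child = Dataset.create(
    dataset_name="my_dataset_v2",
    dataset_project="my_project",
    parent_datasets=["<PARENT_DATASET_ID>"],
)
child.add_files(path="new_files/")
child.upload()
child.finalize()
```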
You can alternatively squash datasets together to create a child version:
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetsquash
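A rough sketch of a squash call (the name and IDs are placeholders):
```python
from clearml import Dataset

# merge several dataset versions into a single new dataset
squashed = Dataset.squash(
    dataset_name="pets_squashed",
    dataset_ids=["<DATASET_ID_A>", "<DATASET_ID_B>"],
)
print(squashed.id)
```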
We are currently creating detailed examples on the open source datasets. They should be available soon 🙂