Hi I'M Looking Into How Clearml Supports Datasets And Dataset Versioning And I'M A Bit Confused. Is Dataset Versioning Not Supported At All In The Non-Enterprise Or Is Versioning Available By A Different Mechanism? I See That

Answered

Hi
I'm looking into how clearml supports datasets and dataset versioning and I'm a bit confused.

Is dataset versioning not supported at all in the non-enterprise or is versioning available by a different mechanism? I see that Dataset.create takes parent datasets. Is that a way of making versions? (i.e. create new from parent and add files?)
Is there some usage example code available that shows how tasks can access newer and older versions of a dataset (e.g. before and after additional data was added)?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PanickyMoth78
				
					0
					 × 1

Votes Newest

Answers 7

Hi PanickyMoth78
There is indeed a versioning mechanism available for the open source version 🎉

The datasets keep track of their "genealogy" so you can easily access the version that you need through its ID

In order to create a child dataset, you simply have to use the parameter "parent_datasets" when you create your dataset : have a look at
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetcreate

You also alternatively squash datasets together to create a child version
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetsquash

We are currently creating detailed examples on the open source datasets. They should be available soon 🙂

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					SweetBadger76
				
					0
					 × 1

thanks. Seems like I was on the right path. Do datasets specified as parents need to be https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset ?

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PanickyMoth78
				
					0
					 × 1

PanickyMoth78 , if I'm not mistaken that should be the mechanism. I'll look into that 🙂

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					CostlyOstrich36
				
					0

oops, I deleted two messages here because I had a bug in a test I've done.
I'm retesting now

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PanickyMoth78
				
					0
					 × 1

This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" the steps, I create a new dataset with the old as parent:
To do so, I split a set of image files into separate folders (pets_000, pets_001, ... pets_015), each with 500 image files
I then run the code here to make the datasets.

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PanickyMoth78
				
					0
					 × 1

console output shows uploads of 500 files on every new dataset. The lineage is as expected, each additional upload is the same size as the previous ones (~50mb) and Dataset.get on the last dataset's ID retreives all the files from the separate parts to one local folder.
Checking the remote storage location (gs://) shows artifact zip files, each with 500 files

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PanickyMoth78
				
					0
					 × 1

uploads are a bit slow though (~4 minutes for 50mb)

  				
Posted 
	2 years ago

					More
				  		
  Report
		
					PanickyMoth78
				
					0
					 × 1

Write your answer

1K Views

7 Answers

2 years ago

one year ago