Answered

Hi
I'm looking into how ClearML supports datasets and dataset versioning, and I'm a bit confused.

Is dataset versioning not supported at all in the non-enterprise version, or is versioning available through a different mechanism? I see that Dataset.create takes parent datasets. Is that a way of making versions? (i.e. create a new dataset from a parent and add files?)
Is there some example code available that shows how tasks can access newer and older versions of a dataset (e.g. before and after additional data was added)?

  
  
Posted one year ago

Answers 7


Hi PanickyMoth78
There is indeed a versioning mechanism available in the open source version 🎉

The datasets keep track of their "genealogy", so you can easily access the version you need through its ID.

To create a child dataset, you simply pass the parameter "parent_datasets" when you create your dataset: have a look at
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetcreate
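To make that concrete, here is a minimal sketch of creating a child version via `parent_datasets`. The project/dataset names and the helper function are placeholders of mine, not from the thread:

```python
def create_child_version(project: str, name: str, new_files_dir: str) -> str:
    """Create a new dataset version whose parent is the current latest one."""
    from clearml import Dataset  # imported lazily so the sketch stays self-contained

    parent = Dataset.get(dataset_project=project, dataset_name=name)
    child = Dataset.create(
        dataset_name=name,               # same name, new version
        dataset_project=project,
        parent_datasets=[parent.id],     # link back to the previous version
    )
    child.add_files(path=new_files_dir)  # only new/changed files are registered
    child.upload()
    child.finalize()                     # finalize so it can be a parent later
    return child.id
```

Usage would then look like `create_child_version("examples", "pets", "new_images/")`.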

Alternatively, you can also squash datasets together to create a child version:
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetsquash
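A hedged sketch of the squash route (the dataset name and IDs are placeholders); unlike a child created with `parent_datasets`, a squashed dataset merges its sources into one flat version:

```python
def squash_versions(new_name: str, version_ids: list) -> object:
    """Merge several dataset versions into a single new dataset."""
    from clearml import Dataset  # lazy import keeps the sketch self-contained

    return Dataset.squash(dataset_name=new_name, dataset_ids=version_ids)
```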

We are currently creating detailed examples on the open source datasets. They should be available soon 🙂

  
  
Posted one year ago

Thanks, seems like I was on the right path. Do datasets specified as parents need to be finalized? ( https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset )

  
  
Posted one year ago

Oops, I deleted two messages here because I had a bug in a test I'd done.
I'm retesting now.

  
  
Posted one year ago

PanickyMoth78, if I'm not mistaken, that should be the mechanism. I'll look into that 🙂
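If it helps, a quick sketch for guarding against non-finalized parents (assuming `Dataset.is_final()` per the SDK; the helper itself is mine):

```python
def ensure_finalized(dataset_id: str) -> None:
    """Finalize a dataset if needed, so it can safely be used as a parent."""
    from clearml import Dataset  # lazy import keeps the sketch self-contained

    ds = Dataset.get(dataset_id=dataset_id)
    if not ds.is_final():  # parents should be finalized before branching off them
        ds.finalize()
```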

  
  
Posted one year ago

Uploads are a bit slow though (~4 minutes for 50 MB).

  
  
Posted one year ago

Console output shows uploads of 500 files on every new dataset. The lineage is as expected, each additional upload is the same size as the previous ones (~50 MB), and Dataset.get on the last dataset's ID retrieves all the files from the separate parts into one local folder.
Checking the remote storage location (gs://) shows artifact zip files, each with 500 files.
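For reference, the retrieval step described above can be sketched as follows (the dataset ID is a placeholder); fetching a child's local copy pulls in its parents' files too:

```python
def fetch_version(dataset_id: str) -> str:
    """Download one dataset version (parent or child) to a local cache folder."""
    from clearml import Dataset  # lazy import keeps the sketch self-contained

    ds = Dataset.get(dataset_id=dataset_id)
    return ds.get_local_copy()  # read-only cached folder containing all files
```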

  
  
Posted one year ago

This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" the steps, I create a new dataset with the old one as its parent.
To do so, I split a set of image files into separate folders (pets_000, pets_001, ..., pets_015), each with 500 image files.
I then run the code here to make the datasets.
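The test described above could look roughly like this. The folder naming follows the description; the project name and helper function are assumptions of mine:

```python
def build_version_chain(project: str = "examples", name: str = "pets") -> list:
    """Create one dataset version per pets_XXX folder, each a child of the last."""
    from clearml import Dataset  # lazy import keeps the sketch self-contained

    folders = [f"pets_{i:03d}" for i in range(16)]  # pets_000 ... pets_015
    ids = []
    parent_id = None
    for folder in folders:
        ds = Dataset.create(
            dataset_name=name,
            dataset_project=project,
            parent_datasets=[parent_id] if parent_id else None,
        )
        ds.add_files(path=folder)  # each folder holds 500 image files
        ds.upload()
        ds.finalize()              # must finalize before it can be a parent
        parent_id = ds.id
        ids.append(ds.id)
    return ids
```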

  
  
Posted one year ago