
Hi
I'm looking into how ClearML supports datasets and dataset versioning, and I'm a bit confused.

Is dataset versioning not supported at all in the non-enterprise version, or is versioning available through a different mechanism? I see that Dataset.create takes parent datasets. Is that a way of making versions (i.e. create a new dataset from a parent and add files)?
Is there some usage example code available that shows how tasks can access newer and older versions of a dataset (e.g. before and after additional data was added)?

  
  
Posted 2 years ago

Answers 7


Uploads are a bit slow though (~4 minutes for 50 MB)

  
  
Posted 2 years ago

Oops, I deleted two messages here because I had a bug in a test I ran.
I'm retesting now.

  
  
Posted 2 years ago

Hi PanickyMoth78
There is indeed a versioning mechanism available for the open source version 🎉

The datasets keep track of their lineage ("genealogy"), so you can easily access any version you need through its ID

In order to create a child dataset, you simply pass the parent's ID in the "parent_datasets" parameter when you create your dataset: have a look at
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetcreate
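A minimal sketch of that flow (dataset name, project, and folder paths here are illustrative, not from the thread):

```python
from clearml import Dataset

# Version 1: create a dataset and register an initial batch of files.
parent = Dataset.create(dataset_name="pets", dataset_project="demo")
parent.add_files(path="data/batch_000")   # hypothetical local folder
parent.upload()
parent.finalize()

# Version 2: a child dataset inherits the parent's files; only the
# newly added files are uploaded.
child = Dataset.create(
    dataset_name="pets",
    dataset_project="demo",
    parent_datasets=[parent.id],          # this is the versioning link
)
child.add_files(path="data/batch_001")
child.upload()
child.finalize()

# Any task can later pin a specific version by its ID:
old_copy = Dataset.get(dataset_id=parent.id).get_local_copy()
new_copy = Dataset.get(dataset_id=child.id).get_local_copy()
```

Dataset.get on the child's ID returns the full merged content (parent files plus the new batch), so downstream tasks only need to store one ID per version.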

You can alternatively squash datasets together to create a child version
https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetsquash
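Squashing merges the content of several existing versions into a single new, independent dataset. A sketch, with placeholder IDs you would replace with real ones from your workspace:

```python
from clearml import Dataset

# Merge two existing versions into one flattened dataset.
merged = Dataset.squash(
    dataset_name="pets-merged",                     # illustrative name
    dataset_ids=["<id-of-version-1>", "<id-of-version-2>"],
)
print(merged.id)  # ID of the new squashed dataset
```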

We are currently creating detailed examples on the open source datasets. They should be available soon 🙂

  
  
Posted 2 years ago

PanickyMoth78, if I'm not mistaken, that should be the mechanism. I'll look into that 🙂

  
  
Posted 2 years ago

Console output shows uploads of 500 files on every new dataset. The lineage is as expected, each additional upload is the same size as the previous ones (~50 MB), and Dataset.get on the last dataset's ID retrieves all the files from the separate parts into one local folder.
Checking the remote storage location (gs://) shows artifact zip files, each with 500 files

  
  
Posted 2 years ago

This idea seems to work.
I tested this for a scenario where data is periodically added to a dataset and, to "version" the steps, I create a new dataset with the old one as its parent:
To do so, I split a set of image files into separate folders (pets_000, pets_001, ... pets_015), each with 500 image files.
I then ran the code here to make the datasets.
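The loop described above can be sketched roughly like this (the original code isn't shown in the thread; folder and project names follow the description, everything else is an assumption):

```python
def part_dirs(n_parts: int, prefix: str = "pets") -> list[str]:
    """Folder names for the split image set: pets_000 ... pets_015."""
    return [f"{prefix}_{i:03d}" for i in range(n_parts)]


def build_versions(n_parts: int = 16, project: str = "demo") -> list[str]:
    """Create a chain of dataset versions, each new one a child of the last.

    Assumes the pets_* folders exist locally and a ClearML server is
    configured; returns the dataset IDs in creation order.
    """
    from clearml import Dataset

    ids: list[str] = []
    parent_id = None
    for folder in part_dirs(n_parts):
        ds = Dataset.create(
            dataset_name="pets",
            dataset_project=project,
            parent_datasets=[parent_id] if parent_id else None,
        )
        ds.add_files(path=folder)   # only this part's 500 files are uploaded
        ds.upload()
        ds.finalize()
        parent_id = ds.id
        ids.append(ds.id)
    return ids
```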

  
  
Posted 2 years ago

Thanks. Seems like I was on the right path. Do datasets specified as parents need to be finalized? https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk/#finalizing-a-dataset

  
  
Posted 2 years ago