Answered
More clarification on documentation (ClearML Data):

Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.

This differentiable storage - does it only work on file additions/removal, or also on intra-file changes?
In other words, does it store the delta as "add/remove these files", or also "these lines were added to this CSV file", etc?

  
  
Posted 2 years ago

Answers 10


Yes it would be 🙂
Visualization is always a difficult topic... I'm not sure about that, but a callback would be nice.

One idea that comes to mind (this is of course limited to DataFrames) is something like a git diff, where I imagine three independent sections:
  1. Removed columns (+ truncated preview of removed values)
  2. Shared columns (see below)
  3. Added columns (+ truncated preview of added values)
The middle section is a bit more complicated, but I would see some kind of "shared columns" dataframe, where each cell that has changed is split into two: the original value (in red?) and the new value (in green?).
New rows would have --- as the original value, and deleted rows would have --- as the new value (or some other value that indicates "does/did not exist").
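
A rough sketch of that three-section diff, using plain dictionaries as a stand-in for DataFrames so it runs without extra dependencies. The `{row_id: {column: value}}` layout, the `diff_tables` name and the `MISSING` constant are assumptions made for illustration, and the truncated previews are left out:

```python
MISSING = "---"  # placeholder for a row that does/did not exist


def diff_tables(old, new):
    """Diff two tables given as {row_id: {column: value}} mappings."""
    old_cols = {col for row in old.values() for col in row}
    new_cols = {col for row in new.values() for col in row}
    shared = sorted(old_cols & new_cols)

    # Only cells in shared columns whose value changed are recorded,
    # each as an (original value, new value) pair.
    changed = {}
    for row_id in sorted(old.keys() | new.keys()):
        for col in shared:
            before = old[row_id].get(col, MISSING) if row_id in old else MISSING
            after = new[row_id].get(col, MISSING) if row_id in new else MISSING
            if before != after:
                changed[(row_id, col)] = (before, after)

    return {
        "removed_columns": sorted(old_cols - new_cols),
        "added_columns": sorted(new_cols - old_cols),
        "changed_cells": changed,
    }
```

The `changed_cells` pairs are what a visualization could render as red/green split cells; how to display them is a separate question.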

  
  
Posted 2 years ago

Hi UnevenDolphin73

This differentiable storage - does it only work on file additions/removal, or also on intra-file changes?

This is on a file level, meaning that if you change a single byte in a file, the entire file will be packaged in the new version.
Makes sense?
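
To illustrate what file-level tracking implies, here is a minimal Python sketch. It is not ClearML's implementation: the function names `file_checksums` and `file_level_changeset` and the use of SHA-256 are assumptions chosen for the example.

```python
import hashlib
from pathlib import Path


def file_checksums(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def file_level_changeset(parent, child):
    """Compare two {path: checksum} maps at whole-file granularity."""
    return {
        "added": sorted(child.keys() - parent.keys()),
        "removed": sorted(parent.keys() - child.keys()),
        # Any checksum mismatch marks the whole file as changed,
        # no matter how few bytes actually differ.
        "modified": sorted(
            name
            for name in parent.keys() & child.keys()
            if parent[name] != child[name]
        ),
    }
```

With this kind of scheme, every entry under "added" or "modified" would be stored in full in the new version, which is why a one-byte edit costs the whole file.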

  
  
Posted 2 years ago

Right so this is checksum based? Are there plans to only store delta changes for files (i.e. store the changed byte instead of the entire file)?

  
  
Posted 2 years ago

Right so this is checksum based?

correct

Are there plans to only store delta changes for files (i.e. store the changed byte instead of the entire file)?

Long story short, no 😞
Basically, delta changes are not scalable. They work only on text-based files (see git) and break very quickly when large files are involved (see the fun of git-lfs...).
Does that make sense? Is there a specific reason you are thinking about byte granularity?

  
  
Posted 2 years ago

That's an interesting question. I'm pretty sure file deltas aren't saved (although you do get file sizes, so you might see changes there).
Let me check if maybe they are saved somehow or if that information can be extrapolated somehow 🙂

  
  
Posted 2 years ago

  1. look at immediate parents for identically-named files
    ....

UnevenDolphin73 are you saying this will be your way to log the diff between two versions (for increased visibility)?
If so, how would you visualize it?
(I really like this idea of visualizing the changeset. I'm trying to think if there is a "smart" way to create a callback that would make this approach a kind of best practice.) wdyt?

  
  
Posted 2 years ago

Parquet file in this instance (used to be CSV, but that was even larger as everything is stored as a string...)

  
  
Posted 2 years ago

What type of file is it?

  
  
Posted 2 years ago

Would be great if it is 😍 We have a few files that change frequently and are quite large, and it would be quite a storage hit to save all of them

  
  
Posted 2 years ago

Just because it's handy to compare differences and see how the data changed between iterations, but I guess we'll work with that 🙂
We'll probably do something like:
When creating a new dataset with a parent (or parents):
  1. look at immediate parents for identically-named files
  2. If those exist, load those with the matching framework (pyarrow, pandas, etc.), and log differences to the new dataset 🙂
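
The matching step could be sketched roughly as below. It only decides which files to compare and with which framework; the actual loading and logging is left out. The `files_to_compare` helper, the `LOADER_BY_SUFFIX` table and the shape of the inputs (plain file listings per parent, however they are obtained) are all hypothetical:

```python
from pathlib import PurePosixPath

# Hypothetical suffix-to-framework table; adjust to the formats actually in use.
LOADER_BY_SUFFIX = {".parquet": "pyarrow", ".csv": "pandas"}


def files_to_compare(new_files, parent_files_by_id):
    """Return (file, parent_id, framework) for files that also exist in a parent."""
    matches = []
    for name in sorted(new_files):
        framework = LOADER_BY_SUFFIX.get(PurePosixPath(name).suffix.lower())
        if framework is None:
            continue  # no known loader, so no content-level diff
        for parent_id, parent_files in parent_files_by_id.items():
            if name in parent_files:
                matches.append((name, parent_id, framework))
    return matches
```

Each returned tuple is one pair of files to load and diff before logging the result to the new dataset.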

  
  
Posted 2 years ago