Yes it would be 🙂
Visualization is always a difficult topic... I'm not sure about that, but a callback would be nice.
One idea that comes to mind (limited to DataFrames, of course): think of a git diff, where I imagine 3 independent sections:
- Removed columns (+ truncated preview of removed values)
- (see below)
- Added columns (+ truncated preview of added values)
The middle column is then a bit complicated, but I would see some kind of "shared columns" dataframe, where each cell (that has changed) would be split into two - one original value (in red?) and one new value (in green?)
New rows would have --- as original value, deleted rows would have --- as new value (or some value that indicates "does/did not exist")
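A rough sketch of the three-section diff described above. To stay dependency-free it models a table as plain dicts instead of a DataFrame, so `Table`, `diff_tables`, `columns_of` and the `preview` parameter are all hypothetical names rather than any library's API. Coloring (red/green) is left out; each changed cell is just reported as an (original, new) pair.

```python
from typing import Any, Dict, List

MISSING = "---"  # placeholder meaning "does/did not exist"

# Assumption: a table is {row_key: {column_name: value}}, standing in for a DataFrame.
Table = Dict[Any, Dict[str, Any]]


def columns_of(table: Table) -> List[str]:
    """Return the column names of a table in first-seen order."""
    seen: Dict[str, None] = {}
    for row in table.values():
        for name in row:
            seen.setdefault(name, None)
    return list(seen)


def diff_tables(old: Table, new: Table, preview: int = 3) -> Dict[str, Any]:
    """Split the diff into removed columns, added columns and changed shared cells."""
    old_cols, new_cols = columns_of(old), columns_of(new)
    removed = [c for c in old_cols if c not in new_cols]
    added = [c for c in new_cols if c not in old_cols]
    shared = [c for c in old_cols if c in new_cols]

    def head(table: Table, column: str) -> List[Any]:
        # Truncated preview of a column's values.
        return [row[column] for row in table.values() if column in row][:preview]

    changed = {}
    row_keys = list(old) + [k for k in new if k not in old]
    for key in row_keys:
        for col in shared:
            # Rows missing on one side get the placeholder on that side.
            before = old[key].get(col, MISSING) if key in old else MISSING
            after = new[key].get(col, MISSING) if key in new else MISSING
            if before != after:
                changed[(key, col)] = (before, after)

    return {
        "removed_columns": {c: head(old, c) for c in removed},
        "added_columns": {c: head(new, c) for c in added},
        "changed_cells": changed,
    }


old = {
    1: {"name": "a", "score": 1.0, "legacy": "x"},
    2: {"name": "b", "score": 2.0, "legacy": "y"},
}
new = {
    1: {"name": "a", "score": 1.5, "tag": "t1"},
    3: {"name": "c", "score": 3.0, "tag": "t3"},
}
report = diff_tables(old, new)
```

Here row 2 was deleted, so its shared cells show --- as the new value, and row 3 is new, so its cells show --- as the original value.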
Hi UnevenDolphin73
This differentiable storage - does it only work on file additions/removal, or also on intra-file changes?
This is on a file level, meaning you change a single byte in the file, the entire file will be packaged in the new version.
Does that make sense?
Right so this is checksum based? Are there plans to only store delta changes for files (i.e. store the changed byte instead of the entire file)?
Right so this is checksum based?
correct
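To make the file-level behaviour concrete, here is a minimal sketch of checksum-based change detection. It is an illustration only, not the actual implementation: the choice of SHA-256, the in-memory `bytes` values and the helper names `checksum` and `files_to_store` are all assumptions.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> str:
    """Hash the full file content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def files_to_store(parent: Dict[str, bytes], current: Dict[str, bytes]) -> List[str]:
    """Names of files that are new or whose checksum differs from the parent version."""
    parent_sums = {name: checksum(data) for name, data in parent.items()}
    return sorted(
        name
        for name, data in current.items()
        if parent_sums.get(name) != checksum(data)
    )


parent = {"a.parquet": b"x" * 1000, "b.csv": b"hello"}
current = {
    "a.parquet": b"x" * 999 + b"y",  # a single byte changed
    "b.csv": b"hello",               # unchanged, so not stored again
    "c.txt": b"new",                 # added file
}

changed = files_to_store(parent, current)
# Whole files are stored, not byte deltas.
stored_bytes = sum(len(current[name]) for name in changed)
```

Even though only one byte of `a.parquet` differs, all 1000 bytes of it count toward the new version, plus the 3 bytes of the added file.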
Are there plans to only store delta changes for files (i.e. store the changed byte instead of the entire file)?
Long story short, no 😞
Basically, delta changes are not scalable. They work only for text-based files (see git) and break very quickly when large files are involved (see the fun of git-lfs ...)
Does that make sense? Is there a specific reason you are thinking about byte granularity?
That's an interesting question. I'm pretty sure file deltas aren't saved (although you do get file sizes, so you might see changes there).
Let me check if maybe they are saved somehow or if that information can be extrapolated somehow 🙂
UnevenDolphin73 are you saying this will be your way to log the diff between two versions (for increased visibility)?
If so, how would you visualize it?
(I really like this idea of visualizing the changeset. I'm trying to think if there is a "smart" way to create a callback that makes the approach kind of a best practice.) wdyt?
Parquet file in this instance (used to be CSV, but that was even larger as everything is stored as a string...)
Would be great if it is 😍 We have a few files that change frequently and are quite large, and it would be quite a storage hit to save all of them
Just because it's handy to compare differences and see how the data changed between iterations, but I guess we'll work with that 🙂
We'll probably do something like:
- When creating a new dataset with a parent (or parents), look at immediate parents for identically-named files
- If those exist, load those with matching framework (pyarrow, pandas, etc), and log differences to the new dataset 🙂
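The two steps above could be sketched roughly as follows. This is framework-agnostic and assumption-heavy: datasets are modelled as plain `{file_name: bytes}` dicts, `LOADERS` is a hypothetical registry with only a stdlib CSV loader (a pyarrow or pandas loader would be registered the same way), and the differences are returned as messages instead of being reported through any real logging API.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


def load_csv(raw: bytes) -> Rows:
    """Parse CSV bytes into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


# Hypothetical registry: file suffix -> loader for the matching framework.
LOADERS: Dict[str, Callable[[bytes], Rows]] = {".csv": load_csv}


def collect_differences(
    parents: Dict[str, Dict[str, bytes]], new_files: Dict[str, bytes]
) -> List[str]:
    """Compare identically-named files between each immediate parent and the new dataset."""
    messages = []
    for parent_name, parent_files in parents.items():
        for name in sorted(set(parent_files) & set(new_files)):
            loader = LOADERS.get(PurePosixPath(name).suffix.lower())
            if loader is None:
                continue  # no matching framework for this file type
            old_rows = loader(parent_files[name])
            new_rows = loader(new_files[name])
            if old_rows != new_rows:
                messages.append(
                    f"{name} differs from parent {parent_name!r} "
                    f"({len(old_rows)} rows before, {len(new_rows)} rows now)"
                )
    return messages


parents = {"v1": {"data.csv": b"id,val\n1,a\n2,b\n", "notes.txt": b"hi"}}
new_files = {"data.csv": b"id,val\n1,a\n2,c\n3,d\n", "notes.txt": b"changed"}
messages = collect_differences(parents, new_files)
```

In this example `notes.txt` is skipped because no loader is registered for its suffix, so only `data.csv` produces a message.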