More clarification on documentation (ClearML Data)

Answered

Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.

This differentiable storage - does it only work on file additions/removal, or also on intra-file changes?
In other words, does it store the delta as "add/remove these files", or also "these lines were added to this CSV file", etc?

  				
Posted 3 years ago by UnevenDolphin73

Answers 10

Yes it would be 🙂
Visualization is always a difficult topic... I'm not sure about that, but a callback would be nice.

One idea that comes to mind (limited to DataFrames, of course): think of a git diff, where I imagine three independent sections:
1. Removed columns (+ truncated preview of removed values)
2. Shared columns (see below)
3. Added columns (+ truncated preview of added values)
The middle section is a bit more complicated, but I would see some kind of "shared columns" dataframe, where each cell that has changed is split into two: the original value (in red?) and the new value (in green?).
New rows would have --- as the original value, and deleted rows would have --- as the new value (or some other value that indicates "does/did not exist").
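To make the three sections concrete, here is a rough sketch in plain Python. It is only an illustration of the idea, not anything ClearML provides: the table representation (row key -> {column -> value}) and the names `diff_tables`, `columns_of` and `MISSING` are all made up for this example. With pandas the same logic would operate on DataFrames instead.

```python
MISSING = "---"  # marker for "does/did not exist"

Row = dict[str, object]
Table = dict[str, Row]  # row key -> {column name -> value}


def columns_of(table: Table) -> set[str]:
    """Collect every column name that appears in any row."""
    return {column for row in table.values() for column in row}


def diff_tables(old: Table, new: Table, preview: int = 3) -> dict[str, dict]:
    """Split a table diff into removed columns, changed shared cells, added columns."""
    old_cols, new_cols = columns_of(old), columns_of(new)

    def column_preview(table: Table, column: str) -> list[object]:
        # Truncated preview: at most `preview` values of the column.
        return [row[column] for row in table.values() if column in row][:preview]

    removed = {c: column_preview(old, c) for c in sorted(old_cols - new_cols)}
    added = {c: column_preview(new, c) for c in sorted(new_cols - old_cols)}

    changed = {}
    for key in sorted(old.keys() | new.keys()):
        for column in sorted(old_cols & new_cols):
            # A row missing on one side shows up as MISSING on that side.
            before = old[key].get(column, MISSING) if key in old else MISSING
            after = new[key].get(column, MISSING) if key in new else MISSING
            if before != after:
                changed[(key, column)] = (before, after)

    return {"removed_columns": removed, "changed_cells": changed, "added_columns": added}
```

Each entry in `changed_cells` holds the (original, new) pair for one cell, which is what a viewer would render as the red/green split.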

  				
Posted 3 years ago by UnevenDolphin73

Parquet file in this instance (used to be CSV, but that was even larger as everything is stored as a string...)

  				
Posted 3 years ago by UnevenDolphin73

Would be great if it is 😍 We have a few large files that change frequently, and saving every version of them in full would be quite a storage hit.

  				
Posted 3 years ago by UnevenDolphin73

Just because it's handy to compare differences and see how the data changed between iterations, but I guess we'll work with that 🙂
We'll probably do something like:
1. When creating a new dataset with a parent (or parents), look at the immediate parents for identically-named files.
2. If those exist, load them with the matching framework (pyarrow, pandas, etc.) and log the differences to the new dataset 🙂
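Roughly, that plan could look like the sketch below. It assumes the parent and the new dataset are both available as local folders (for example local copies of each version) and uses only the standard library, so it handles CSV; the `LOADERS` registry, `load_csv` and `summarize_changes` are hypothetical names, and a pandas or pyarrow Parquet loader would be registered under its own suffix in the same way. Logging the summary to the new dataset is left out.

```python
import csv
from pathlib import Path
from typing import Callable


def load_csv(path: Path) -> list[dict[str, str]]:
    """Stdlib CSV loader; every value is read as a string."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical registry mapping a file suffix to the loader for that format.
LOADERS: dict[str, Callable[[Path], list[dict[str, str]]]] = {".csv": load_csv}


def summarize_changes(parent_root: Path, child_root: Path) -> dict[str, dict[str, int]]:
    """For identically-named files in both folders, report row counts before and after."""
    summary = {}
    for child_file in sorted(child_root.rglob("*")):
        relative = child_file.relative_to(child_root)
        parent_file = parent_root / relative
        loader = LOADERS.get(child_file.suffix.lower())
        if loader is None or not child_file.is_file() or not parent_file.is_file():
            continue  # no known loader, or no identically-named file in the parent
        old_rows, new_rows = loader(parent_file), loader(child_file)
        summary[relative.as_posix()] = {
            "rows_before": len(old_rows),
            "rows_after": len(new_rows),
        }
    return summary
```

Row counts are just the simplest thing to report; the loaded rows could be compared in more detail once both versions are in memory.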

  				
Posted 
	3 years ago

					More  		
  Report
		
					UnevenDolphin73
				
					0
					 × 1

Hi UnevenDolphin73

This differentiable storage - does it only work on file additions/removal, or also on intra-file changes?

This is at the file level, meaning that if you change a single byte in a file, the entire file will be packaged in the new version.
Make sense?
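As a rough illustration of what file-level granularity means (this is not ClearML's actual implementation, just a stdlib sketch with made-up helper names `file_checksums` and `change_set`), comparing two versions by per-file hash gives only "added / removed / modified" at the level of whole files:

```python
import hashlib
from pathlib import Path


def file_checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def change_set(parent: dict[str, str], child: dict[str, str]) -> dict[str, list[str]]:
    """Compare two checksum maps at file granularity."""
    return {
        "added": sorted(child.keys() - parent.keys()),
        "removed": sorted(parent.keys() - child.keys()),
        # Any content change, even a single byte, marks the whole file as modified.
        "modified": sorted(
            name for name in parent.keys() & child.keys() if parent[name] != child[name]
        ),
    }
```

A file listed under "modified" has to be stored again in full, because the hash says that something changed but not what.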

  				
Posted 3 years ago by AgitatedDove14

Right, so this is checksum-based? Are there plans to store only delta changes for files (i.e. store the changed byte instead of the entire file)?

  				
Posted 3 years ago by UnevenDolphin73

What type of file is it?

  				
Posted 3 years ago by CostlyOstrich36

Right so this is checksum based?

correct

Are there plans to only store delta changes for files (i.e. store the changed byte instead of the entire file)?

Long story short, no 😞
Basically, delta changes are not scalable: they work only on text-based files (see git) and break very quickly when large files are involved (see the fun of git-lfs...).
Does that make sense? Is there a specific reason you are thinking about byte granularity?

  				
Posted 3 years ago by AgitatedDove14

That's an interesting question. I'm pretty sure file deltas aren't saved (although you do get file sizes, so you might see changes there).
Let me check whether they are saved somehow, or whether that information can be extrapolated 🙂

  				
Posted 3 years ago by CostlyOstrich36

look at immediate parents for identically-named files
....

UnevenDolphin73, are you saying this will be your way to log the diff between two versions (for increased visibility)?
If so, how would you visualize it?
(I really like this idea of visualizing the changeset. I'm trying to think whether there is a "smart" way to create a callback that would make this approach a kind of best practice.) wdyt?

  				
Posted 3 years ago by AgitatedDove14
