ClearML FAQ | Clearml-Data - Incremental Changes And Hashing On Per-File Basis?

Answered

Hi! In our team, we are now also looking at clearml-data. The integration with cloud backing worked out of the box, so that was a smooth experience so far 🙂 We do have some pain points, though, and I wanted to know whether we are using it the wrong way or whether those things are simply not possible:
When I create a dataset with 10 files and have it uploaded to e.g. S3 and then create a new dataset with the same files in a different folder structure, all files are reuploaded 🤔 For a few .csv files, it does not matter, but we have datasets in the 100GB-2TB range. If I make a dataset a child of another dataset, will this avoid reuploading? Will clearml-data understand that it already holds a local copy of a file if the same file (with the same hash) is part of two datasets?

  				
Posted 4 years ago by EagerOtter28


Answers 6

Hi EagerOtter28 ,

The integration with cloud backing worked out of the box so that was a smooth experience so far

Great to read 🙂

When I create a dataset with 10 files and have it uploaded to e.g. S3 and then create a new dataset with the same files in a different folder structure, all files are reuploaded

For a few .csv files, it does not matter, but we have datasets in the 100GB-2TB range.

Any specific reason for uploading the same dataset twice? clearml-data creates a separate task, with its own zip file, for each dataset instance.

If I make a dataset a child of another dataset, will this avoid reuploading?

Yes, a child dataset should only add the files that differ from its parent.
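A minimal sketch of that parent/child flow with the ClearML Python SDK, assuming a configured ClearML server. The function name and all arguments are placeholders, and whether unchanged files are skipped depends on how they are matched against the parent, so treat this as an illustration rather than a guarantee:

```python
def create_child_dataset(parent_name, child_name, project, local_folder):
    """Create a dataset version that inherits from an existing one.

    Assumes a configured ClearML server; all names and paths are
    placeholders supplied by the caller.
    """
    # Imported lazily so this file can be loaded without ClearML configured.
    from clearml import Dataset

    parent = Dataset.get(dataset_project=project, dataset_name=parent_name)
    child = Dataset.create(
        dataset_name=child_name,
        dataset_project=project,
        parent_datasets=[parent.id],
    )
    # Register the folder contents; per the answer above, only files that
    # differ from the parent should end up in the child's upload.
    child.add_files(path=local_folder)
    child.upload()
    child.finalize()
    return child.id
```

Calling `create_child_dataset("A", "B", "my_project", "./data_b")` would then register dataset B as a child of A (all four values are hypothetical).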

Will clearml-data understand that it already holds a local copy of a file if the same file (with the same hash) is part of two datasets?

If it is part of two different datasets, clearml-data will download each of them.

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Hey Alon, thank you for the quick response! 🙂 This clarifies some points, we also experimented a little more now with it.

Our use-cases are unfortunately not completely covered I guess.
Let's say we have a pool of >300k images and growing. With queries in a database, we identify 80k that should form a dataset. We can create a dataset A and have it stored in the cloud, managed by clearml-data. Let's say we query another time and get 60k images. Now it is not trivial to create a new dataset B and upload only the diff: we would need to declare the first dataset as parent, remove all images in A that are not in B, and add the new B images. Even if we went through this procedure, the complete dataset A would need to be downloaded (since it is a compressed .zip) to reuse only a fraction of it. This would not scale well I guess.
All files are already hashed, right? I wonder why clearml-data does not keep files in a semi-flat hierarchy and group them together into datasets? This way, the same file would only be uploaded/downloaded once if the hash checks out, even if the datasets are not related to each other.

  				
Posted 4 years ago by EagerOtter28

Hi EagerOtter28

Let's say we query another time and get 60k images. Now it is not trivial to create a new dataset B but only upload the diff: ...

Use Dataset.sync (or clearml-data sync) to check which files were changed/added.
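To make the idea concrete, here is a rough sketch of what a hash-based sync has to compute. This is not ClearML's implementation; `hash_file`, `diff_folder`, and the `{relative_path: hash}` manifest layout are assumptions made for illustration only:

```python
import hashlib
from pathlib import Path


def hash_file(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def diff_folder(manifest, folder):
    """Compare a {relative_path: hash} manifest with a local folder.

    Returns (added, modified, removed) as sorted lists of relative paths.
    """
    root = Path(folder)
    current = {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    added = sorted(set(current) - set(manifest))
    removed = sorted(set(manifest) - set(current))
    modified = sorted(
        name for name in set(current) & set(manifest)
        if current[name] != manifest[name]
    )
    return added, modified, removed
```

In the 80k/60k scenario, `manifest` would describe dataset A and `folder` would hold the 60k images selected for B; only the `added` and `modified` files would need uploading.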

All files are already hashed, right? I wonder why clearml-data does not keep files in a semi-flat hierarchy and groups them together to datasets?

It kind of does: it has a full listing of all the files in a version with their hash (SHA2) values, including a reference to the owner version, so it can immediately know which dataset versions it needs to download and how to link to them.
I think we are missing some interface for you to fully implement your use case, check here:
https://github.com/allegroai/clearml/blob/6a91374c2dd177b7bdf4c43efca8e6fb0d432648/clearml/datasets/dataset.py#L47
and let me know what you think is missing
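As an illustration of the file listing described above, the sketch below models each entry as a path, a hash, and the version that owns the file, and derives which versions must be fetched. The `FileEntry` class, its field names, and the sample values are all hypothetical and do not mirror the actual ClearML data structures:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    relative_path: str
    sha2: str
    owner_version: str  # dataset version whose archive holds the file


def versions_to_download(entries):
    """Return the set of dataset versions needed to materialize all entries."""
    return {entry.owner_version for entry in entries}


# Made-up listing: two files owned by one version, one by another.
listing = [
    FileEntry("images/cat.jpg", "ab12", "dataset-A"),
    FileEntry("images/dog.jpg", "cd34", "dataset-A"),
    FileEntry("images/owl.jpg", "ef56", "dataset-B"),
]
needed = versions_to_download(listing)
```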

  				
Posted 4 years ago by AgitatedDove14

Thank you for the hint with Dataset.sync and the explanation AgitatedDove14 🙂
The interfaces look alright. I think we are rather concerned about the performance of a backend implementation detail - but maybe I misunderstood?
When I create a dataset with say 5GB of images, it will be uploaded to the server/cloud as one .zip archive. Let's say I now create several 5GB datasets A, B, C and then want to create a new dataset D that inherits 1GB each of A, B, C. If I checkout/download dataset D on a new machine, it will have to download/extract 15GB worth of data instead of 3GB, right? At least I cannot imagine how you would extract the 3GB of individual files out of zip archives on S3.
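The arithmetic behind that concern, as a small sketch. It assumes each parent is stored as a single archive that must be fetched in full if any of its files is needed; `download_cost_gb` is a hypothetical helper, not part of clearml-data:

```python
def download_cost_gb(archives):
    """archives maps a parent name to (archive_size_gb, needed_gb).

    Returns (whole_archive_total, per_file_total) in GB.
    """
    # Whole-archive storage: any needed file forces the full archive download.
    whole_archive_total = sum(
        size for size, needed in archives.values() if needed > 0
    )
    # Per-file storage: only the bytes actually needed are transferred.
    per_file_total = sum(needed for _, needed in archives.values())
    return whole_archive_total, per_file_total


parents = {"A": (5, 1), "B": (5, 1), "C": (5, 1)}
whole, per_file = download_cost_gb(parents)
print(whole, per_file)  # 15 3
```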

  				
Posted 4 years ago by EagerOtter28

If I checkout/download dataset D on a new machine, it will have to download/extract 15GB worth of data instead of 3GB, right? At least I cannot imagine how you would extract the 3GB of individual files out of zip archives on S3.

Yes, I'm not sure there is an interface to extract only partial files from the zip (although worth checking).
I also remember there is a GitHub issue about uploading a 50GB dataset, and the bottom line is that we should support setting a chunk size, so that we can upload/download smaller chunks of the entire dataset. wdyt?
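One way such chunking could work, sketched as a greedy grouping of files into size-bounded chunks. This is only an illustration of the proposal, not how clearml-data implements it; `split_into_chunks` and its parameters are hypothetical:

```python
def split_into_chunks(file_sizes, max_chunk_bytes):
    """Greedily group files into chunks of at most max_chunk_bytes.

    file_sizes maps a relative path to its size in bytes. A file larger
    than the limit ends up in a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Close the current chunk if this file would push it over the limit.
        if current and current_size + size > max_chunk_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

With chunks like these, a consumer that needs only a few files would download just the chunks containing them rather than the whole dataset archive.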

  				
Posted 4 years ago by AgitatedDove14

Hm OK 🤔
I am not sure whether it's heresy to say that here, but why wouldn't you use a mechanism comparable to what DVC does in the backend?

When you create a dataset, you could hash the individual files and upload them to a cache. Datasets are then groupings of file hashes. When you want to download a dataset, all you have to do is reproduce the folder structure with the files identified by hashes.

This way, it does not matter whether you recreate a dataset with the same files, they would not be reuploaded/downloaded if the hash is the same. And partial/full overlaps would not even have to be defined explicitly.
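A minimal sketch of the content-addressed idea described above, using a local directory as the cache. It is neither DVC's nor ClearML's implementation; `store_file`, `create_dataset`, and `materialize` are hypothetical names, and whole files are read into memory for brevity:

```python
import hashlib
import shutil
from pathlib import Path


def store_file(cache_dir, source):
    """Copy source into the cache under its SHA-256 digest; return the digest."""
    digest = hashlib.sha256(Path(source).read_bytes()).hexdigest()
    target = Path(cache_dir) / digest
    if not target.exists():  # identical content is stored only once
        Path(cache_dir).mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return digest


def create_dataset(cache_dir, folder):
    """Return a {relative_path: digest} manifest for every file in folder."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): store_file(cache_dir, p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def materialize(cache_dir, manifest, destination):
    """Rebuild the folder structure described by manifest from the cache."""
    for relative_path, digest in manifest.items():
        target = Path(destination) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(Path(cache_dir) / digest, target)
```

Two datasets that contain the same file under different paths would share one cache entry, so the overlap never has to be declared explicitly.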

I know clearml-data has the paradigm "Data is Not Code" and that is fine. You don't need to take the checking in etc. of DVC but the caching architecture of DVC seems pretty cool to me.

  				
Posted 4 years ago by EagerOtter28
