Hi, Is There Any Way To Upload Data To A ClearML Dataset Without Compression At All? I Have Very Small Text Files That Make Up A Dataset And Compression Seems To Take Most Of The Upload Time And It Provides Almost No Benefit W.R.T. Size

Answered

Hi,
is there any way to upload data to a ClearML dataset without compression at all? I have very small text files that make up a dataset, and compression seems to take most of the upload time while providing almost no benefit w.r.t. size.

  				
Posted 3 years ago by HugeArcticwolf77


Answers 15

OutrageousSheep60, passing None means the default compression is used. To disable compression you need to pass compression=0

  				
Posted 2 years ago by HugeArcticwolf77

Hi HugeArcticwolf77
I've run the following code, which uploads the files with compression even though compression=None:

ds.upload(show_progress=True, verbose=True, output_url='...', compression=None)
ds.finalize(verbose=True, auto_upload=True)

Any idea why?

  				
Posted 2 years ago by OutrageousSheep60

Just dropping this here but I've had some funky compressions with very small datasets!

Odd deflate behavior ...?!

  				
Posted 3 years ago by AgitatedDove14

sure, that's slightly more elegant. I'll open a PR now

  				
Posted 3 years ago by HugeArcticwolf77

compression=ZIP_DEFLATED if compression is None else compression
wdyt?
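A quick way to compare this expression with the other fix proposed in this thread is to evaluate both for the three relevant inputs. This is only a sketch: first_proposal and none_check are hypothetical helper names, and it assumes the constants are the ones from Python's zipfile module.

```python
from zipfile import ZIP_DEFLATED, ZIP_STORED


def first_proposal(compression=None):
    # Python parses this as:
    # (compression or ZIP_DEFLATED) if compression else ZIP_STORED
    return compression or ZIP_DEFLATED if compression else ZIP_STORED


def none_check(compression=None):
    # Only an explicit None falls back to the default.
    return ZIP_DEFLATED if compression is None else compression


for value in (None, ZIP_STORED, ZIP_DEFLATED):
    print(value, first_proposal(value), none_check(value))
# None 0 8
# 0 0 0
# 8 8 8
```

Both variants honor compression=0, but they differ for None: the first one turns the default into "no compression", while the explicit None check keeps ZIP_DEFLATED as the default.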

  				
Posted 3 years ago by AgitatedDove14

or can I directly open a PR?

Open a direct PR and link to this thread, I will make sure it is passed along 🙂

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 I have a suggested fix. Should I open an issue on GitHub, or can I directly open a PR?
compression=compression or ZIP_DEFLATED if compression else ZIP_STORED

  				
Posted 3 years ago by HugeArcticwolf77

It takes around 3-5 minutes to upload 100-500k plain text files. I just assumed that the added size covers the entire dataset, including metadata, SHA2 hashes, and other data required for dataset functionality
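One possible contributor to the extra size, independent of any ClearML metadata, is the ZIP container itself: every entry carries its own headers, so an archive of very small files can end up larger than the raw data. The sketch below illustrates this with in-memory archives. The file names and contents are made up, and it does not reproduce how ClearML packages a dataset.

```python
import io
import zipfile


def archive_size(files, compression):
    """Return the size in bytes of an in-memory ZIP holding `files`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=compression) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return len(buffer.getvalue())


# 100 tiny "text files" of 12 bytes each (hypothetical sample data).
files = {f"sample_{i:04d}.txt": b"hello world\n" for i in range(100)}

raw_size = sum(len(data) for data in files.values())
stored_size = archive_size(files, zipfile.ZIP_STORED)
deflated_size = archive_size(files, zipfile.ZIP_DEFLATED)

print("raw bytes:     ", raw_size)
print("ZIP_STORED:    ", stored_size)
print("ZIP_DEFLATED:  ", deflated_size)
```

With files this small, both archives come out larger than the raw payload, because the per-entry headers outweigh anything deflate can save.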

  				
Posted 3 years ago by HugeArcticwolf77

Just dropping this here but I've had some funky compressions with very small datasets! It's not a big issue though, since it's still small and doesn't really affect anything

  				
Posted 3 years ago by SmugSnake6

BTW:

I have very small text files that make up a dataset and compression seems to take most of the upload time

How long does it take? And how come it is not smaller in size?

  				
Posted 3 years ago by AgitatedDove14

As a hack you can try DEFAULT_VERSION
(it's just a flag and should basically do Store)
EDIT: sorry that won't work 😞

  				
Posted 3 years ago by AgitatedDove14

HugeArcticwolf77 oh no, I think you are correct 😞
Do you want to quickly PR a fix?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 thanks for the tip. I tried it, but it still compresses the data in the zip file. I suspect this is because ZIP_STORED is a constant that equals 0.

This is problematic due to line 689 in dataset.py ( https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L689 ):
compression=compression or ZIP_DEFLATED

Since 0 is falsy in Python, compression=0 falls through the "or" to ZIP_DEFLATED (which equals 8), so passing compression=0 still causes the data to be compressed.
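The behaviour described here can be reproduced with the standard library alone. The sketch below assumes the quoted expression is the only place where the value is resolved; resolve_with_or is a hypothetical helper name, not a ClearML function.

```python
from zipfile import ZIP_DEFLATED, ZIP_STORED


def resolve_with_or(compression=None):
    # Mirrors the expression quoted from dataset.py: any falsy value,
    # including ZIP_STORED (0), falls through to ZIP_DEFLATED.
    return compression or ZIP_DEFLATED


print(ZIP_STORED, ZIP_DEFLATED)     # 0 8
print(resolve_with_or(None))        # 8
print(resolve_with_or(ZIP_STORED))  # 8, the request for "no compression" is lost
```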

  				
Posted 3 years ago by HugeArcticwolf77

HugeArcticwolf77 from the CLI you cannot control it (but we could probably add that), from code you can:
https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L646
pass compression=ZIP_STORED
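For reference, ZIP_STORED is the constant from Python's zipfile module that means entries are written without compression. The sketch below shows that behaviour with the standard library only. It assumes, based on the constants named in this thread, that ClearML hands the value on to zipfile; the entry name and content are made up.

```python
import io
from zipfile import ZIP_STORED, ZipFile

buffer = io.BytesIO()

# Write one small text entry without compression.
with ZipFile(buffer, "w", compression=ZIP_STORED) as archive:
    archive.writestr("notes.txt", "a small text file\n" * 10)

# Read the entry metadata back to confirm how it was stored.
with ZipFile(buffer) as archive:
    info = archive.getinfo("notes.txt")

print(info.compress_type == ZIP_STORED)      # True
print(info.compress_size == info.file_size)  # True: bytes stored as-is
```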

  				
Posted 3 years ago by AgitatedDove14

The default compression parameter value is ZIP_MINIMAL_COMPRESSION. I guess you could check whether there is a tarball-only option, but in any case most of the CPU time taken by the upload process goes to generating the hashes of the file entries
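A rough way to check where the CPU time goes on a given machine is to time hashing and compression separately on sample data. This is only a sketch under stated assumptions: it uses SHA-256 from hashlib as a stand-in for the hashing mentioned above, zlib.compress as a stand-in for deflate, made-up in-memory payloads, and a hypothetical helper name. It ignores disk and network I/O, so it does not measure an actual ClearML upload.

```python
import hashlib
import time
import zlib


def time_hash_and_deflate(payloads):
    """Return seconds spent hashing vs. deflating the given byte strings."""
    start = time.perf_counter()
    for data in payloads:
        hashlib.sha256(data).hexdigest()
    hash_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for data in payloads:
        zlib.compress(data)
    deflate_seconds = time.perf_counter() - start

    return {"sha256": hash_seconds, "deflate": deflate_seconds}


# 2000 small text payloads (hypothetical sample data).
payloads = [f"line {i}: some short text\n".encode() * 20 for i in range(2000)]
timings = time_hash_and_deflate(payloads)
print(timings)
```

The numbers vary by machine and by data, so the printed timings are best read as a relative comparison only.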

  				
Posted 3 years ago by FierceHamster54
