Answered
Hi, is there any way to upload data to a ClearML dataset without compression at all? I have very small text files that make up a dataset, and compression seems to take most of the upload time while providing almost no benefit w.r.t. size

Hi,
is there any way to upload data to a ClearML dataset without compression at all? I have very small text files that make up a dataset, and compression seems to take most of the upload time while providing almost no benefit w.r.t. size.

  
  
Posted 2 years ago
Votes Newest

Answers 15


AgitatedDove14 thanks for the tip. I tried it, but it still compresses the data to zip format. I suspect this is because ZIP_STORED is a constant that equals 0.

This is problematic due to line 689 in dataset.py ( https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L689 ):
compression=compression or ZIP_DEFLATED

Since ZIP_DEFLATED is 8 and 0 is falsy, passing compression=0 still causes the data to be compressed.
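The trap is plain Python truthiness; because `ZIP_STORED` is `0`, the `or` fallback always wins:

```python
from zipfile import ZIP_DEFLATED, ZIP_STORED

# ZIP_STORED is 0 (falsy), so `compression or ZIP_DEFLATED`
# silently replaces an explicit "no compression" with deflate
compression = ZIP_STORED
effective = compression or ZIP_DEFLATED
print(ZIP_STORED, ZIP_DEFLATED, effective)  # 0 8 8
```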

  
  
Posted 2 years ago

The default compression parameter value is ZIP_MINIMAL_COMPRESSION. I guess you could check whether there is a tarball-only option, but in any case most of the CPU time taken by the upload process goes into generating the hashes of the file entries.
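To illustrate the point about hashing, here is a rough stdlib-only sketch of hashing many small file entries; the file count and sizes are made-up assumptions, not ClearML internals:

```python
import hashlib
import time

# Simulate many small in-memory text "files" (~1.5 KB each)
files = [(f"file-{i}: some plain text content\n" * 40).encode()
         for i in range(10_000)]

start = time.perf_counter()
digests = {i: hashlib.sha256(data).hexdigest() for i, data in enumerate(files)}
elapsed = time.perf_counter() - start
print(f"hashed {len(digests)} entries in {elapsed:.3f}s")
```

With many tiny files, the per-entry hashing overhead adds up regardless of which compression flag is used.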

  
  
Posted 2 years ago

HugeArcticwolf77 from the CLI you cannot control it (but we could probably add that), from code you can:
https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L646
pass compression=ZIP_STORED
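For reference, `ZIP_STORED` is the stdlib `zipfile` flag for store-only mode (no deflate); a quick check that it leaves the payload byte-for-byte uncompressed:

```python
import io
import zipfile

data = b"very small text file contents\n" * 20
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("file.txt", data)

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("file.txt")
print(info.compress_type == zipfile.ZIP_STORED)  # True
print(info.compress_size == info.file_size)      # True: stored as-is
```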

  
  
Posted 2 years ago

Sure, that's slightly more elegant. I'll open a PR now.

  
  
Posted 2 years ago

BTW:

I have very small text files that make up a dataset and compression seems to take most of the upload time

How long does it take? And how come it is not smaller in size?

  
  
Posted 2 years ago

Just dropping this here but I've had some funky compressions with very small datasets! It's not a big issue though, since it's still small and doesn't really affect anything

  
  
Posted 2 years ago

Hi HugeArcticwolf77
I've run the following code, which uploads the files with compression even though compression=None:

ds.upload(show_progress=True, verbose=True, output_url='...', compression=None)
ds.finalize(verbose=True, auto_upload=True)

Any idea why?

  
  
Posted one year ago

AgitatedDove14 I have a suggested fix. Should I open an issue on GitHub, or can I open a PR directly?
compression=compression or ZIP_DEFLATED if compression else ZIP_STORED

  
  
Posted 2 years ago

As a hack you can try DEFAULT_VERSION
(it's just a flag and should basically do Store)
EDIT: sorry that won't work 😞

  
  
Posted 2 years ago

Just dropping this here but I've had some funky compressions with very small datasets!

Odd deflate behavior ...?!

  
  
Posted 2 years ago

compression=ZIP_DEFLATED if compression is None else compression
wdyt?
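The explicit `None` check avoids the falsy-zero trap. A minimal sketch comparing the current `or` fallback with the proposed expression (the helper names are just for illustration):

```python
from zipfile import ZIP_DEFLATED, ZIP_STORED

def current(compression=None):
    # current behavior: 0 (ZIP_STORED) is falsy and gets overridden
    return compression or ZIP_DEFLATED

def proposed(compression=None):
    # proposed behavior: only None falls back to the default
    return ZIP_DEFLATED if compression is None else compression

print(current(ZIP_STORED), proposed(ZIP_STORED))  # 8 0
print(current(None), proposed(None))              # 8 8
```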

  
  
Posted 2 years ago

or can I directly open a PR?

Open a direct PR and link to this thread, I will make sure it is passed along 🙂

  
  
Posted 2 years ago

OutrageousSheep60 passing None means using default compression. You need to pass compression=0

  
  
Posted one year ago

It takes around 3-5 minutes to upload 100-500k plain text files. I just assumed that the added size includes the entire dataset, including metadata, SHA-2 hashes, and other stuff required for dataset functionality.

  
  
Posted 2 years ago

HugeArcticwolf77 oh no, I think you are correct 😞
Do you want to quickly PR a fix ?

  
  
Posted 2 years ago