Answered
Hi, is there any way to upload data to a ClearML dataset without compression at all? I have very small text files that make up a dataset, and compression seems to take most of the upload time while providing almost no benefit w.r.t. size


  
  
Posted one year ago

Answers 15


AgitatedDove14 thanks for the tip. I tried it, but it still compresses the data to zip format. I suspect this is because ZIP_STORED is a constant that equals 0.

This is problematic due to line 689 in dataset.py ( https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L689 ):
compression=compression or ZIP_DEFLATED

Since ZIP_DEFLATED=8, passing compression=0 still causes the data to be compressed.
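
For reference, a minimal standalone check (standard-library constants only, nothing ClearML-specific) showing how the falsy ZIP_STORED gets swallowed by that or:

from zipfile import ZIP_STORED, ZIP_DEFLATED

print(ZIP_STORED)    # 0 -- falsy
print(ZIP_DEFLATED)  # 8

compression = ZIP_STORED
# `or` treats 0 the same as None/missing, so stored mode is silently replaced:
print(compression or ZIP_DEFLATED)  # prints 8, i.e. ZIP_DEFLATED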

  
  
Posted one year ago

HugeArcticwolf77 from the CLI you cannot control it (but we could probably add that); from code you can:
https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L646
pass compression=ZIP_STORED
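
For example, a minimal sketch of the full flow (dataset name, project, and path are placeholders, not from this thread; note the follow-up above that the 0 value currently still gets overridden until the fix discussed here lands):

from zipfile import ZIP_STORED
from clearml import Dataset

ds = Dataset.create(dataset_name="tiny-text-files", dataset_project="examples")
ds.add_files(path="data/texts")
# Forward the stdlib constant to the underlying zip writer to skip deflate
ds.upload(compression=ZIP_STORED)
ds.finalize()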

  
  
Posted one year ago

As a hack you can try DEFAULT_VERSION
(it's just a flag and should basically do Store)
EDIT: sorry that won't work 😞

  
  
Posted one year ago

HugeArcticwolf77 oh no, I think you are correct 😞
Do you want to quickly PR a fix?

  
  
Posted one year ago

Just dropping this here but I've had some funky compressions with very small datasets! It's not a big issue though, since it's still small and doesn't really affect anything

  
  
Posted one year ago

It takes around 3-5 minutes to upload 100-500k plain text files. I just assumed that the added size includes the entire dataset, including metadata, SHA2 hashes, and other stuff required for dataset functionality

  
  
Posted one year ago

AgitatedDove14 I have a suggested fix; should I open an issue on GitHub, or can I directly open a PR?
compression=compression or ZIP_DEFLATED if compression else ZIP_STORED
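
A quick sanity check of how Python groups that expression (the conditional binds looser than or, so ZIP_STORED wins whenever compression is falsy):

from zipfile import ZIP_STORED, ZIP_DEFLATED

compression = ZIP_STORED  # 0
# Groups as: (compression or ZIP_DEFLATED) if compression else ZIP_STORED
result = compression or ZIP_DEFLATED if compression else ZIP_STORED
print(result)  # 0 -- ZIP_STORED is now honored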

  
  
Posted one year ago

or can I directly open a PR?

Open a direct PR and link to this thread, I will make sure it is passed along 🙂

  
  
Posted one year ago

Hi HugeArcticwolf77
I've run the following code, which uploads the files with compression, although compression=None:

ds.upload(show_progress=True, verbose=True, output_url='...', compression=None)
ds.finalize(verbose=True, auto_upload=True)

Any idea why?

  
  
Posted one year ago

Just dropping this here but I've had some funky compressions with very small datasets!

Odd deflate behavior ...?!

  
  
Posted one year ago

compression=ZIP_DEFLATED if compression is None else compression
wdyt?
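
For comparison, a quick side-by-side of the two candidate expressions (plain Python, constants from zipfile); they differ only when compression is None:

from zipfile import ZIP_STORED, ZIP_DEFLATED

def fix_original(compression):
    # The suggestion from earlier in this thread
    return compression or ZIP_DEFLATED if compression else ZIP_STORED

def fix_alternative(compression):
    # The variant proposed here
    return ZIP_DEFLATED if compression is None else compression

for value in (None, ZIP_STORED, ZIP_DEFLATED):
    print(value, fix_original(value), fix_alternative(value))
# None -> 0 vs 8: only the alternative keeps deflate as the default
# 0    -> 0 vs 0: both honor ZIP_STORED
# 8    -> 8 vs 8: both keep an explicit deflate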

  
  
Posted one year ago

The default compression parameter value is ZIP_MINIMAL_COMPRESSION. I guess you could check whether there is a tarball-only option, but anyway, most of the CPU time taken by the upload process is the generation of the hashes of the file entries
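
To illustrate the hashing point, a rough standalone sketch (not ClearML's actual code; the directory name is a placeholder, and SHA-256 is an assumption based on the SHA2 mention above) -- with hundreds of thousands of tiny files, fingerprinting every entry is itself a significant cost, independent of any compression setting:

import hashlib
import pathlib
import time

def hash_files(root):
    # Hash every file under root, the way a dataset tool might fingerprint entries
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in pathlib.Path(root).rglob("*")
        if p.is_file()
    }

start = time.time()
digests = hash_files("data/texts")
print(f"hashed {len(digests)} files in {time.time() - start:.1f}s")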

  
  
Posted one year ago

Sure, that's slightly more elegant. I'll open a PR now

  
  
Posted one year ago

OutrageousSheep60 passing None means using default compression. You need to pass compression=0

  
  
Posted one year ago

BTW:

I have very small text files that make up a dataset and compression seems to take most of the upload time

How long does it take? And how come it is not smaller in size?

  
  
Posted one year ago