HugeArcticwolf77 oh no, I think you are correct 😞
Do you want to quickly PR a fix?
BTW:
I have very small text files that make up a dataset and compression seems to take most of the upload time
How long does it take? And how come it is not smaller in size?
Just dropping this here but I've had some funky compressions with very small datasets! It's not a big issue though, since it's still small and doesn't really affect anything
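On the "how come it is not smaller" question, here is a minimal sketch using only the standard-library zipfile module, not ClearML itself. The file names and contents are made up for illustration. It shows that every zip entry carries fixed header overhead, so an archive of many tiny files can end up larger than the raw data whether or not deflate is used:

```python
import io
import zipfile


def archive_size(files, compression):
    """Return the size in bytes of an in-memory zip archive holding `files`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    # The central directory is written when the archive is closed,
    # so measure only after leaving the `with` block.
    return len(buf.getvalue())


# Hypothetical dataset: 1000 tiny text files of 12 bytes each.
files = {f"doc_{i:04d}.txt": b"hello world\n" for i in range(1000)}

raw_size = sum(len(data) for data in files.values())
stored_size = archive_size(files, zipfile.ZIP_STORED)
deflated_size = archive_size(files, zipfile.ZIP_DEFLATED)

print("raw bytes:     ", raw_size)
print("ZIP_STORED:    ", stored_size)
print("ZIP_DEFLATED:  ", deflated_size)
```

With payloads this small, the per-entry headers and file names dominate, which would be consistent with the "funky" sizes seen on very small datasets.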
HugeArcticwolf77 from the CLI you cannot control it (but we could probably add that), from code you can:
https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L646
pass compression=ZIP_STORED
sure, that's slightly more elegant. I'll open a PR now
As a hack you can try DEFAULT_VERSION
(it's just a flag and should basically do Store)
EDIT: sorry that won't work 😞
compression=ZIP_DEFLATED if compression is None else compression
wdyt?
Open a direct PR and link to this thread, I will make sure it is passed along 🙂
AgitatedDove14 I have a suggested fix. Should I open an issue on GitHub, or can I directly open a PR?
compression=compression or ZIP_DEFLATED if compression else ZIP_STORED
OutrageousSheep60 passing None means using default compression. You need to pass compression=0
AgitatedDove14 thanks for the tip. I tried it but it still compresses the data to zip format. I suspect this is because ZIP_STORED is a constant that equals 0.
This is problematic due to line 689 in dataset.py ( https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L689 ):
compression=compression or ZIP_DEFLATED
Since ZIP_DEFLATED=8, passing compression=0 still causes data to be compressed
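To make the falsy-zero problem concrete, here is a small standalone sketch. The two helper functions are hypothetical names written for illustration; only the expressions inside them come from this thread (the `or` pattern quoted from dataset.py and the `is None` check suggested here):

```python
from zipfile import ZIP_DEFLATED, ZIP_STORED


def resolve_with_or(compression=None):
    # Pattern quoted from dataset.py line 689: any falsy value,
    # including ZIP_STORED (0), falls through to ZIP_DEFLATED.
    return compression or ZIP_DEFLATED


def resolve_with_none_check(compression=None):
    # Variant suggested in this thread: only None selects the default,
    # so an explicit 0 (ZIP_STORED) is respected.
    return ZIP_DEFLATED if compression is None else compression


print(resolve_with_or(ZIP_STORED))          # 8, i.e. ZIP_DEFLATED
print(resolve_with_none_check(ZIP_STORED))  # 0, i.e. ZIP_STORED
print(resolve_with_none_check(None))        # 8, i.e. ZIP_DEFLATED
```

The `is None` check keeps the current default for callers who pass nothing while letting compression=0 actually mean "store without compression".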
It takes around 3-5 minutes to upload 100-500k plain text files. I just assumed that the added size covers the entire dataset, including metadata, SHA2 and other stuff required for dataset functionality
The default compression parameter value is ZIP_MINIMAL_COMPRESSION. I guess you could check whether there is a tarball-only option, but anyway most of the CPU time taken by the upload process goes to generating the hashes of the file entries
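If you want to check where the CPU time goes on your own data, here is a rough sketch. It makes several assumptions: SHA-256 stands in for whatever hash ClearML actually computes, zlib stands in for zip deflate, and the files are simulated as in-memory byte strings, so disk I/O is not measured. `time_hash_and_deflate` is a hypothetical helper, not part of ClearML:

```python
import hashlib
import time
import zlib


def time_hash_and_deflate(blobs):
    """Return (hash_seconds, deflate_seconds) for a list of bytes objects."""
    start = time.perf_counter()
    for blob in blobs:
        hashlib.sha256(blob).hexdigest()
    hash_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for blob in blobs:
        zlib.compress(blob)
    deflate_seconds = time.perf_counter() - start

    return hash_seconds, deflate_seconds


# Hypothetical stand-in for a dataset of many small text files.
blobs = [f"sample text line {i}\n".encode() * 20 for i in range(10000)]

hash_seconds, deflate_seconds = time_hash_and_deflate(blobs)
print(f"hashing:   {hash_seconds:.3f}s")
print(f"deflating: {deflate_seconds:.3f}s")
```

The numbers depend heavily on file sizes and hardware, so treat them as a starting point for profiling the real upload rather than a verdict.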
Hi HugeArcticwolf77
I've run the following code, which uploads the files with compression even though compression=None
ds.upload(show_progress=True, verbose=True, output_url='
', compression=None)
ds.finalize(verbose=True, auto_upload=True)
Any idea why?
Odd deflate behavior ...?!