BTW:
I have very small text files that make up a dataset and compression seems to take most of the upload time
How long does it take? And how come it is not smaller in size?
sure, that's slightly more elegant. I'll open a PR now
OutrageousSheep60 passing None means using default compression. You need to pass compression=0
The default compression parameter value is ZIP_MINIMAL_COMPRESSION
I guess you could try to check if there is a tarball-only option, but anyway most of the CPU time taken by the upload process goes to generating the hashes of the file entries
It takes around 3-5 minutes to upload 100-500k plain text files. I just assumed that the added size includes the entire dataset including metadata, SHA2 and other stuff required for dataset functionality
Hi HugeArcticwolf77
I've run the following code, which uploads the files with compression even though compression=None
ds.upload(show_progress=True, verbose=True, output_url='
', compression=None)
ds.finalize(verbose=True, auto_upload=True)
Any idea why?
AgitatedDove14 thanks for the tip. I tried it but it still compresses the data to zip format. I suspect it is since ZIP_STORED is a constant that equals 0:
This is problematic due to line 689 in dataset.py ( https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L689 ):
compression=compression or ZIP_DEFLATED
Since ZIP_DEFLATED=8, passing compression=0 still causes data to be compressed
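To make the falsy-zero problem concrete, here is a minimal sketch using only the standard-library zipfile constants. The function name resolve_buggy is hypothetical; it just mirrors the `compression or ZIP_DEFLATED` expression quoted above and is not ClearML code.

```python
from zipfile import ZIP_DEFLATED, ZIP_STORED


def resolve_buggy(compression=None):
    # Hypothetical helper mirroring the `compression or ZIP_DEFLATED` pattern.
    # ZIP_STORED == 0 is falsy, so `or` falls through to ZIP_DEFLATED.
    return compression or ZIP_DEFLATED


print(ZIP_STORED, ZIP_DEFLATED)    # 0 8
print(resolve_buggy(None))         # 8 (default, as intended)
print(resolve_buggy(ZIP_STORED))   # 8 (explicit "no compression" is lost)
```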
Just dropping this here but I've had some funky compressions with very small datasets! It's not a big issue though, since it's still small and doesn't really affect anything
compression=ZIP_DEFLATED if compression is None else compression
wdyt?
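A small sketch of how the `is None` variant would behave, assuming the goal is "default to deflate only when nothing was passed". The helper name resolve_compression is made up for illustration and is not part of ClearML.

```python
from zipfile import ZIP_BZIP2, ZIP_DEFLATED, ZIP_STORED


def resolve_compression(compression=None):
    # Hypothetical helper: only an omitted value (None) selects the default.
    # An explicit 0 (ZIP_STORED) is kept because the check is `is None`,
    # not a truthiness test.
    return ZIP_DEFLATED if compression is None else compression


print(resolve_compression())             # 8 (ZIP_DEFLATED)
print(resolve_compression(ZIP_STORED))   # 0 (ZIP_STORED is preserved)
print(resolve_compression(ZIP_BZIP2))    # 12
```

Unlike a truthiness check, this keeps the default for callers that pass nothing while still letting compression=0 through.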
HugeArcticwolf77 oh no, I think you are correct 😞
Do you want to quickly PR a fix ?
AgitatedDove14 I have a suggested fix. Should I open an issue on GitHub, or can I directly open a PR?
compression=compression or ZIP_DEFLATED if compression else ZIP_STORED
Open a direct PR and link to this thread, I will make sure it is passed along 🙂
HugeArcticwolf77 from the CLI you cannot control it (but we could probably add that), from code you can:
https://github.com/allegroai/clearml/blob/d17903d4e9f404593ffc1bdb7b4e710baae54662/clearml/datasets/dataset.py#L646
pass compression=ZIP_STORED
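For a rough feel of what ZIP_STORED vs ZIP_DEFLATED means for lots of tiny text files, here is a standalone sketch using only the standard-library zipfile module. It does not call ClearML at all; build_archive and the sample file names and contents are made up for illustration.

```python
import io
import zipfile


def build_archive(files, compression):
    """Write {name: bytes} into an in-memory zip and return the raw bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=compression) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Hypothetical stand-in data: many tiny text files.
files = {f"sample_{i:04d}.txt": f"label {i}\n".encode() for i in range(1000)}

stored = build_archive(files, zipfile.ZIP_STORED)
deflated = build_archive(files, zipfile.ZIP_DEFLATED)
raw_size = sum(len(payload) for payload in files.values())

print(f"raw payload bytes:  {raw_size}")
print(f"ZIP_STORED bytes:   {len(stored)}")
print(f"ZIP_DEFLATED bytes: {len(deflated)}")
```

Both archives come out larger than the raw payload, because every zip entry carries its own headers (file name, CRC, sizes). With files this small that per-entry overhead dominates, which may be part of why the uploaded dataset is not smaller than the original files.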
Just dropping this here but I've had some funky compressions with very small datasets!
Odd deflate behavior ...?!
As a hack you can try DEFAULT_VERSION
(it's just a flag and should basically do Store)
EDIT: sorry that won't work 😞