DepressedFish57, Hi 🙂
What do you mean by downloading a previous part of the dataset? get_local_copy
fetches the entire dataset if I'm not mistaken. Am I missing something?
Hi AnxiousSeal95 ! Thank you for the parallel download feature you have added. We have tested it with ClearML 1.5.0, and it does not seem to help much: for a big dataset, the download time does not really change with the new feature. We can indeed download several chunks in parallel, but it turns out that while N workers are downloading N chunks, each worker's download speed is N times lower than for a sequential download. Since most of the chunks have the same size, the download processes finish at roughly the same time, and then the zip extraction processes run in parallel, again about N times slower than in single-worker mode. Such an approach does not seem to be the most efficient one. We discussed it with DepressedFish57 and we suggest doing it another way. We propose to modify the approach as follows: 1) download the first chunk; 2) start downloading the second chunk at the same moment extraction of the first chunk begins. This allows each chunk to be downloaded at the maximum possible speed while the CPU-bound extraction runs in parallel, which will indeed let us fetch our data faster.
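Something like this minimal sketch of the proposed pipelining, assuming hypothetical download_chunk/extract_chunk helpers rather than the actual ClearML internals:
```python
import os
import tempfile
import urllib.request
import zipfile
from concurrent.futures import ThreadPoolExecutor

def download_chunk(url: str, dest_dir: str) -> str:
    """Fetch one dataset chunk (a zip file) and return its local path."""
    path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, path)
    return path

def extract_chunk(zip_path: str, target_dir: str) -> None:
    """Unpack one downloaded chunk (the CPU-bound step)."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)

def pipelined_fetch(chunk_urls: list, target_dir: str) -> None:
    """Download chunk i+1 while chunk i is being extracted, so the
    network link and the CPU are both kept busy at full speed."""
    tmp_dir = tempfile.mkdtemp()
    with ThreadPoolExecutor(max_workers=1) as downloader:
        pending = downloader.submit(download_chunk, chunk_urls[0], tmp_dir)
        for i in range(len(chunk_urls)):
            zip_path = pending.result()               # wait for chunk i
            if i + 1 < len(chunk_urls):               # start chunk i+1 immediately
                pending = downloader.submit(download_chunk, chunk_urls[i + 1], tmp_dir)
            extract_chunk(zip_path, target_dir)       # extract while i+1 downloads
```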
ExcitedSeaurchin87 I took a quick look, dude this is awesome!!! Thank you 🤩
Hi DepressedFish57 clearml v1.5.0 (pip install clearml==1.5.0) is out with a fix for this issue 🙂
Let us know if it works as expected 😄
Hi DepressedFish57 AgitatedDove14 AnxiousSeal95 ! It's me again. I created a PR ( https://github.com/allegroai/clearml/pull/713 ) where I slightly changed the dataset loading code, ran some tests to estimate dataset loading times with the current and proposed approaches, and tried to explain the problem I see in detail.
Hi DepressedFish57
In my case, downloading each part takes ~5 seconds, and unzipping ~15.
We ran into that, and the new version will employ a multithreaded approach for the unzip (meaning the unzipping will happen in the background)
Could you please point me to the piece of ClearML code related to the downloading process?
The dataset I'm storing is split into many parts, each around 500 MB, and when you call get_local_copy
it downloads and unpacks each part one after another.
In my case, downloading each part takes ~5 seconds, and unzipping ~15.
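For reference, the call pattern looks roughly like this (the project and dataset names are placeholders):
```python
from clearml import Dataset

# Fetch (and cache) a full local copy of the dataset; with a multi-part
# dataset, each ~500 MB chunk is downloaded and then unpacked.
dataset = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
local_path = dataset.get_local_copy()
print(local_path)
```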
Yes AgitatedDove14, I mean multithreading. I did not quite understand your question about a single Dataset version. Could you clarify the question for me, please?
At least from the logs I see in my terminal, I assume that right now downloading works as in scheme 2. This one:
...
Could you please point me to the piece of ClearML code related to the downloading process?
I think I mean this part:
https://github.com/allegroai/clearml/blob/e3547cd89770c6d73f92d9a05696018957c3fd62/clearml/datasets/dataset.py#L2134
Here are the schemes of the discussed variants (1. download process before ClearML 1.5.0; 2. current version; 3. proposed method). I hope they will be helpful.
Hi DepressedFish57 , as Martin said, either the next version or the next-next version will have this feature 😄 We'll update here when it's out 🙂
ExcitedSeaurchin87 can I assume "in parallel" means threads?
Also, is this a single Dataset version download? At least in theory, option (3) is the new default in the latest clearml version. wdyt?
Or, we can download chunks in parallel, like we do right now, but we have to prioritize the download of the earlier chunks, so that extraction and downloading run in parallel.
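A rough sketch of that prioritized variant, reusing the hypothetical download_chunk/extract_chunk helpers from the sketch above:
```python
import tempfile
from concurrent.futures import ThreadPoolExecutor

def prioritized_fetch(chunk_urls: list, target_dir: str, max_workers: int = 4) -> None:
    """Download several chunks in parallel, but submit them in order so the
    earlier chunks claim the workers first, and extract strictly in order:
    extraction of chunk 0 then overlaps with the download of later chunks."""
    tmp_dir = tempfile.mkdtemp()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download_chunk, url, tmp_dir) for url in chunk_urls]
        for future in futures:                        # consume in submission order
            extract_chunk(future.result(), target_dir)
```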