Answered
Hi! I'M Using Func

Hi! I'm using the get_local_copy function to download a dataset. Is it possible to download the next part of the dataset while the previous part is being unzipped? Right now these phases take turns, which takes a lot of time.

  
  
Posted 2 years ago

Answers 14


DepressedFish57 , Hi 🙂

What do you mean by downloading a previous part of the dataset? get_local_copy fetches the entire dataset if I'm not mistaken. Am I missing something?

  
  
Posted 2 years ago

Hi AnxiousSeal95! Thank you for the parallel download feature you added. We tested it with ClearML 1.5.0, and it does not seem to help much: for a big dataset, the download time barely changes with the new feature. We can indeed download several chunks in parallel, but while N workers are downloading N chunks, the download speed of each worker is about N times lower than for a sequential download. Since most of the chunks are the same size, the downloads finish at about the same time, and then the zip extraction runs in parallel, again about N times slower than in single-worker mode. This approach does not seem to be the most efficient one. We discussed it with DepressedFish57 and suggest doing it differently: 1) download the first chunk; 2) start downloading the second chunk at the same moment extraction of the first chunk begins. This lets each chunk download at the maximum possible speed while the CPU-bound extraction runs in parallel, which would indeed let us download our data faster.
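The proposal above can be sketched as a small producer/consumer pipeline. This is only an illustration under stated assumptions, not ClearML's implementation: `download_chunk` and `extract_chunk` are hypothetical stand-ins that simulate the work with `time.sleep`, and a single background thread plays the role of the extractor.

```python
import queue
import threading
import time


def download_chunk(index):
    """Hypothetical stand-in for downloading one chunk over the network."""
    time.sleep(0.05)
    return f"chunk-{index}.zip"


def extract_chunk(archive):
    """Hypothetical stand-in for CPU-bound zip extraction."""
    time.sleep(0.15)
    return archive.replace(".zip", "")


def pipelined_fetch(num_chunks):
    """Download chunks one at a time while a worker extracts finished ones."""
    ready = queue.Queue()
    extracted = []

    def extractor():
        while True:
            archive = ready.get()
            if archive is None:  # sentinel: no more chunks will arrive
                break
            extracted.append(extract_chunk(archive))

    worker = threading.Thread(target=extractor)
    worker.start()
    for i in range(num_chunks):
        # The next download starts while the previous chunk is being extracted.
        ready.put(download_chunk(i))
    ready.put(None)
    worker.join()
    return extracted
```

Only one download runs at a time, so it gets the full bandwidth, and extraction overlaps with the next download. With real archives, whether a thread is enough for the extraction step (as opposed to a separate process) depends on how much of the work releases the interpreter lock.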

  
  
Posted 2 years ago

ExcitedSeaurchin87 I took a quick look, dude this is awesome!!! Thank you 🤩

  
  
Posted 2 years ago

Hi DepressedFish57 clearml v1.5.0 (pip install clearml==1.5.0) is out with a fix for this issue 🙂
Let us know if it works as expected 😄

  
  
Posted 2 years ago

Hi DepressedFish57 AgitatedDove14 AnxiousSeal95! It's me again. I created a pull request, https://github.com/allegroai/clearml/pull/713, in which I slightly changed the dataset loading code, ran some tests to estimate dataset loading times with the current and proposed approaches, and tried to explain in detail the problem I see.

  
  
Posted 2 years ago

Hi DepressedFish57

In my case, downloading each part takes ~5 seconds, and unzipping ~15.

We ran into that too, and the new version will employ a multithreaded approach for the unzip (meaning the unzipping will happen in the background).

  
  
Posted 2 years ago

Could you please point me to the piece of ClearML code related to the downloading process?

  
  
Posted 2 years ago

For storage, the dataset is split into many parts, each around 500 MB. When you call get_local_copy, it downloads and unpacks each part sequentially.
In my case, downloading each part takes ~5 seconds, and unzipping ~15.
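To put rough numbers on this, the sketch below compares the sequential flow with an overlapped one, where the next part downloads while the previous one is unzipped. It is a back-of-the-envelope model only: the function names are made up for illustration, and the number of parts is an assumption, since it is not stated here.

```python
def sequential_time(num_parts, download_s, unzip_s):
    """Total time when each part is downloaded, then unzipped, one after another."""
    return num_parts * (download_s + unzip_s)


def pipelined_time(num_parts, download_s, unzip_s):
    """Total time with one downloader and one extractor working in overlap."""
    download_done = 0.0
    unzip_done = 0.0
    for _ in range(num_parts):
        download_done += download_s
        # A part can be unzipped only once it is downloaded
        # and the extractor has finished the previous part.
        unzip_done = max(download_done, unzip_done) + unzip_s
    return unzip_done
```

For a hypothetical 10 parts at ~5 s download and ~15 s unzip each, this gives 200 s sequentially versus 155 s overlapped. Because unzipping dominates, overlapping only hides the download time; further gains would have to come from speeding up extraction itself.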

  
  
Posted 2 years ago

Yes AgitatedDove14 , I mean multithreading. I did not quite understand your question about single Dataset version. Could you clarify the question for me, please?

At least from the logs I see in my terminal, I assume that downloading currently works as in scheme 2.

  
  
Posted 2 years ago

Could you clarify the question for me, please?
...
Could you please point me to the piece of ClearML code related to the downloading process?

I think I mean this part:
https://github.com/allegroai/clearml/blob/e3547cd89770c6d73f92d9a05696018957c3fd62/clearml/datasets/dataset.py#L2134

  
  
Posted 2 years ago

Here are the schemes of the discussed variants (1. download process before ClearML 1.5.0; 2. current version; 3. proposed method). I hope they will be helpful.

  
  
Posted 2 years ago

Hi DepressedFish57 , as Martin said either the next version or the next-next version will have this feature 😄 We'll update here when it's out 🙂

  
  
Posted 2 years ago

ExcitedSeaurchin87 can I assume "in parallel" means threads?
Also, is this a single Dataset version download? At least in theory, option (3) is the new default in the latest clearml version. wdyt?

  
  
Posted 2 years ago

Or we can download chunks in parallel, as we do right now, but prioritize the download of the earlier chunks so that extraction and downloading run in parallel.
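A minimal sketch of this alternative, assuming a thread pool whose tasks are submitted in chunk order so earlier chunks start downloading first. `download_chunk` and `extract_chunk` are hypothetical placeholders that simulate the work, not ClearML functions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def download_chunk(index):
    """Hypothetical stand-in for downloading one chunk over the network."""
    time.sleep(0.05)
    return f"chunk-{index}.zip"


def extract_chunk(archive):
    """Hypothetical stand-in for CPU-bound zip extraction."""
    time.sleep(0.1)
    return archive.replace(".zip", "")


def parallel_download_ordered_extract(num_chunks, max_workers=2):
    """Download in parallel, but extract chunks in order as soon as each is ready."""
    extracted = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submitting in index order means earlier chunks are picked up first.
        futures = [pool.submit(download_chunk, i) for i in range(num_chunks)]
        for future in futures:
            # Extraction of chunk i runs here while later chunks keep downloading.
            extracted.append(extract_chunk(future.result()))
    return extracted
```

The main thread starts extracting the first chunk as soon as it arrives instead of waiting for every download to finish, so the two phases overlap even with several download workers.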

  
  
Posted 2 years ago