Hi DepressedFish57, as Martin said, either the next version or the one after will have this feature 😄 We'll update here when it's out 🙂
Yes AgitatedDove14, I mean multithreading. I did not quite understand your question about a single Dataset version. Could you clarify the question for me, please?
At least from the logs I see in my terminal, I assume that downloading currently works as in scheme 2. This one:
DepressedFish57, Hi 🙂
What do you mean by downloading a previous part of the dataset? get_local_copy fetches the entire dataset if I'm not mistaken. Am I missing something?
ExcitedSeaurchin87 I took a quick look, dude this is awesome!!! Thank you 🤩
ExcitedSeaurchin87 can I assume "in parallel" means threads?
Also, is this a single Dataset version download? At least in theory, option (3) is the new default in the latest clearml version. wdyt?
Here are the schemes of the discussed variants (1. download process before ClearML 1.5.0; 2. current version; 3. proposed method). I hope they will be helpful.
Could you please point me to the piece of ClearML code related to the downloading process?
Hi DepressedFish57 AgitatedDove14 AnxiousSeal95! It's me again. I created a pull request ( https://github.com/allegroai/clearml/pull/713 ) where I slightly changed the dataset loading code, ran some tests to estimate dataset loading times with the current and proposed approaches, and tried to explain in detail the problem I see.
Could you clarify the question for me, please?
...
Could you please point me to the piece of ClearML code related to the downloading process?
I think I mean this part:
https://github.com/allegroai/clearml/blob/e3547cd89770c6d73f92d9a05696018957c3fd62/clearml/datasets/dataset.py#L2134
Hi DepressedFish57, clearml v1.5.0 (pip install clearml==1.5.0) is out with a fix for this issue 🙂
Let us know if it works as expected 😄
Or we can download chunks in parallel, as we do right now, but we would have to prioritize downloading the earlier chunks so that extraction and downloading can run in parallel.
Hi AnxiousSeal95! Thank you for the parallel download feature you have added. We have tested it with ClearML 1.5.0, and it seems that it is not really helpful: for a big dataset the download time does not really change with the new feature. We can indeed download several chunks in parallel, but it turns out that while N workers are downloading N chunks, the download speed of each worker is N times lower than with a sequential download. Since most of the chunks are the same size, the downloads finish at the same time, and then the parallel zip extraction runs, again about N times slower than in single-worker mode. Such an approach does not seem to be the most efficient one. We discussed it with DepressedFish57 and would suggest doing it another way. We propose to modify the approach as follows: 1) download the first chunk; 2) start downloading the second chunk at the same moment extraction of the first chunk begins, and so on. This would allow each chunk to be downloaded at the maximum possible speed and let the CPU-bound extraction run in parallel with the downloads, which would indeed let us get our data faster.
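To illustrate the idea, here is a very rough sketch of both variants (this is pseudocode on our side, not ClearML internals; the chunk URLs and the download_chunk / extract_chunk helpers are hypothetical placeholders):
```python
import os
import zipfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download_chunk(url, dest_dir):
    # hypothetical helper: fetch one dataset chunk and return the local zip path
    local_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, local_path)
    return local_path

def extract_chunk(zip_path, target_dir):
    # hypothetical helper: unzip one downloaded chunk into the dataset folder
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)

def pipelined_get(chunk_urls, dest_dir, target_dir):
    # proposed flow: download chunk i+1 while chunk i is being extracted
    if not chunk_urls:
        return
    with ThreadPoolExecutor(max_workers=1) as downloader:
        future = downloader.submit(download_chunk, chunk_urls[0], dest_dir)
        for next_url in chunk_urls[1:]:
            zip_path = future.result()                                      # wait for the current download
            future = downloader.submit(download_chunk, next_url, dest_dir)  # start the next one right away
            extract_chunk(zip_path, target_dir)                             # extract while the next chunk downloads
        extract_chunk(future.result(), target_dir)                          # last chunk has nothing left to overlap with

def prioritized_parallel_get(chunk_urls, dest_dir, target_dir, workers=4):
    # alternative flow: download several chunks at once, but consume the results
    # in chunk order, so extraction of earlier chunks starts as soon as they arrive
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(download_chunk, url, dest_dir) for url in chunk_urls]
        for future in futures:                                              # submission order, not completion order
            extract_chunk(future.result(), target_dir)
```
In both sketches the point is simply that extraction overlaps with downloading instead of waiting for all downloads to finish.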
The Dataset we use for storage is split into many parts, each around 500 MB. And when you call get_local_copy, it downloads and unpacks each part sequentially.
In my case downloading each part takes ~5 seconds, and unzipping ~15.
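For context, this is roughly how we fetch the dataset (the project and dataset names below are placeholders):
```python
from clearml import Dataset

# placeholder project/dataset names; the dataset itself consists of many ~500 MB chunks
ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")

# downloads and unzips every chunk before returning the local folder path
local_path = ds.get_local_copy()
print(local_path)
```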
Hi DepressedFish57
In my case downloading each part takes ~5 seconds, and unzipping ~15.
We ran into that too, and the new version will use a multithreaded approach for the unzip (meaning the unzipping will happen in the background)
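Roughly speaking (this is just an illustration of the idea, not the actual ClearML implementation), extraction gets handed off to worker threads so the caller can keep downloading, something like:
```python
import zipfile
from concurrent.futures import ThreadPoolExecutor

def _extract(zip_path, target_dir):
    # unzip one already-downloaded chunk
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)

def background_unzip(zip_paths, target_dir, max_workers=4):
    # submit extraction to a thread pool so it runs in the background
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = [pool.submit(_extract, path, target_dir) for path in zip_paths]
    return pool, futures  # wait on the futures and shut the pool down once all downloads are done
```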