Dataset.remove_files() is very slow. How can I speed it up?

Answered

Hi folks,

TLDR: Dataset.remove_files() is very slow. How can I speed it up?

I'm working with a large raw dataset of which we want to use only a small subset. The data consists of thousands of images, each with its own metadata JSON file. To create the subset, we use the dataset.remove_files() function, which seems to upload a state dict after every file change. Is there a way to batch these state dict changes, or something similar? We pass individual files, which are stored in S3, to remove_files() (is there another way?).

Appreciate the help

  				
Posted one year ago by DisgustedSquid10


Answers 4

The way I wrote it is a bit of a quick fix with a lot of code duplication. I'm sure it could be implemented more cleanly, e.g. with a single remove_files method that takes either one path or a list of paths.
It's one of those things I intended to clean up at some point but never had the time for. (I made a similar modification for adding lists of files, which has exactly the same issue when you want to add specific files rather than something you can define with a wildcard.)
If you're up for it, feel free - I'm sure there are plenty of people who would appreciate it.
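
One way the single-method version could look, sketched on a plain dict rather than on clearml's Dataset class. The name `remove_entries` and the dict layout are assumptions made purely for illustration:

```python
from typing import Dict, Iterable, Union


def remove_entries(
    entries: Dict[str, int], paths: Union[str, Iterable[str]]
) -> Dict[str, int]:
    """Return a copy of `entries` without the given path or paths.

    Hypothetical helper, not part of clearml.
    """
    # A bare string is one path, not an iterable of characters.
    if isinstance(paths, str):
        paths = [paths]
    to_remove = set(paths)  # set gives O(1) membership checks
    return {k: v for k, v in entries.items() if k not in to_remove}


entries = {"a.jpg": 10, "a.json": 1, "b.jpg": 12, "b.json": 1}
after_single = remove_entries(entries, "a.jpg")
after_list = remove_entries(entries, ["b.jpg", "b.json"])
print(sorted(after_single), sorted(after_list))
```

Normalizing the argument up front means the single-path and list cases share one code path, which avoids the duplication mentioned above.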

  				
Posted one year ago by JealousMole49

Seems to be working. Is there any reason not to add this to the main repo?

  				
Posted one year ago by DisgustedSquid10

I will give it a shot. Thanks!

  				
Posted one year ago by DisgustedSquid10

Hi Allen,
I've run into this exact problem myself, and simply added a function to dataset.py in the clearml package (clearml/datasets/dataset.py) that takes a list of files instead of a single file.

It looks like this (I use clearml 1.13.1):
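
A minimal sketch of the batching idea, using a stand-in class rather than clearml itself. `ToyDataset`, `_serialize`, and `remove_file_list` are hypothetical names, and the real Dataset internals in clearml 1.13.1 may well differ. The only point illustrated is that the entries are filtered against the whole list first and the state is written once at the end:

```python
from typing import Dict, Iterable


class ToyDataset:
    """Stand-in for a dataset that persists its state after each change.

    Hypothetical: this is NOT clearml's Dataset class.
    """

    def __init__(self, entries: Dict[str, int]) -> None:
        self._entries = dict(entries)  # relative path -> size in bytes
        self.serialize_calls = 0       # counts simulated state uploads

    def _serialize(self) -> None:
        # A real implementation would upload the state dict; here we only count.
        self.serialize_calls += 1

    def remove_file(self, path: str) -> int:
        """Remove one file and persist the state immediately."""
        removed = 1 if self._entries.pop(path, None) is not None else 0
        self._serialize()
        return removed

    def remove_file_list(self, paths: Iterable[str]) -> int:
        """Remove many files, persisting the state only once."""
        to_remove = set(paths)
        before = len(self._entries)
        self._entries = {
            k: v for k, v in self._entries.items() if k not in to_remove
        }
        self._serialize()
        return before - len(self._entries)


files = {f"images/{i:04d}.jpg": 1 for i in range(1000)}
unwanted = [f"images/{i:04d}.jpg" for i in range(900)]

slow = ToyDataset(files)
for p in unwanted:
    slow.remove_file(p)        # one simulated upload per file

fast = ToyDataset(files)
removed = fast.remove_file_list(unwanted)  # one simulated upload in total

print(slow.serialize_calls, fast.serialize_calls, removed)
```

Both approaches leave the same 100 entries behind; the difference is 900 simulated state writes versus one.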

  				
Posted one year ago by JealousMole49


954 Views
