Answered

Dataset.remove_files() is very slow. How can I speed it up?

Hi folks,

TLDR: Dataset.remove_files() is very slow. How can I speed it up?

I'm working with a large raw dataset that we are trying to use a small subset of. The data is thousands of images plus a metadata JSON file for each image. To create this subset of the raw data we are using the dataset.remove_files() function, which seems to upload a state dict after every file change. Is there a way to batch this state dict change, or something similar? We are passing individual files to remove_files() (is there another way?); the files are stored in S3.
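
For reference, what we are doing now is roughly the following (project/dataset names and file paths are placeholders):

    from clearml import Dataset

    # Placeholder names - the subset is a new dataset version based on the raw parent dataset
    subset = Dataset.create(
        dataset_name="raw-data-subset",
        dataset_project="my-project",
        parent_datasets=["<raw-dataset-id>"],
    )

    # In practice this list has thousands of entries (each image plus its metadata JSON)
    files_to_drop = ["images/0001.jpg", "images/0001.json"]

    # One remove_files() call per path - each call seems to re-upload the dataset state dict
    for path in files_to_drop:
        subset.remove_files(dataset_path=path)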

Appreciate the help

  
  
Posted 23 days ago

4 Answers


The way I wrote it is a bit of a quick fix with a lot of code duplication; I'm sure it could be implemented in a cleaner way (e.g. having only one remove_files method that can take either a single path or a list of paths).
It's one of those things I intended to do at some point but never had the time to clean up (I made a similar modification for adding lists of files, since add_files has exactly the same issue when you want to add specific files rather than something you can express with a wildcard).
If you're up for it, feel free - I'm sure there are plenty of people who would appreciate it.

  
  
Posted 23 days ago

Seems to be working. Is there any reason not to add this to the main repo?

  
  
Posted 23 days ago

I will give it a shot. Thanks!

  
  
Posted 23 days ago

Hi Allen,
I've run into this exact problem myself, and simply added a function to dataset.py in the clearml package (clearml/datasets/dataset.py) that takes a list of files instead of a single file.

It looks like this (I use clearml 1.13.1):
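
Roughly, the idea is a list-based variant along these lines. The method name remove_multiple_files is just what I called it, and the internal names _dataset_file_entries and _serialize() are taken from the 1.13 code, so check them against your installed version; treat this as a sketch of the approach rather than a drop-in patch:

    def remove_multiple_files(self, dataset_paths):
        # Remove several files at once and serialize the dataset state a single time,
        # instead of re-uploading the state dict for every individual file.
        # dataset_paths: list of relative paths as stored in the dataset
        paths_to_remove = set(dataset_paths)
        num_before = len(self._dataset_file_entries)

        # Filter out all matching entries in one pass
        self._dataset_file_entries = {
            k: v for k, v in self._dataset_file_entries.items()
            if k not in paths_to_remove
        }

        # One state upload for the whole batch
        self._serialize()

        # Number of entries actually removed
        return num_before - len(self._dataset_file_entries)

On the calling side it then becomes a single dataset.remove_multiple_files(list_of_relative_paths) call instead of a loop over remove_files().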

  
  
Posted 23 days ago