Answered
Hi, Relating To The

Hi, relating to the Dataset SDK, is there a way to call add_external_files with a very large number of URLs that are not in the same S3 folder without running into a usage limit caused by the state.json file being updated so often?

  
  
Posted 2 years ago

Answers 7


The use case is that we want a dataset of over 100k S3 images, and they are scattered all over our bucket because of how we organise them. If I send an array of URLs as the source_url, it eventually fails after around 40 because of the GCS rate limit for updating the state.json.

  
  
Posted 2 years ago

Hi ShortElephant92
What do you mean by the state.json being updated a lot?
I think it is updated every time you call add_external_files, but add_external_files can also take a folder to scan, which would be more efficient. How are you using it?

  
  
Posted 2 years ago

I am adding 100k S3 images that are not in the same folder, and I can't move them in S3. So I have a list of S3 links, not a folder, unfortunately.

  
  
Posted 2 years ago

Ohh, yes, that makes sense. So just send them as a list of links in a single call:
dataset.add_external_files(source_url=["s3://", "s3://"], ...)
This will be a single update.
https://github.com/allegroai/clearml/blob/ff7b174bf162347b82226f413040ff6473401e92/clearml/datasets/dataset.py#L430
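
As a rough sketch of the suggestion above, the whole list can be handed to add_external_files in one call through its source_url argument. The helper name register_links, the RecordingDataset stand-in, and the example bucket paths are hypothetical; the stand-in only exists so the snippet runs without a ClearML server, and with the real SDK you would pass a Dataset object instead. Whether one call really results in one state update depends on the SDK version.

```python
from typing import List, Sequence


class RecordingDataset:
    """Hypothetical stand-in for a ClearML Dataset; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def add_external_files(self, source_url, **kwargs) -> int:
        # Accept a single URL or a sequence of URLs, like the call in the thread.
        urls = [source_url] if isinstance(source_url, str) else list(source_url)
        self.calls.append(urls)
        # Returning the link count is a choice of this stand-in, not a claim
        # about the real SDK.
        return len(urls)


def register_links(dataset, links: Sequence[str]) -> int:
    """Register every link with exactly one add_external_files call."""
    return dataset.add_external_files(source_url=list(links))


# Hypothetical links that live in different folders of the same bucket.
links = ["s3://my-bucket/a/1.jpg", "s3://my-bucket/b/2.jpg"]
dataset = RecordingDataset()
added = register_links(dataset, links)
print(added, len(dataset.calls))
```

The point of the wrapper is only that the caller never loops over the links itself; any per-link work that remains happens inside the library.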

  
  
Posted 2 years ago

AgitatedDove14 I have tried that. However, if you look at the implementation, what actually happens is a for loop that calls the method once per URL. This is a problem because it updates my external state.json file over 100k times, and the cloud provider blocks my requests after around 40 😅
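
To make the difference concrete, here is a toy model of the two patterns. It is not ClearML's implementation: ToyState, add_one_by_one, add_batched, and the bucket paths are all hypothetical, and save() merely counts how often a state file would be written to cloud storage.

```python
from typing import Iterable, List


class ToyState:
    """Toy model of a remote state file; counts how often it is written."""

    def __init__(self) -> None:
        self.links: List[str] = []
        self.writes = 0

    def save(self) -> None:
        # Each save stands for one upload of the state file to cloud storage.
        self.writes += 1


def add_one_by_one(state: ToyState, urls: Iterable[str]) -> None:
    """Persist the state after every link (the per-URL loop described above)."""
    for url in urls:
        state.links.append(url)
        state.save()


def add_batched(state: ToyState, urls: Iterable[str]) -> None:
    """Collect all links first, then persist the state once."""
    state.links.extend(urls)
    state.save()


urls = [f"s3://my-bucket/folder-{i}/image.jpg" for i in range(1000)]
looped, batched = ToyState(), ToyState()
add_one_by_one(looped, urls)
add_batched(batched, urls)
print(looped.writes, batched.writes)  # 1000 1
```

Both variants end up with the same list of links; only the number of writes to the shared state object differs, which is what a per-object rate limit reacts to.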

  
  
Posted 2 years ago

If the solution is to create a PR I can definitely do that too!

  
  
Posted 2 years ago

Oh 😢 yes this is not good, let me see if we can quickly fix that

  
  
Posted 2 years ago