with a very large number of URLs that are not in the same S3 folder, without running into a usage limit due to the state.json file being updated
what do you mean the state.json is updated a lot?
I think that every time you call `add_external_files` the state.json is updated, but `add_external_files` can get a folder to scan, which would be more efficient. How are you using it?
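For context, a minimal sketch of the folder-scan usage being suggested here, assuming a ClearML SDK version with `Dataset.add_external_files`; the bucket path and dataset names are placeholders, not the asker's actual setup:

```python
from clearml import Dataset

# Hypothetical names/paths -- a sketch of the folder-scan approach.
dataset = Dataset.create(dataset_name="images", dataset_project="examples")

# Pointing add_external_files at an S3 prefix lets it scan the "folder" itself,
# so the links are registered in one call instead of one call per file.
dataset.add_external_files(source_url="s3://my-bucket/images/")

dataset.upload()    # uploads the dataset metadata; the files themselves stay in S3
dataset.finalize()
```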
Oh 😢 yes this is not good, let me see if we can quickly fix that
Ohh, yes that makes sense, so just send them as a list of links in a single call:
`dataset.source_url(["s3://", "s3://"], ...)`
This will be a single update
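A hedged sketch of what that single-call usage could look like, assuming an SDK version where `add_external_files` accepts a list for `source_url` (the dataset names and bucket paths below are invented placeholders):

```python
from clearml import Dataset

# Placeholder links -- a sketch of the "one call, list of links" suggestion above.
links = [
    "s3://my-bucket/2021/cam-a/img_000.jpg",
    "s3://my-bucket/archive/cam-b/img_001.jpg",
    # ... the rest of the scattered S3 links
]

dataset = Dataset.create(dataset_name="scattered_images", dataset_project="examples")

# Assumption: list support for source_url depends on the clearml version; the point
# of the thread is that this should trigger a single state.json update, not one per link.
dataset.add_external_files(source_url=links)

dataset.upload()
dataset.finalize()
```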
If the solution is to create a PR I can definitely do that too!
The use case is that we want a dataset of over 100k S3 images, and they are scattered all over our bucket due to how we organise those images. If I send an array of URLs as the `source_url` it will eventually fail after around 40 due to the GCS rate limit for updating the state.json file.
AgitatedDove14 I have tried that case, however if you look at the implementation you can see that what actually happens is a for loop that calls the method once per link. This is a problem because it will update my external state.json file over 100k times, and the cloud provider will block my requests after around 40 😅
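Roughly, the write pattern being described looks like the schematic below; this is a paraphrase for illustration, not the actual clearml source, and the helper functions are invented stand-ins:

```python
# Schematic paraphrase of the per-link behaviour described above -- NOT clearml's real code.

def register_link(url: str) -> None:
    """Stand-in for recording one external link in the dataset."""
    print(f"registered {url}")

def update_remote_state() -> None:
    """Stand-in for rewriting state.json in cloud storage (the expensive part)."""
    print("state.json rewritten")

source_urls = [f"s3://my-bucket/some/prefix/img_{i}.jpg" for i in range(5)]

for url in source_urls:       # one iteration per link => one state write per link;
    register_link(url)        # with 100k links that means ~100k state.json writes,
    update_remote_state()     # which trips the provider's rate limit after ~40
```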
I am adding 100k S3 images that are not in the same folder, and I can't move them in S3. So I have a list of S3 links, not a folder, unfortunately.