`add_external_files` with a very large number of URLs that are not in the same S3 folder, without running into a usage limit due to the `state.json` file being updated a lot?
Hi ShortElephant92
What do you mean the `state.json` is updated a lot? I think it is updated every time you call `add_external_files`, but `add_external_files` can also take a folder to scan, which would be more efficient. How are you using it?
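If the links did all live under one S3 prefix, the folder-scan form would look roughly like this (a minimal sketch; the dataset name, bucket, and prefix are placeholders):
```
from clearml import Dataset

# Create a new dataset version (names here are hypothetical)
dataset = Dataset.create(dataset_name="my_images", dataset_project="examples")

# Point add_external_files at a single S3 prefix; ClearML scans it
# recursively and registers all the links in one call, so the dataset
# state only needs to be written once for the whole folder.
dataset.add_external_files(source_url="s3://my-bucket/images/", recursive=True)

dataset.upload()
dataset.finalize()
```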
Oh 😢 yes this is not good, let me see if we can quickly fix that
Ohh, yes that makes sense, so just send them as a list of links in a single call: `dataset.add_external_files(source_url=["s3://", "s3://"], ...)`
This will be a single update
https://github.com/allegroai/clearml/blob/ff7b174bf162347b82226f413040ff6473401e92/clearml/datasets/dataset.py#L430
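For reference, the list form being suggested would look roughly like this (a sketch; the dataset name and links are placeholder values):
```
from clearml import Dataset

dataset = Dataset.create(dataset_name="my_images", dataset_project="examples")

# Pass all the external links as one list in a single call,
# rather than calling add_external_files once per URL yourself.
s3_links = [
    "s3://my-bucket/a/img_0001.jpg",  # hypothetical links
    "s3://my-bucket/b/img_0002.jpg",
]
dataset.add_external_files(source_url=s3_links)

dataset.upload()
dataset.finalize()
```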
If the solution is to create a PR I can definitely do that too!
AgitatedDove14 I have tried that case, however if you look at the implementation you can see that what actually happens is a for loop that repeatedly calls the method. This is a problem because it will update my external `state.json` file over 100k times, and the cloud provider blocks my requests after around 40 😅
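To make the issue concrete, the behaviour being described is roughly the pattern below (a simplified illustration only, not the actual ClearML source; `_write_state` is a hypothetical name for the per-call state update):
```
def add_external_files(self, source_url, **kwargs):
    # When a list is passed, the method effectively calls itself
    # once per link ...
    if isinstance(source_url, (list, tuple)):
        for url in source_url:
            self.add_external_files(source_url=url, **kwargs)
        return
    # ... register the single link ...
    # ... and each call ends with a remote state.json write,
    # so 100k links means roughly 100k writes and a rate-limit error.
    self._write_state()
```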
The use case is that we want a dataset of over 100k S3 images, and they are scattered all over our bucket due to how we organise those images. If I send an array of URLs as the `source_url`, it will eventually fail after around 40 due to the GCS rate limit for updating the `state.json`.
I am adding 100k S3 images that are not in the same folder, and I can't move them in S3. So I have a list of S3 links, not a folder, unfortunately.