Hi, Relating To The

Answered

Hi, regarding the Dataset SDK: is there a way to call add_external_files with a very large number of URLs that are not in the same S3 folder, without running into a usage limit caused by the state.json file being updated so often?

  				
Posted 2 years ago by ShortElephant92

Answers 7

I am adding 100k S3 images that are not in the same folder, and I can't move them within S3. So I have a list of S3 links, not a folder, unfortunately.

  				
Posted 2 years ago by ShortElephant92

"add_external_files with a very large number of urls that are not in the same S3 folder without running into a usage limit due to the state.json file being updated a lot?"

Hi ShortElephant92,
what do you mean by the state.json being updated a lot?
I think it is updated every time you call add_external_files, but add_external_files can also take a folder to scan, which would be more efficient. How are you using it?

  				
Posted 2 years ago by AgitatedDove14

Oh 😢 yes, this is not good. Let me see if we can quickly fix that.

  				
Posted 2 years ago by AgitatedDove14

Ohh, yes, that makes sense. So just send them as a list of links in a single call:
dataset.add_external_files(source_url=["s3://", "s3://"], ...)
This will be a single update.
https://github.com/allegroai/clearml/blob/ff7b174bf162347b82226f413040ff6473401e92/clearml/datasets/dataset.py#L430

  				
Posted 2 years ago by AgitatedDove14
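
The single-call suggestion above could look roughly like the following sketch. It assumes that add_external_files accepts a list for source_url, as the reply implies. The helper name register_links, the bucket paths, and the dataset and project names are hypothetical. As reported elsewhere in this thread, a single call did not by itself avoid repeated state.json updates in the version under discussion.

```python
from typing import Iterable, List


def register_links(dataset, urls: Iterable[str]) -> List[str]:
    """Pass all links to add_external_files in one call.

    `dataset` is expected to be a clearml.Dataset; for this sketch, any
    object exposing add_external_files(source_url=...) will do.
    """
    # Drop duplicate links while keeping the original order.
    unique_urls = list(dict.fromkeys(urls))
    # One call with the full list instead of one call per link.
    dataset.add_external_files(source_url=unique_urls)
    return unique_urls


# Hypothetical usage (requires clearml and configured credentials):
#   from clearml import Dataset
#   ds = Dataset.create(dataset_name="scattered-images", dataset_project="demo")
#   register_links(ds, ["s3://my-bucket/a/img1.jpg", "s3://my-bucket/z/img2.jpg"])
#   ds.upload()
#   ds.finalize()
```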

If the solution is to create a PR I can definitely do that too!

  				
Posted 2 years ago by ShortElephant92

AgitatedDove14 I have tried that. However, if you look at the implementation, you can see that what actually happens is a for loop that calls the method once per URL. This is a problem because it will update my external state.json file over 100k times, and the cloud provider will block my requests after around 40 😅

  				
Posted 2 years ago by ShortElephant92
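
The difference between the per-URL behaviour described above and a batched update can be illustrated with a toy model. This is not ClearML's code: StateStore, add_links_one_by_one, and add_links_batched are hypothetical names, and the store only counts how often the state would be written to remote storage.

```python
from typing import Dict, Iterable


class StateStore:
    """Stand-in for a remote state.json; it only counts writes."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}
        self.writes = 0

    def save(self) -> None:
        # In a real setup this would be an upload to object storage.
        self.writes += 1


def add_links_one_by_one(store: StateStore, urls: Iterable[str]) -> None:
    for url in urls:
        store.entries[url] = url
        store.save()  # one remote write per link


def add_links_batched(store: StateStore, urls: Iterable[str]) -> None:
    for url in urls:
        store.entries[url] = url
    store.save()  # single remote write after all links are recorded
```

In this model, 100k links cost 100k writes with the first function and exactly one write with the second, which is the behaviour a fix would need in order to stay under a storage rate limit.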

The use case is that we want a dataset of over 100k S3 images, and they are scattered all over our bucket because of how we organise those images. If I send an array of URLs as the source_url, it will eventually fail after around 40 due to the GCS rate limit for updating the state.json.

  				
Posted 2 years ago by ShortElephant92
