Hi all, Is there a limit to the maximum size, or number of files a dataset can have when uploading to ClearML self-hosted?
We got this error when finalizing the uploaded dataset

Action failed <500/0: events.add_batch/v1.0 (Update failed (Resulting document after update is larger than 16777216, full error: {'index': 0, 'code': 17419, 'errmsg': 'Resulting document after update is larger than 16777216'}))>

Which I have found out is a MongoDB error: something is hitting the maximum document size (16777216 bytes = 16 MB).
The dataset files themselves are stored on S3, so there is no size limit there.

So I am assuming this is some dataset metadata, perhaps a list of files or hashes or something?
Has anyone seen this before, and any tips on how to work around it?

When creating the dataset, we iterate through a list of files and call Dataset.add_files for each individual file, since they are located in different paths rather than all in one folder. Could this be the culprit?

Posted one year ago

Answers 2

Thanks for your reply 🙂

We worked around the bug by calling Dataset.add_files only once per folder containing files (~120 calls) using a wildcard, rather than once per individual file (~75,000 calls).
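The grouping step can be sketched in plain Python (the folder layout and file names are illustrative, and the commented-out ClearML calls only indicate where the per-folder add_files would go):

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_folder(paths):
    """Group file paths by parent directory, so each folder needs
    only one Dataset.add_files call with a wildcard."""
    folders = defaultdict(list)
    for p in paths:
        p = PurePosixPath(p)
        folders[str(p.parent)].append(p.name)
    return dict(folders)

paths = [
    "/data/audio/a/feat1.npy",
    "/data/audio/a/feat2.npy",
    "/data/audio/b/feat3.npy",
]
groups = group_by_folder(paths)
# groups == {"/data/audio/a": ["feat1.npy", "feat2.npy"],
#            "/data/audio/b": ["feat3.npy"]}

# Hypothetical per-folder upload, one call per folder instead of per file:
# for folder in groups:
#     dataset.add_files(path=folder, wildcard="*.npy")
```

With ~75,000 files spread over ~120 folders, this turns tens of thousands of add_files calls into roughly a hundred.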

I am unsure what effect this has, but I assume some log or other metadata was being created by each add_files call, and calling it fewer times kept the MongoDB document smaller.
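A back-of-the-envelope check is consistent with this guess (the per-call metadata size below is a pure assumption for illustration, not ClearML's actual overhead):

```python
MONGO_DOC_LIMIT = 16_777_216    # MongoDB's 16 MB BSON document cap
ENTRY_BYTES = 200               # assumed metadata added per add_files call

per_file_total = 75_000 * ENTRY_BYTES   # one call per file
per_folder_total = 120 * ENTRY_BYTES    # one call per folder

print(per_file_total)    # 15,000,000 bytes -- uncomfortably close to the cap
print(per_folder_total)  # 24,000 bytes -- nowhere near it
```

Even a few hundred bytes per call puts 75,000 calls within reach of the 16 MB limit, while ~120 calls stay negligible.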

MongoDB has a way to store documents larger than the 16 MB limit using GridFS, which may be a solution for large documents; alternatively, an optimisation could reduce the size of this document.

I will create an issue; I am working on a code snippet that demonstrates the problem in a repeatable way with dummy data.

We are working with a custom dataset made up of numpy files that contain audio features. We have 75,000 files in this particular dataset, each about 500 kB at most.

The bug seems to be related to the number of times add_files is called, rather than to the size or number of files.

Posted one year ago

Hi @<1590152201068613632:profile|StaleLeopard22> , this might indeed be the list of files, and if so, this is simply a bug (since the MongoDB document size is indeed limited). In that case we should simply truncate the list somehow. Can you provide more info on the number and nature of the files? If you can, I'd appreciate a GitHub issue so we can fix this properly.

Posted one year ago