Hi @<1590152201068613632:profile|StaleLeopard22> , this might indeed be the list of files, and if so, this is simply a bug (since the MongoDB document size is indeed limited). In that case we should simply truncate the list somehow. Can you provide more info on the number and nature of the files? If you can, I'd appreciate a GitHub issue so we can fix this properly
Thanks for your reply 🙂
We worked around the bug by calling Dataset.add_files only once per folder that contains files (~120 calls) using a wildcard, rather than once for each individual file (~75,000 calls).
I am unsure exactly what effect this has, but I assume the add_files method was creating some log or other metadata on every call, so calling it fewer times made the MongoDB document smaller?
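For reference, a minimal sketch of the per-folder workaround described above. The helper name add_files_per_folder, the "*.npy" pattern, and the dataset name/project in the usage comment are hypothetical placeholders, and passing dataset_path to keep the folder layout is an assumption about what we want rather than something the workaround requires:

```python
from pathlib import Path


def add_files_per_folder(dataset, root, pattern="*.npy"):
    """Call dataset.add_files once per folder that directly contains matches.

    `dataset` is expected to behave like a clearml Dataset (hypothetical helper).
    Returns the list of folders that were added.
    """
    root = Path(root)
    # Collect every folder that directly holds at least one matching file.
    folders = sorted({p.parent for p in root.rglob(pattern) if p.is_file()})
    for folder in folders:
        rel = folder.relative_to(root)
        dataset.add_files(
            path=str(folder),
            wildcard=pattern,
            recursive=False,  # each folder is handled by its own call
            # Assumption: mirror the local folder layout inside the dataset.
            dataset_path=None if rel == Path(".") else rel.as_posix(),
        )
    return folders


# Usage sketch (names are placeholders, not from the thread):
# from clearml import Dataset
# ds = Dataset.create(dataset_name="audio-features", dataset_project="examples")
# add_files_per_folder(ds, "/path/to/features")
# ds.upload()
# ds.finalize()
```

With ~120 folders this makes ~120 add_files calls instead of ~75,000, which is what seemed to keep the document under the limit for us.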
MongoDB can store data larger than the 16MB document limit using GridFS, which may be the solution for large documents. Alternatively, an optimisation could reduce the size of this document.
I will create an issue. I am working on a code snippet that reproduces the problem in a repeatable way with dummy data.
We are working with a custom dataset made up of numpy files that contain audio features. We have 75,000 files in this particular dataset. Each file is at most about 500kB.
The bug seems to be related to the number of times add_files is called rather than the size or number of files.