Hi all, Is there a limit to the maximum size, or number of files a dataset can have when uploading to ClearML self-hosted?
We got this error when finalizing the uploaded dataset

Action failed <500/0: events.add_batch/v1.0 (Update failed (Resulting document after update is larger than 16777216, full error: {'index': 0, 'code': 17419, 'errmsg': 'Resulting document after update is larger than 16777216'}))>

Which I have found out is a MongoDB error: something is hitting the maximum document size (16777216 bytes = 16 MB).
The dataset files themselves are stored on S3, so there is no size limit there.

So I am assuming this is some dataset metadata, perhaps a list of files or hashes or something?
Has anyone seen this before, and any tips on how to work around it?

When creating the dataset, we iterate through a list of files and call Dataset.add_files for each individual file, since they are located in different paths rather than all in one folder. Could this be the culprit?

Posted one year ago

Answers 2

Thanks for your reply 🙂

We worked around the bug by calling Dataset.add_files only once per folder containing files (~120 calls) using a wildcard, rather than once per individual file (~75,000 calls).
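The grouping step can be sketched in plain Python (the folder layout and file names are illustrative, and the commented-out ClearML calls only indicate where the per-folder add_files would go):

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_folder(paths):
    """Group file paths by parent directory, so each folder needs
    only one Dataset.add_files call with a wildcard."""
    folders = defaultdict(list)
    for p in paths:
        p = PurePosixPath(p)
        folders[str(p.parent)].append(p.name)
    return dict(folders)

paths = [
    "/data/audio/a/feat1.npy",
    "/data/audio/a/feat2.npy",
    "/data/audio/b/feat3.npy",
]
groups = group_by_folder(paths)
# groups == {"/data/audio/a": ["feat1.npy", "feat2.npy"],
#            "/data/audio/b": ["feat3.npy"]}

# Hypothetical per-folder upload, one call per folder instead of per file:
# for folder in groups:
#     dataset.add_files(path=folder, wildcard="*.npy")
```

With ~75,000 files spread over ~120 folders, this turns tens of thousands of add_files calls into roughly a hundred.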

I am unsure what effect this has, but I assume some log or other metadata was being created by each add_files call, and calling it fewer times kept the MongoDB document smaller.
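A back-of-the-envelope check is consistent with this guess (the per-call metadata size below is a pure assumption for illustration, not ClearML's actual overhead):

```python
MONGO_DOC_LIMIT = 16_777_216    # MongoDB's 16 MB BSON document cap
ENTRY_BYTES = 200               # assumed metadata added per add_files call

per_file_total = 75_000 * ENTRY_BYTES   # one call per file
per_folder_total = 120 * ENTRY_BYTES    # one call per folder

print(per_file_total)    # 15,000,000 bytes -- uncomfortably close to the cap
print(per_folder_total)  # 24,000 bytes -- nowhere near it
```

Even a few hundred bytes per call puts 75,000 calls within reach of the 16 MB limit, while ~120 calls stay negligible.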

MongoDB has a way to store documents larger than the 16 MB limit using GridFS, which may be a solution for large documents; alternatively, an optimisation could reduce the size of this document.

I will create an issue; I am working on a code snippet that demonstrates the problem in a repeatable way with dummy data.

We are working with a custom dataset made up of numpy files that contain audio features. We have 75,000 files in this particular dataset, each about 500 kB at most.

The bug seems to be related to the number of times add_files is called, rather than to the size or number of files.

Posted one year ago

Hi @<1590152201068613632:profile|StaleLeopard22> , this might indeed be the list of files, and if so, this is simply a bug (since the MongoDB document size is indeed limited). In that case we should simply truncate the list somehow. Can you provide more info on the number and nature of the files? If you can, I'd appreciate a GitHub issue so we can fix this properly.

Posted one year ago