MongoDB: does document size affect query performance?

Just wanted to share my experience when dealing with large documents in MongoDB... don't do it!

We made the mistake of allowing users to embed base64-encoded files (usually images and screenshots) in documents. We ended up with a collection of ~500k documents ranging from 2 MB to 10 MB each.

Running a simple aggregation on this collection would bring down the cluster!

Aggregation queries can be very heavy in MongoDB, especially with large documents like these. Aggregation pipelines can only use indexes under certain conditions, and since we needed to $group, indexes were not being used and MongoDB had to scan every document.
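
As a rough sketch of the difference, using pymongo (the collection and field names here are hypothetical): a bare $group reads the whole collection, while a $match on an indexed field as the first stage is one of the conditions under which the pipeline can use an index.

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient()          # assumes a local mongod
coll = client.mydb.documents   # hypothetical collection of large documents

# A bare $group cannot use an index: every document is read, so with
# multi-MB documents the I/O alone can overwhelm the cluster.
full_scan = [{"$group": {"_id": "$status", "count": {"$sum": 1}}}]
for row in coll.aggregate(full_scan):
    print(row)

# A $match on an indexed field as the *first* stage lets the pipeline
# use the index to shrink the set of documents that reach $group.
coll.create_index("createdAt")
recent = [
    {"$match": {"createdAt": {"$gte": datetime.utcnow() - timedelta(days=7)}}},
    {"$group": {"_id": "$status", "count": {"$sum": 1}}},
]
for row in coll.aggregate(recent):
    print(row)
```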

The exact same query on a collection of smaller documents executed very quickly, with much lower resource consumption.

Hence, querying large documents in MongoDB can have a big impact on performance, especially for aggregations.

Also, if you know that a document will continue to grow after it is created (e.g. appending log events to a given entity), consider creating a separate collection for these child items, because document size can also become a problem in the future. A sketch of that pattern follows.
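
A minimal version of the split, again with pymongo and hypothetical collection names: the parent entity stays small, and each log event is its own document linked back by a parent reference.

```python
from datetime import datetime
from pymongo import MongoClient

db = MongoClient().mydb   # assumes a local mongod

# Instead of $push-ing events into an ever-growing array on the entity,
# keep the entity small and store each event in its own collection.
entity_id = db.entities.insert_one({"name": "checkout-service"}).inserted_id
db.entity_events.insert_one({
    "entity_id": entity_id,        # reference back to the parent document
    "at": datetime.utcnow(),
    "message": "deployment started",
})

# Reading the events becomes an indexed lookup instead of loading
# one huge, always-growing document.
db.entity_events.create_index("entity_id")
events = list(db.entity_events.find({"entity_id": entity_id}))
```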

Bruno.


One way to rephrase the question is to ask: does a query over 1 million documents take longer if the documents are 16 MB each rather than 16 KB each?

Correct me if I'm wrong, but from my own experience, the smaller the document size, the faster the query.

I've run queries on 500k documents vs. 25k documents, and the 25k query was noticeably faster, anywhere from a few milliseconds to 1-3 seconds faster. In production the time difference is about 2x-10x greater.

The one aspect where document size determines whether a query will run at all is sorting: an in-memory sort has a fixed memory limit, so the larger the documents, the fewer of them fit under it. I've hit this limit numerous times trying to sort as few as 2k documents.

More references with some solutions here:
https://docs.mongodb.org/manual/reference/limits/#operations
https://docs.mongodb.org/manual/reference/operator/aggregation/sort/#sort-memory-limit
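
Two common workarounds, sketched with pymongo (collection and field names are hypothetical): an index on the sort key avoids the in-memory sort entirely, and for aggregation pipelines allowDiskUse lets $sort spill to disk instead of aborting.

```python
from pymongo import MongoClient

coll = MongoClient().mydb.orders   # hypothetical collection

# Option 1: an index on the sort key lets MongoDB walk the index in
# order, so no in-memory sort (and no memory limit) is involved.
coll.create_index("createdAt")
newest = list(coll.find({}).sort("createdAt", -1).limit(100))

# Option 2: in an aggregation pipeline, allowDiskUse lets the $sort
# stage write temporary files instead of failing at the memory limit.
pipeline = [{"$sort": {"total": -1}}, {"$limit": 100}]
top = list(coll.aggregate(pipeline, allowDiskUse=True))
```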

At the end of the day, it's the end user that suffers.

When I attempt to remedy large queries causing unacceptably slow performance, I usually find myself creating a new collection with a subset of the data and querying it with tight conditions along with a sort and a limit.
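
Roughly what that looks like in pymongo, assuming hypothetical collection and field names: $out materializes a slim subset, and reads against it stay cheap.

```python
from pymongo import MongoClient

db = MongoClient().mydb   # assumes a local mongod

# Materialize a small subset: filter out inactive documents and drop
# the heavy fields, writing the result to a new collection via $out.
db.big_docs.aggregate([
    {"$match": {"active": True}},
    {"$project": {"userId": 1, "score": 1, "updatedAt": 1}},
    {"$out": "big_docs_summary"},
])

# Queries then run against the slim collection with tight conditions,
# a sort, and a limit to keep the working set small.
db.big_docs_summary.create_index([("score", -1)])
top = list(
    db.big_docs_summary
      .find({"updatedAt": {"$exists": True}})
      .sort("score", -1)
      .limit(20)
)
```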

Hope this helps!


First of all, you should spend a little time reading up on how MongoDB stores documents, with reference to padding factors and powerOf2Sizes allocation:

http://docs.mongodb.org/manual/core/storage/
http://docs.mongodb.org/manual/reference/command/collStats/#collStats.paddingFactor

Put simply, MongoDB tries to allocate some additional space when storing your original document to allow for growth. PowerOf2Sizes allocation became the default approach in version 2.6; it rounds each record's allocation up to a power of 2 rather than padding by a fixed factor.
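
A worked example of the rounding, under the simplifying assumption that the allocation is just the next power of 2 above the document size:

```python
def power_of_2_allocation(doc_bytes: int) -> int:
    """Round a document size up to the next power of 2 - a simplified
    model of powerOf2Sizes record allocation."""
    size = 32   # illustrative minimum allocation
    while size < doc_bytes:
        size *= 2
    return size

# A 3,000-byte document gets a 4,096-byte record, leaving roughly 1 KB
# of headroom before an update would force the document to move.
print(power_of_2_allocation(3000))   # 4096
print(power_of_2_allocation(5000))   # 8192
```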

Overall, performance will be much better if all updates fit within the original space allocation. If they don't, the entire document has to be moved somewhere else with enough space, causing extra reads and writes and, in effect, fragmenting your storage.

If your documents are really going to grow in size by a factor of 10x to 20x over time, that could mean multiple moves per document, which, depending on your insert, update, and read frequency, could cause issues. If that is the case, there are a couple of approaches you can consider:

1) Allocate enough space on initial insertion to cover most (say 90%) of a normal document's lifetime growth. While this is inefficient in space usage at the beginning, efficiency increases over time as the documents grow into the allocation without any performance penalty. In effect you pay ahead of time for storage that you will eventually use, in exchange for good performance over time. (A sketch of this preallocation trick follows after the list.)

2) Create "overflow" documents - say a typical 80-20 rule applies and 80% of your documents fit within a certain size. Allocate for that amount, and add an overflow collection that a document can point to if it has, for example, more than 100 friends or 100 Game documents. The overflow field points to a document in this new collection, and your app only looks in the new collection when the overflow field exists. This allows normal document processing for 80% of users, and avoids wasting a lot of storage on the majority of user documents that won't need it, at the expense of additional application complexity. (A sketch follows below.)
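
A minimal sketch of the preallocation trick in pymongo, assuming hypothetical names and an MMAPv1-era record allocator: insert with a throwaway filler field sized for expected growth, then remove it - the record keeps its allocated size, so later updates fit in place.

```python
from pymongo import MongoClient

coll = MongoClient().mydb.players   # hypothetical collection

# Insert with a filler field sized for ~90% of expected lifetime growth,
# then unset it. The record keeps its allocation, so later $push updates
# fit in place instead of forcing a document move.
doc = {"name": "alice", "friends": [], "_filler": "x" * 8192}
_id = coll.insert_one(doc).inserted_id
coll.update_one({"_id": _id}, {"$unset": {"_filler": ""}})
```

And a sketch of the overflow pattern, with a hypothetical users/user_overflow pair and a 100-friend threshold:

```python
from pymongo import MongoClient

db = MongoClient().mydb
MAX_INLINE_FRIENDS = 100   # hypothetical threshold covering ~80% of users

def add_friend(user_id, friend_id):
    user = db.users.find_one({"_id": user_id}, {"friends": 1, "overflow": 1})
    if len(user.get("friends", [])) < MAX_INLINE_FRIENDS:
        # Normal path: the friend list still fits inside the user document.
        db.users.update_one({"_id": user_id}, {"$push": {"friends": friend_id}})
        return
    # Overflow path: create the overflow document on first use, then
    # spill additional friends into it.
    if "overflow" not in user:
        ov = db.user_overflow.insert_one({"user_id": user_id, "friends": []})
        db.users.update_one({"_id": user_id},
                            {"$set": {"overflow": ov.inserted_id}})
        user["overflow"] = ov.inserted_id
    db.user_overflow.update_one({"_id": user["overflow"]},
                                {"$push": {"friends": friend_id}})
```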

In either case I'd consider using covered queries by building the appropriate indexes:

A covered query is a query in which:

all the fields in the query are part of an index, and
all the fields returned in the results are in the same index.

Because the index “covers” the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query.

Querying only the index can be much faster than querying documents outside of the index. Index keys are typically smaller than the documents they catalog, and indexes are typically available in RAM or located sequentially on disk.

More on that approach here: http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/
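
For illustration, a covered query in pymongo against a hypothetical users collection - note the projection must exclude _id, since _id is not part of the index:

```python
from pymongo import MongoClient

coll = MongoClient().mydb.users   # hypothetical collection

# The index contains every field the query filters on *and* returns.
coll.create_index([("status", 1), ("email", 1)])

# Excluding _id keeps the result entirely within the index, so MongoDB
# never has to fetch the documents themselves.
cursor = coll.find({"status": "active"}, {"_id": 0, "status": 1, "email": 1})

# The winning plan should show an IXSCAN with no FETCH stage.
print(cursor.explain()["queryPlanner"]["winningPlan"])
```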