What is retrieved from disk during a query?

The physical storage for rows is described in the docs, under Database Page Layout. The column contents for a given row are all stored in the same disk page, with the notable exception of TOAST'ed contents (too large to fit in a page). Contents are extracted sequentially within each row, as explained there:

To read the data you need to examine each attribute in turn. First check whether the field is NULL according to the null bitmap. If it is, go to the next. Then make sure you have the right alignment. If the field is a fixed width field, then all the bytes are simply placed.
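
If you want to see that layout for yourself, the pageinspect extension can list the tuples stored in a single heap page. A minimal sketch, assuming a table named my_table (pageinspect requires superuser privileges):

    -- Peek at the line pointers of heap page 0 of "my_table".
    CREATE EXTENSION IF NOT EXISTS pageinspect;

    SELECT lp, lp_off, lp_len, t_ctid
    FROM   heap_page_items(get_raw_page('my_table', 0));
    -- Each output row is one stored tuple; lp_len is the length of the
    -- whole row, all of its inline column contents included.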

In the simplest case (no TOAST'ed columns), Postgres will fetch the entire row even if only a few columns are needed. So in this case, the answer is yes: having more columns may have a clear adverse impact by wasting buffer cache, particularly if the column contents are large while still under the TOAST threshold.
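
You can observe this with EXPLAIN (ANALYZE, BUFFERS): selecting a single small column from a wide (but not TOAST'ed) table touches about as many shared buffers as selecting everything, because whole pages are read either way. A sketch, with hypothetical names wide_table and small_col:

    -- Both queries read the same heap pages.
    EXPLAIN (ANALYZE, BUFFERS) SELECT small_col FROM wide_table;
    EXPLAIN (ANALYZE, BUFFERS) SELECT *         FROM wide_table;
    -- Compare the "Buffers: shared hit/read" lines in the two plans:
    -- they should be roughly identical, since whole 8 kB pages are read.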

Now the TOAST case: when a row grows beyond ~2 kB, the engine compresses the largest fields and/or stores their contents in a separate physical table until the row fits again. TOAST also comes into play because a row must fit into a single page (8 kB by default): fields that would push it past that limit are moved to TOAST storage. The docs say:

If it's a variable length field (attlen = -1) then it's a bit more complicated. All variable-length data types share the common header structure struct varlena, which includes the total length of the stored value and some flag bits. Depending on the flags, the data can be either inline or in a TOAST table; it might be compressed, too.
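
One way to see those flags at work is to compare a value's logical length with the space it actually occupies, using pg_column_size(). A sketch with a hypothetical documents(body) table:

    -- octet_length() returns the full logical size of the value,
    -- pg_column_size() the bytes it occupies as stored (possibly compressed).
    SELECT octet_length(body)   AS logical_bytes,
           pg_column_size(body) AS stored_bytes
    FROM   documents
    LIMIT  5;
    -- A stored size well below the logical length means the value is
    -- compressed; values over ~2 kB are also candidates for out-of-line
    -- storage in the TOAST table.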

TOAST'ed contents are not fetched unless they are explicitly needed; only a small pointer per column remains in the main row, so their effect on the total number of pages to fetch is small. This explains the results in @dezso's answer.
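
To see where that data actually lives, you can look up the TOAST relation behind a table in pg_class. Another sketch with the hypothetical documents table (assumed to be large enough to actually have a TOAST table):

    -- Locate the TOAST relation and compare on-disk sizes.
    SELECT reltoastrelid::regclass                        AS toast_table,
           pg_size_pretty(pg_relation_size(oid))           AS main_size,
           pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
    FROM   pg_class
    WHERE  relname = 'documents';
    -- The large values live in toast_table; the main relation keeps only
    -- small pointers, so its page count stays low.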

As for writes, every UPDATE writes a new version of the entire row with all its columns, no matter which columns actually changed. So having more columns is obviously more costly for writes.
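
This is easy to verify by watching a row's physical location (ctid) across an UPDATE; a sketch with a hypothetical accounts table:

    -- Updating a single column still writes a complete new row version.
    SELECT ctid FROM accounts WHERE id = 1;   -- e.g. (0,1)
    UPDATE accounts SET balance = balance + 1 WHERE id = 1;
    SELECT ctid FROM accounts WHERE id = 1;   -- e.g. (0,8): a new tuple
                                              -- holding all columns was written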


Daniel's answer focuses on the cost of reading individual rows. In this context:

- Putting fixed-size NOT NULL columns first in your table helps a little.
- Putting relevant columns first (the ones you query for) helps a little.
- Minimizing padding (due to data alignment) by playing alignment tetris with your columns can help a little (demonstrated in the sketch just below).

But the most important effect has not been mentioned yet, especially for big tables.
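
As an aside, the padding point is easy to demonstrate with pg_column_size() on row values; a minimal sketch (the byte counts shown are typical for 64-bit builds and may vary):

    -- Same two values, different declared order.
    SELECT pg_column_size(ROW(0::int, 0::bigint)) AS int_then_bigint,  -- 40: 4 bytes of padding
           pg_column_size(ROW(0::bigint, 0::int)) AS bigint_then_int;  -- 36: no padding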

Additional columns obviously make a row cover more disk space, so fewer rows fit on one data page (8 kB by default) and the same set of rows is spread out over more pages. The database engine generally has to fetch whole pages, not individual rows. It matters little whether individual rows are somewhat smaller or bigger, as long as the same number of pages has to be read.
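
You can check how densely rows are packed from the planner statistics in pg_class; a sketch with a hypothetical my_table:

    -- relpages / reltuples are estimates maintained by ANALYZE and VACUUM.
    ANALYZE my_table;
    SELECT relpages,
           reltuples,
           round(reltuples / NULLIF(relpages, 0)) AS rows_per_page_estimate
    FROM   pg_class
    WHERE  relname = 'my_table';
    -- Wider rows mean fewer rows per page, and therefore more pages to
    -- read for the same number of rows.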

If a query, supported by an index, fetches a (relatively) small portion of a big table, and the rows are spread out more or less randomly over the whole table, it will result in roughly the same number of page reads regardless of row size. Irrelevant columns will not slow you down much in such a (rare) case.

Typically, though, you will fetch patches or clusters of rows that were entered in sequence or in proximity and therefore share data pages. Because those rows are spread out over more pages by the extra clutter, more disk pages have to be read to satisfy your query. Having to read more pages is typically the most important reason for a query to be slow, and that is the most important reason why irrelevant columns make your queries slower.
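
How well your rows actually cluster this way is something PostgreSQL tracks as the correlation statistic; a sketch with hypothetical names orders and created_at:

    -- correlation near 1 (or -1): physical row order follows the column,
    -- so neighbouring values share pages and range queries touch few pages.
    -- correlation near 0: rows are scattered, and the same query touches many pages.
    SELECT tablename, attname, correlation
    FROM   pg_stats
    WHERE  tablename = 'orders'
    AND    attname   = 'created_at';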

With big databases, there is typically not enough RAM to keep all of the data in cache. Bigger rows occupy more cache, which means more contention, fewer cache hits and more disk I/O. And disk reads are typically much more expensive than reads from memory; less so with SSDs, but a substantial difference remains. This adds to the above point about page reads.
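
You can get a feel for how well a table fits in the cache from pg_statio_user_tables (keeping in mind that a "read" here may still be served from the OS cache rather than the disk):

    -- Rough shared-buffer hit ratio per table.
    SELECT relname,
           heap_blks_hit,
           heap_blks_read,
           round(100.0 * heap_blks_hit
                 / NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct
    FROM   pg_statio_user_tables
    ORDER  BY heap_blks_read DESC
    LIMIT  10;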

It matters less if irrelevant columns are TOAST-ed, since only a small pointer remains in the main row. But relevant columns may be TOAST-ed as well, and fetching those brings back much of the same effect in the form of extra page reads from the TOAST table.