What does it mean to "materialize"?

Contrary to a popular misconception, "materializing" has nothing to do with writing anything to disk or to a "storage" layer. While executing a query you have two distinct concepts:

  • Pipelining
  • Materializing

The basic idea here is that, given something like a WHERE clause, you can either

  • Add selectivity in the retrieval, or
  • Add selectivity at the end of the pipeline.

The process is stateless. Ultimately a pipeline gets constructed and somewhere in that pipeline the magic happens (with the general rule being the earlier the better).
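The two placements of the filter can be sketched with Python generators. This is only an analogy, not how any particular engine is written: scan, select, project, and the sample table are hypothetical names invented for the illustration. Each stage pulls one row at a time from the stage below it, and a counter shows how much work the later stage does depending on where the selectivity is applied.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]


def scan(table: Iterable[Row]) -> Iterator[Row]:
    """Yield rows one at a time, like a sequential scan (hypothetical helper)."""
    for row in table:
        yield row


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass through only the rows that satisfy the predicate (the WHERE clause)."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows: Iterable[Row], columns: List[str], stats: Dict[str, int]) -> Iterator[Row]:
    """Keep only the named columns and count how many rows reach this stage."""
    for row in rows:
        stats["projected"] += 1
        yield {c: row[c] for c in columns}


# Made-up sample data: qty runs from 10 to 50.
table = [{"id": i, "qty": i * 10} for i in range(1, 6)]
pred = lambda r: r["qty"] > 30

# Selectivity added early: the filter sits directly on top of the scan.
early_stats = {"projected": 0}
early = list(project(select(scan(table), pred), ["id", "qty"], early_stats))

# Selectivity added at the end of the pipeline: every row is projected first.
late_stats = {"projected": 0}
late = list(select(project(scan(table), ["id", "qty"], late_stats), pred))

print(early == late, early_stats, late_stats)
```

Both pipelines return the same rows and neither one buffers an intermediate result. The only difference is how many rows the projection has to touch, which is why earlier is generally better.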

Let's look at a sort. How do you sort 1,5,3 in a pipeline? You can't. That means if the table isn't clustered by that field, your only option is to

  • Declare a thing, a "result set"
  • Finish writing that "result set"
  • Process it (sort in this case)
  • Move on

That "thing", more typically called a "materialized result set", is a "relation", and it's typically modeled like any other table, subject to all of the same operations.
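The four steps for the sort can be sketched as a blocking operator in Python. This is a minimal illustration under my own assumptions, not PostgreSQL's or MySQL's implementation: scan, sort_node, and the log list are hypothetical names. The log records the order of events, so you can see that nothing is emitted until the whole input has been written to the result set.

```python
from typing import Iterable, Iterator, List


def scan(values: Iterable[int], log: List[str]) -> Iterator[int]:
    """Yield values one at a time and record each read (hypothetical helper)."""
    for v in values:
        log.append(f"scan {v}")
        yield v


def sort_node(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """A blocking operator: it must consume all input before emitting anything."""
    result_set: List[int] = []   # 1. declare the "result set"
    for row in rows:             # 2. finish writing the "result set"
        result_set.append(row)
    log.append("materialized")
    result_set.sort()            # 3. process it (sort in this case)
    for row in result_set:       # 4. move on: feed the rest of the pipeline
        log.append(f"emit {row}")
        yield row


log: List[str] = []
ordered = list(sort_node(scan([1, 5, 3], log), log))
print(ordered)
print(log)
```

Here the result set is just an in-memory list. Whether a real engine keeps it in memory or spills it to disk is a separate decision from the fact that it had to be materialized.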

Materialization can be a result of the demands of the problem (like sorting), a shortage of resources, or planner limitations. For example, in PostgreSQL a CTE is an optimization fence (always so before version 12; since then the planner can inline a non-recursive, side-effect-free CTE that is referenced only once, unless it is declared MATERIALIZED). That means optimizations on the outside of the CTE cannot be pushed down into the CTE. Why? Because at the point the CTE is done, its results are in a buffer somewhere (or on disk) and the step of pipelining is over.

Real World Validation

  • PostgreSQL: I would urge you to search through the whole codebase for "materialization". It's used frequently. You can see one of the popular cases in nodeMaterial.c, where a call is made to tuplestore_begin_heap to get a place to store a result set.
  • MySQL uses it in the context of semijoin, subquery, aggregate, and window function code. Further, they have a hint, MATERIALIZATION, which is documented as follows:

    The optimizer uses materialization to enable more efficient subquery processing. Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory. The first time MySQL needs the subquery result, it materializes that result into a temporary table. Any subsequent time the result is needed, MySQL refers again to the temporary table. The optimizer may index the table with a hash index to make lookups fast and inexpensive. The index contains unique values to eliminate duplicates and make the table smaller.
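The behavior described in that passage can be mimicked in a few lines of Python. This is a rough analogy and not MySQL's code: MaterializedSubquery, subquery, and the sample ids are hypothetical, and a Python set stands in for the in-memory temporary table with its duplicate-eliminating hash index.

```python
from typing import Callable, Iterable, Optional, Set


class MaterializedSubquery:
    """Runs the subquery once, on first use, and answers later probes from the copy."""

    def __init__(self, produce_rows: Callable[[], Iterable[int]]) -> None:
        self._produce_rows = produce_rows
        self._table: Optional[Set[int]] = None  # the "temporary table"
        self.executions = 0                     # how often the subquery really ran

    def contains(self, value: int) -> bool:
        if self._table is None:
            # First time the result is needed: materialize it.
            # A set keeps unique values only and gives hash lookups.
            self.executions += 1
            self._table = set(self._produce_rows())
        return value in self._table

    def size(self) -> int:
        return 0 if self._table is None else len(self._table)


def subquery() -> Iterable[int]:
    """Stands in for something like SELECT customer_id FROM orders (made-up data)."""
    yield from [7, 3, 7, 9, 3]


orders_customers = MaterializedSubquery(subquery)
customers = [1, 3, 7, 8, 9]

# Roughly: SELECT id FROM customers WHERE id IN (<subquery>)
matched = [c for c in customers if orders_customers.contains(c)]
print(matched, orders_customers.executions, orders_customers.size())
```

The outer query probes the subquery five times, but the subquery itself runs once, and nothing here is written to disk.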

The video in question.

I would advise ignoring this section of the video. He defines "materialization" as the step from "physical pages to storage devices". That is not a definition of the term used in any practice.

  • For the purposes of a DBA, "materialization" means what I said above.
  • For the purposes of chip design and kernel design, there is a difference between Physical Pages and Virtual Pages.

    • Physical Pages are in hardware and contiguous. A "page" is one unit of CPU-accessible memory (4096 bytes on x86). It's called a "Physical Page" because you often can't access it directly, and it's not exposed to you at any level.
    • A logical page is one unit of accessible memory in software that the kernel can get access to.

On the x86 platform, for instance, paging is turned on in Long Mode (x86_64) and you can't access Physical Memory without going through a "Logical Page." He calls the process of resolution from Logical Pages to Physical Pages "devirtualization". I've never heard of that, and likely neither has he, simply because it's a function of the microcode on the CPU and that's all proprietary, so naming the individual processes that it performs is somewhat useless. For the same reason "materialization" seems equally useless, and it's even more confusing. If you know the physical page, how could you not know the storage device? If it's physical, it's not abstracted further. What does the logical<->physical mapping mean in this scheme? It's foreign to me anyway, but maybe it's Oracle parlance or something.

Chosen Answer

In database research the term "materialization" denotes any form of data storage, i.e. any operation that actually sets some bytes on any storage layer eventually. Examples include a deep copy, memory allocation, replication, materialized views (rather than dynamic views), intra-pipeline materializations, but also any form of (partial) copies along the storage hierarchy.

That is untrue in any research or database parlance, and I would challenge the statement that all uses of malloc, memcpy, or, as the author goes on to say, fork with (or without) COW are "Materialization". They may be a "part" of materialization if they otherwise refer to the mechanism I describe above.


My video is about the different mappings that have to be made in order to map relations all the way down to hardware. In practice, these different mappings and linearization steps are often confused and mixed up. This is unfortunate as the different, often hard-coded, decisions taken for certain mapping steps then may hinder query performance later on. Sometimes a simple change in one of these mappings may lead to a completely new product line (e.g. "column stores", PAX/Parquet).

In database research the term "materialization" denotes any form of data storage, i.e. any operation that actually sets some bytes on any storage layer eventually. Examples include a deep copy, memory allocation (not to be confused with malloc()), replication, materialized views (rather than dynamic views), intra-pipeline materializations, but also any form of (partial) copies along the storage hierarchy.

In the video, I introduce (and simplify) the different mapping steps. A simplified view of the world in a database is that everything eventually gets stored to physical pages. "Physical pages" is a fixed term in database research. But make sure you understand that it is merely an abstraction. It is a storage unit in a DBMS. We can safely ignore what happens with those physical pages (for the moment) when discussing certain concepts (like query processing). That is what I do at 9:26 in the video, as this is not a course on hardware: I say the data from physical pages gets materialized to storage devices. Again: the latter is a much longer story, e.g. factor in ACID, in particular the "D", recovery, CC, ...

But note that physical pages are not the same as physical memory. Rather, a physical page is mapped either to a main-memory page (which is almost always a virtual page provided by virtual memory) or to some other device, e.g. pages on a hard disk or SSD. Most devices are virtualized inside as well, e.g. some SSDs use RAID 5 inside.

Of course with virtual memory, snapshotting, and different forms of storage indirection the term is sometimes a bit hard to understand. Sometimes you believe that you materialize, but...

For instance, assume you fork a child process in Unix. Looks like the child process has physical copies of the data, right? No, it hasn't. Only through copy-on-write will you receive physical copies. So, sometimes the boundary between materialize and not materialize gets blurred.

Hope that helps.
