Do natural keys provide higher or lower performance in SQL Server than surrogate integer keys?

In general, SQL Server uses B+Trees for indexes. The expense of an index seek is directly related to the length of the key in this storage format. Hence, a surrogate key usually outperforms a natural key on index seeks.

SQL Server clusters a table on the primary key by default. The clustered index key is used to identify rows, so it gets added as an included column to every other index. The wider that key, the larger every secondary index.

Even worse, if a secondary index is not explicitly defined as UNIQUE, the clustered index key automatically becomes part of that index's key as well. That applies to most indexes in practice, since indexes are usually declared unique only when the requirement is to enforce uniqueness.
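As an illustration of how the clustered key width propagates (table and index names here are hypothetical, not from the question):

```sql
-- Natural clustered key: NVARCHAR(40), up to 80 bytes per row.
CREATE TABLE CustomerNatural (
    CustomerName  NVARCHAR(40) NOT NULL PRIMARY KEY CLUSTERED,
    City          NVARCHAR(30) NOT NULL
);
-- This non-unique secondary index silently stores CustomerName too,
-- and gets it appended to its own key.
CREATE INDEX IX_CustomerNatural_City ON CustomerNatural (City);

-- Surrogate clustered key: 4-byte INT.
CREATE TABLE CustomerSurrogate (
    CustomerID    INT IDENTITY PRIMARY KEY CLUSTERED,
    CustomerName  NVARCHAR(40) NOT NULL,
    City          NVARCHAR(30) NOT NULL
);
-- The equivalent secondary index now carries only 4 extra bytes per row.
CREATE INDEX IX_CustomerSurrogate_City ON CustomerSurrogate (City);
```

The more secondary indexes the table has, the more those extra bytes per row multiply.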

So if the question is, natural versus surrogate clustered index, the surrogate will almost always win.

On the other hand, adding that surrogate column makes the table itself bigger, which makes clustered index scans more expensive. So, if you have only very few secondary indexes and your workload often requires looking at all (or most of) the rows, you might actually be better off with a natural key, saving those few extra bytes.

Finally, natural keys often make it easier to understand the data model. While they use more storage space, natural primary keys lead to natural foreign keys, which in turn increase local information density.
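For instance (the schema below is a hypothetical sketch), a natural foreign key can make rows self-describing:

```sql
-- With a natural key, the referencing row already carries meaningful
-- data; no join is needed just to read it.
CREATE TABLE Country (
    CountryCode CHAR(2) NOT NULL PRIMARY KEY   -- e.g. 'DE', 'FR'
);

CREATE TABLE Customer (
    CustomerID  INT IDENTITY PRIMARY KEY,
    CountryCode CHAR(2) NOT NULL
        REFERENCES Country (CountryCode)
);

-- SELECT CustomerID, CountryCode FROM Customer;
-- returns human-readable countries without joining to Country.
```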

So, as so often in the database world, the real answer is "it depends". And - always test in your own environment with realistic data.


I believe that the best answer lies in the middle.

Natural keys overview:

  1. They make the data model more obvious, because they come from the subject area and not from somebody's head.
  2. Simple keys (one column, between CHAR(4) and CHAR(20)) save some extra bytes, but you need to watch their consistency (ON UPDATE CASCADE becomes critical for keys that might change).
  3. In many cases natural keys are complex: they consist of two or more columns. If such a key migrates to another entity as a foreign key, it adds data overhead (indexes and data columns grow larger) and costs performance.
  4. If the key is a large string, it will probably always lose to an integer key, because a simple search condition becomes a byte-array comparison in the database engine, which in most cases is slower than an integer comparison.
  5. If the key is a multilanguage string, you also need to watch the collations.
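A minimal sketch of point 2 (table and column names are made up): if a changeable natural key migrates to another table, cascading updates keep the references consistent:

```sql
CREATE TABLE Product (
    ProductCode CHAR(10) NOT NULL PRIMARY KEY   -- natural key; may be renumbered
);

CREATE TABLE OrderLine (
    OrderID     INT      NOT NULL,
    ProductCode CHAR(10) NOT NULL
        REFERENCES Product (ProductCode)
        ON UPDATE CASCADE,   -- critical: propagates ProductCode changes
    PRIMARY KEY (OrderID, ProductCode)
);
```

Without the cascade, every rename of a ProductCode would fail (or orphan rows, if the constraint were missing entirely).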

Benefits: 1 and 2.

Watchouts: 3, 4 and 5.


Artificial identity keys overview:

  1. You do not need to bother with their creation and handling (in most cases), as this feature is handled by the database engine. They are unique by default and do not take a lot of space. Extra measures like ON UPDATE CASCADE can be omitted, because the key values do not change.

  2. They are (often) the best candidates for migration as foreign keys because:

    2.1. they consist of one column;

    2.2. they use a simple, compact type that is fast in comparison operations.

  3. For association entities, whose keys do not migrate anywhere, a surrogate can become pure data overhead, as its usefulness is lost. A composite natural primary key (if it contains no string columns) will be more useful there.

Benefits: 1 and 2.

Watchouts: 3.
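Point 3 above can be sketched as follows (a hypothetical junction table; all names are invented for illustration):

```sql
CREATE TABLE Student (
    StudentID INT IDENTITY PRIMARY KEY,
    Name      NVARCHAR(50) NOT NULL
);

CREATE TABLE Course (
    CourseID  INT IDENTITY PRIMARY KEY,
    Title     NVARCHAR(50) NOT NULL
);

-- Association entity: its key migrates nowhere, so an extra
-- IDENTITY column here would be pure overhead. The composite
-- natural key (two integers) is enough.
CREATE TABLE StudentCourse (
    StudentID INT NOT NULL REFERENCES Student (StudentID),
    CourseID  INT NOT NULL REFERENCES Course  (CourseID),
    PRIMARY KEY (StudentID, CourseID)
);
```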


CONCLUSION:

Artificial keys are more maintainable, reliable, and fast because they were designed for these features. But in some cases they are not needed. For example, a single CHAR(4) candidate column in most cases behaves like INT IDENTITY. So there is another question here as well: maintainability + stability, or obviousness?

The question "Should I inject an artificial key or not?" always depends on the natural key's structure:

  • If it contains a large string, it will be slower and will add data overhead when migrating as a foreign key to another entity.
  • If it consists of multiple columns, it will likewise be slower and will add data overhead when migrating as a foreign key to another entity.

A key is a logical feature of a database whereas performance is always determined by physical implementation in storage and by physical operations run against that implementation. It's therefore a mistake to attribute performance characteristics to keys.

In this particular example, however, two possible implementations of tables and queries are compared to each other. The example does not answer the question being posed in the title here. The comparison being made is of joins using two different datatypes (integer and character) using just one type of index (B-tree). An "obvious" point is that if a hash index or another type of index had been used, there would quite possibly be no measurable performance difference between the two implementations. There are more fundamental problems with the example, however.

Two queries are being compared for performance but the two queries are not logically equivalent because they return different results! A more realistic test would compare two queries returning the same results but using different implementations.

The essential point about a surrogate key is that it is an extra attribute in a table where the table also has "meaningful" key attributes used in the business domain. It is the non-surrogate attributes that are of interest for query results to be useful. A realistic test therefore would compare tables using only natural keys with an alternative implementation having both natural and surrogate keys in the same table. Surrogate keys typically require additional storage and indexing and by definition require additional uniqueness constraints. Surrogates require additional processing to map the external natural key values onto their surrogates and vice versa.
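To make that concrete, the two implementations being compared might look like this (a hypothetical sketch using the column names from the queries below):

```sql
-- Natural-key-only implementation.
CREATE TABLE Table1_Natural (
    NaturalTable1Key VARCHAR(20) NOT NULL PRIMARY KEY
);

-- Implementation with both keys: the surrogate becomes the primary
-- key, but the natural key still needs its own uniqueness constraint
-- (and the supporting index) to remain a key at all.
CREATE TABLE Table1_Surrogate (
    IDTable1Key      INT IDENTITY PRIMARY KEY,        -- surrogate
    NaturalTable1Key VARCHAR(20) NOT NULL UNIQUE      -- natural, still enforced
);
```

The second table carries the extra column, the extra constraint, and the extra index, which is precisely the additional storage and processing described above.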

Now compare this potential query:

A.

SELECT t2.NaturalTable2Key, t2.NaturalTable1Key
FROM Table2 t2;

To its logical equivalent if the NaturalTable1Key attribute in Table2 is replaced with the surrogate IDTable1Key:

B.

SELECT t2.NaturalTable2Key, t1.NaturalTable1Key
FROM Table2 t2
INNER JOIN Table1 t1
ON t1.IDTable1Key = t2.IDTable1Key;

Query B requires a join; Query A does not. This is a familiar situation in databases that (over)use surrogates. Queries become needlessly complex and much harder to optimise. Business logic (especially data integrity constraints) becomes more difficult to implement, test and verify.