When should a primary key be meaningful?

under what circumstances, if ever, is preferable to use a primary key with some other real meaning? (emphasis added)

Given that the focus of this question is "preferable" and not "acceptable", and accepting that this is still a highly subjective topic, I will say that I cannot think of a situation where it is best for the system to have a truly natural key for a variety of reasons (most of which has been said before in other answers that Paul linked to in a comment on the question):

What is thought to be unique is not always unique (e.g. Social Security Numbers / SSNs in the U.S.)
Sometimes things change, either in value or uniqueness (we don't control the external world)
Even when something should be "stable" in value and uniqueness (e.g. SKU, perhaps), can the incoming value be guaranteed to be correct? Humans often mistype stuff doing data entry. There are also bugs in export processes that might cause values in a file imported by your system to be incorrect. There are also bugs in other systems that feed data into yours that can allow for the data itself to not be entirely correct, even if their export process worked correctly.

I emphasize "truly" because there are two situations where I prefer to not have a new surrogate key:

Bridge Tables

(or whatever you like to call tables used only, or mainly, to represent many-to-many relationships)

Thing                          ThingXTag                Tag
------                         ---------                ---
ThingID INT AutoMagic PK --->  ThingID INT PK, FK      
Stuff   SomeType               TagID   INT PK, FK  <--- TagID   INT AutoMagic PK
                                                        TagName VARCHAR

When modeling a bridge table (a table which doesn't exist in the logical model but is needed in the physical model), the PK there should be the existing Primary Key columns of the tables being related via this table. This allows for enforcing the proper uniqueness and non-NULL-ness of the values without needing a separate unique index / constraint. In the rare case of needing to foreign key to this relationship, it will:

be meaningful in that each of the key columns will actually point back to the original source tables without needing yet another join, and
prevent someone or something from updating the key columns that form the relationship between the two primary tables without updating the value being linked to via a foreign key.

WackyTable                           ThingXTag
----------                           ---------
WackyTableID INT AutoMagic PK
ThingID      INT FK            --->  ThingID     INT PK, FK (to Thing.ThingID)
TagID        INT FK            --->  TagID       INT PK, FK (to Tag.TagID)
AttributeX   VARCHAR                 InsertDate  DATETIME
InsertDate   DATETIME

I have worked on a system where these bridge tables had their own auto-incrementing, surrogate key PKs, and the single column surrogate key of the bridge table was referenced in other tables via FK:

WackyTable                           ThingXTag
----------                           ---------
WackyTableID INT AutoMagic PK        
ThingXTagID  INT FK            --->  ThingXTagID INT AutoMagic PK
AttributeX   VARCHAR                 ThingID     INT FK (to Thing.ThingID)
InsertDate   DATETIME                TagID       INT FK (to Tag.TagID)
                                     InsertDate  DATETIME

It was a horrible, confusing mess that we wasted way too much time on for debugging, etc.

Sibling Tables

These are tables that are truly a single entity and thus have a 1-to-1 relationship. They are only split into two (or more, as is needed) tables for performance reasons. I have done this with tables having 1 million (or more) rows where it was either very wide, or if it was moderately wide and there were columns that were either not used very frequently or were string of over 50 bytes. Stuff like that. This keeps the core properties of the entity in a narrower table that fits more rows on each data page.

In these cases, the "sibling" table is at the exact same level as the initial table and should have the same PK. There is no use in giving it an auto-incrementing surrogate key since each row has what amounts to a natural key from the initial table.

Product                            ProductProperty
-------                            ---------------
ProductID  INT AutoMagic PK  --->  ProductID        INT PK, FK (to Product.ProductID)
Name       VARCHAR                 ShortDescription VARCHAR
SKU        VARCHAR                 SomethingElse    SomeType
Quantity   INT                     UpdateDate       DATETIME
CreateDate DATETIME
UpdateDate DATETIME

To be clear, I am speaking in terms of the physical model, not the conceptual model. I assumed that the focus of this question was the physical model since it is framed in the context of issues that do not exist conceptually: surrogate keys, issues dealing with usage of the primary key value, etc. With that in mind, I was not implying that natural keys should not be stored and used for identification. On the contrary, natural keys are great "alternate keys" and should have unique constraints / indexes placed on them. However, the idealism of the conceptual model does not always translate directly to the physical model. Data integrity (i.e. the stability and reliability of the data model) is a top, if not the top, priority of the physical model. And so practical considerations, such as using surrogate keys, must be made to ensure that this goal is met and not compromised. Meaning, if you have SSNs or SKUs, etc, then absolutely store them in a column that has a unique constraint on it, and have the system do a lookup on that value since auto-generated numbers shouldn't be used externally anyway. Users don't need to know the auto-generated ID number of a record: they should pass in a value that they do know (e.g. email address as a lookup for UserID / CustomerID, flight confirmation codes in combination with flight dates, etc) and the system should translate that into the auto-generated value that it uses from that point forward.

Yes, the problems noted at the beginning of this answer are still potential issues when using natural keys as alternate keys. However, the difference is that the problem is isolated to (usually) just one table. If someone made a mistake and created a unique index on "flight locator", it might be a while before they get a violation. But once they do, it is easy enough to drop that unique index and recreate it to include the flight dates as well. Or, if you change your email address (often used as the Login) on a system and get an error because it was in use by someone else years ago (legitimately), that can most likely be handled by support without any impact / risk to existing related records. In both cases the rest of the data model is untouched for the necessary changes.

Again, this is a pragmatic approach to:

minimizing potential data loss that can happen when migrating a primary key that has been referenced by one or more foreign keys (and minimizing the scope of the project also reduces the maintenance window :-). While not all PKs (or Unique Constraints/Indexes) have FKs, using a surrogate should certainly reduce the number of columns that have such dependencies
making the system as durable, resilient, and efficient as possible. The physical model is the "in practice" to the "in theory" of the conceptual model. And we already make quite a few adjustments and considerations given that the conceptual model doesn't care about implementation. Yet we make choices to utilize vendor-specific features, for performance (e.g. denormalization, lookup tables, "sibling" tables as noted above, etc), to create the "bridge" tables (as noted above), and so on.

I don't know how many systems used SSNs (Social Security Numbers in the U.S.) as PKs, but for any that did, some of them (many perhaps) may have avoided issues with them not being as unique as they should have been. But, none of those systems were able to avoid changes over the years regarding the need to handle them much more securely. Systems treating SSNs as an alternate key require very little development time to switch over to encrypting those values, and the system requires little downtime (or none) to make the changes at the data layer. Given that we all have backlogs of projects that we will likely never get to, businesses tend to prefer that these annoying yet inevitable changes will cost them 5 hours instead of 20 - 40 (don't forget that changes need to be tested, and hence the scope of the changes also directly impacts how much QA time is needed for the project).

To be explicit, there are some scenarios where it is "acceptable" to have natural keys, though I don't think I would go so far as to say "preferable".

State / Region and Country codes: If you use codes maintained by the International Organization for Standardization (ISO) (e.g. "US" = United States, "FL" = Florida, etc) then these are probably reliable enough to use, and I have used them as they are short (i.e. not bad for performance) and human-readable. These codes are also used often-enough in a variety of other ways, even outside of computer systems, that people have a general familiarity with them, even ones that don't initially make as much sense to some folks (e.g. "DE" = Germany might not be intuitively obvious to those who are unaware that "Germany" == "Deutschland" in German).
Internal lookup value codes: You can't control external sources, but (hopefully) you are in control of your own system. If your system has department codes, status codes, etc that are used internally, then it should be fine to come up with short codes for them (2 - 4 bytes). At 4 bytes it would be using the same amount of space as an INT and if using a binary Collation (one ending in _BIN2, or even _BIN, but _BIN2 is preferred) then it should compare just as fast. Having relatively meaningful values for such codes can make support / debugging easier. However, you still get into the situation where over time, department names, etc might change and the codes might not be meaningful anymore.

All keys of a table identify rows in that table. DBMSs will enforce uniqueness constraints on keys (AKA candidate keys) which guarantees their uniqueness and thereby guarantees that rows are identifiable by their key attributes. Your second paragraph suggests a misconception that only the primary key performs the role of identification.

To make effective use of a database its users normally need some way of relating information in the database to real world objects or concepts. Keys enable them to do that by ensuring that facts in the database can be individually identified and related back to the real world things they describe. For a database to model the real world accurately, all, or almost all, tables will require "natural" keys - keys that are meaningful, in the sense that they are found and used and have relevance outside the database and not just within it.

A primary key is not a special type of key. It is one of the keys of a table. If a table has only one key then a choice of primary key does not arise. If a table has more than one key then a "primary" key is usually designated for reasons of convention or preference but different individuals' reasons for choosing a primary key are various, subjective and sometimes contradictory. In all cases the choice of the keys is what matters; how you use the keys matters; the choice of a "primary" key is generally far less important.

My opinion is "almost never". If something has meaning, then there's the potential for change if reality ever diverges from the data, and then you have to deal with cascading updates in spite of the foreign keys, and other such messiness. Probably at least 99% of the time, I opt for adding a surrogate identity column as the primary key. The remaining 1% of the time generally consists of related tables that have composite keys made up of some combination of surrogate keys from other tables and/or time stamps (and perhaps an identity column of its own in rare cases where it's appropriate).

If you have natural keys where you still want to enforce uniqueness and convey the semantics to those working with the data, consider creating additional unique, alternate key indexes. I usually prefix the index names with AK_ to make the intent obvious. The index will benefit lookup queries, maintain data integrity, and make it clear that only one row should be expected from a query on a specific key value.

Meanwhile, all of your sales documents can be joined based on the meaningless surrogate key, and you don't have to deal with cascading name changes all over the place if one of the ladies in sales gets married (been down that road many times).

When should a primary key be meaningful?

Tags:

Database Design

Primary Key

Candidate Key

Related

Recent Posts