Character vs Integer primary keys

It depends on your engine. The common wisdom is that reads are cheap; a few bytes here and there will not significantly impact the performance of a small to medium-sized database.

More importantly, it depends on the uses to which you will put the primary key. Integer serials have the advantage of being simple to use and implement. They also, depending on the specific sequence implementation, have the advantage of being quickly derivable: most databases just store the next serial number in a fixed location, rather than deriving it with `SELECT MAX(ID)+1 FROM foo` on the fly.
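For instance, in PostgreSQL (one concrete engine; others use AUTO_INCREMENT or IDENTITY instead), a serial column is backed by a sequence, so the next key is handed out in constant time. Table and column names here are purely illustrative:

```sql
-- PostgreSQL syntax as an example; a serial column is backed by a sequence,
-- not by scanning the table for MAX(id)+1.
CREATE TABLE recipe (
    id   serial PRIMARY KEY,  -- next value fetched from a sequence in O(1)
    name text NOT NULL
);

INSERT INTO recipe (name) VALUES ('Mapo Tofu');  -- id assigned automatically
```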

The question becomes: how does a 5-character key present a "meaningful value" to you and to the application? How is this value created, and does it take more or less time than fetching an incrementing serial number? While integers may save a trivial amount of space, the vast majority of systems will never notice the savings.

There are no performance implications, save that the character scheme rules out auto-generation by the engine, since your "keys" cannot be derived automatically. For your specific domain, don't bother with artificial keys; just use "Chinese", "Japanese", and "Thai" as the key values. While you cannot guarantee uniqueness over every possible application, within your scope it is much more reasonable to use them than horrible, forced 5-character abbreviations. There are no significant performance impacts until you get to millions of tuples.
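A minimal sketch of that natural-key approach, with illustrative PostgreSQL table and column names:

```sql
-- Natural keys: the cuisine name itself is the key; no artificial code needed.
CREATE TABLE cuisine (
    name varchar(30) PRIMARY KEY   -- 'Chinese', 'Japanese', 'Thai', ...
);

CREATE TABLE recipe (
    id           serial PRIMARY KEY,
    title        text NOT NULL,
    cuisine_name varchar(30) NOT NULL REFERENCES cuisine (name)
);
```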

Alternatively, if you're just tracking by country of origin, and not specific regional cuisines (Cantonese, Sichuan, Sicilian, Umbrian, Calabrian, Yucatecan, Oaxacan, etc.), you could always just use ISO 3166 codes.
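If you go that route, the lookup table could be as simple as this sketch (the two-letter codes come from ISO 3166-1 alpha-2; the table layout is illustrative):

```sql
-- ISO 3166-1 alpha-2 codes as natural primary keys for country of origin.
CREATE TABLE country (
    code char(2) PRIMARY KEY,   -- standardized, stable, human-readable
    name text NOT NULL
);

INSERT INTO country (code, name) VALUES
    ('CN', 'China'),
    ('JP', 'Japan'),
    ('TH', 'Thailand'),
    ('IT', 'Italy'),
    ('MX', 'Mexico');
```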

If I have 10,000 recipes, doesn't the difference between a 5-character and a 20-character key start to add up?

Space is cheap. When you're talking about 10,000,000 recipes that you're running OLAP operations on, then, maybe. With 10k recipes, the 15 extra bytes per row come to roughly 150 KB of space.

But again, it depends. If you have many millions of records and are doing joins on them, then it makes sense to denormalize a lookup this trivial (into a materialized view, for example). For all practical purposes, the join efficiency on a modern machine between a 5-character key and a variable-length key is close enough to identical. Happily, we live in a world of plentiful CPU and plentiful disk. The real performance killers are too many joins and inefficient queries, not character-by-character key comparison. That said, always test.
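As an illustrative sketch of that denormalization, reusing the cuisine/recipe tables from the earlier natural-key example (PostgreSQL syntax):

```sql
-- Precompute the recipe-to-cuisine join once so repeated reads skip it.
CREATE MATERIALIZED VIEW recipe_flat AS
SELECT r.id, r.title, c.name AS cuisine
FROM   recipe r
JOIN   cuisine c ON c.name = r.cuisine_name;

-- Refresh after bulk loads; queries then hit the flattened copy.
REFRESH MATERIALIZED VIEW recipe_flat;
```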

Performance-and-tuning questions at this level are so database-dependent that generalizations are extremely difficult. Build two sample models of the database, populate them with the estimated numbers of records, and see which one is faster. In my experience, key length doesn't make a huge difference compared with good indexes, good memory configuration, and the other critical performance-tuning elements.
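The test itself can be as simple as loading each candidate schema and comparing plans. Here is a sketch in PostgreSQL, using its generate_series function to fake the row count; it assumes the illustrative recipe/cuisine tables above, with a 'Thai' row already present in cuisine:

```sql
-- Populate a candidate schema with a realistic row count.
INSERT INTO recipe (title, cuisine_name)
SELECT 'recipe ' || n, 'Thai'
FROM   generate_series(1, 10000) AS n;

-- EXPLAIN ANALYZE reports the actual execution time; run the same
-- query against each candidate schema and compare.
EXPLAIN ANALYZE
SELECT c.name, count(*)
FROM   recipe r
JOIN   cuisine c ON c.name = r.cuisine_name
GROUP  BY c.name;
```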


I don't think there is a performance problem for a rarely changed table, but you may have design problems in the future. I suggest not using business data as a primary key, because business data changes. Use a separate surrogate primary key to link the tables in your model; that way, business changes will NOT impact the tables related to this one.
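Something like this sketch (names are illustrative): the surrogate id is what other tables reference, so the business value is free to change without touching them:

```sql
-- A surrogate key decouples relationships from business data.
CREATE TABLE cuisine (
    id   serial PRIMARY KEY,    -- meaningless, stable surrogate
    name text NOT NULL UNIQUE   -- business value, free to change
);

CREATE TABLE recipe (
    id         serial PRIMARY KEY,
    title      text NOT NULL,
    cuisine_id integer NOT NULL REFERENCES cuisine (id)
);

-- Renaming the cuisine touches one row; recipe rows keep their foreign keys.
UPDATE cuisine SET name = 'Thai' WHERE name = 'Siamese';
```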


The real question is whether DB query performance is at all significant for your application (data size). If your query takes microseconds, saving a few of those microseconds by using Int keys isn't worth the readability/maintainability penalty. However, if your query takes minutes, then saving some of those minutes may be worth the pain of Int keys.

Below is why I think integers can save you query time (as a percentage of your overall query time), though the SkySpark founders can explain it better than I can. Full disclosure: my employer pays SkySpark a lot of money to use their DB, and I'm trying to build something better/faster.

If you have a lot of sequential data (log files, time series, analytics, text or speech corpora) with links (relationships) to any of your lookup tables, you'll find that storage space is critical for query speed, despite @Ballsun-Stanton's correct analysis of how cheap space is in dollars. Because most query time (for sequential data) is spent reading the disk, space isn't cheap in terms of time (as a percentage of overall query time). So, unless your RDB automatically and efficiently compresses/decompresses all foreign keys (keys to related records), you'll want all your keys to be Int, which are the most efficient in terms of disk space (and read speed) per unit of information content (entropy). FYI, MyISAM in MySQL places restrictions on what you can do with compressed data rows (they become read-only). In other words, automatically incremented integers are already compressed about as much as is theoretically possible, given the small minimum size of most DB integer fields. And that compression comes without any of the following (a concrete sketch follows the list):

  1. query-time compression/decompression penalty
  2. query-time disk read penalty
  3. read-only or other DB restrictions on compressed data records or keys
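To make the disk-read point concrete, here is an illustrative sketch (PostgreSQL types; table and column names are mine, not from any particular product). The foreign key is repeated on every fact row of the sequential table, so its width multiplies directly into the bytes scanned per query:

```sql
-- Lookup table: the FK target is a 4-byte integer.
CREATE TABLE sensor (
    id    serial PRIMARY KEY,
    label text NOT NULL
);

-- Sequential/log data: the FK is stored once per row, so its width
-- dominates on-disk size and therefore sequential read time.
CREATE TABLE reading (
    sensor_id integer NOT NULL REFERENCES sensor (id),  -- 4 bytes per row,
                                   -- vs. ~21+ bytes for a varchar(20) key
    ts        timestamptz NOT NULL,
    value     double precision NOT NULL
);
```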

There's a reason why popular, efficient ORMs like Django's default to auto-incrementing integer PKs, and why other SO questions have come to the same conclusion.