Flag vs table split

a table of items which will (potentially) contain tens of millions of records.

That's actually not that much, given what SQL Server can efficiently handle. At one of my earlier jobs, one of the largest tables (in a single-instance system) had 2 million rows, and that was the most I had ever dealt with. The next job had 17 Production instances, some with tables of hundreds of millions of rows, and all of that was aggregated into a Data Warehouse with multiple fact tables of over 1 billion rows each. Don't get me wrong: I am not scoffing at tens of millions of rows. I am just emphasizing that with a good data model and proper indexing (and index maintenance), SQL Server can handle a lot.

Up to 50% of items may be "unapproved" at any given time.

Hmm. That doesn't sound right. Will the rate of "approving" entries really be half the rate of getting new entries, so that for every 2 new entries only 1 is "approved"? Taking your example of 2 million rows, with 1 million each "approved" and "unapproved": a few years later, after another 10 million entries, do you expect 6 million each for "approved" and "unapproved"? Or will the 1 million "unapproved" remain roughly constant, so that after 10 million new entries there are 11 million "approved" and still 1 million "unapproved"?

Records may become "approved", but not vice versa.

That is true today, but things change over time, so there is always the possibility that the business could decide to allow "unapproving", or add some other status such as "archived", etc.

So, let's look at the choices:

Flag (or possibly even TINYINT "status")

  • Slightly slower for queries of each status
  • More flexible over time: a change such as a third state (e.g. "Archived") needs only a new Lookup status value. No new table is (necessarily) required, and only some code is added or updated.
  • Less work (code, testing, etc.) and less room for error, since a status change updates a single TINYINT column
  • Less complicated = lower maintenance costs over time, shorter training time for new employees to figure it out
  • (possibly) Smaller impact on the Transaction Log, as only one table is updated
  • Just needs a Lookup table for "RecordStatus" and an FK between the two tables.
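
As a rough illustration of the single-table option, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a runnable stand-in for SQL Server. The table and column names (Items, Payload, RecordStatus, StatusID), the status values 1 and 2, and the helper functions are all made up for the example. The partial index is SQLite's counterpart to a SQL Server filtered index.

```python
import sqlite3


def build_single_table_demo():
    """Create one Items table plus a RecordStatus lookup table (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
    conn.executescript("""
        CREATE TABLE RecordStatus (
            StatusID INTEGER PRIMARY KEY,
            Name     TEXT NOT NULL UNIQUE
        );
        INSERT INTO RecordStatus (StatusID, Name)
        VALUES (1, 'Unapproved'), (2, 'Approved');

        CREATE TABLE Items (
            ItemID   INTEGER PRIMARY KEY,
            Payload  TEXT NOT NULL,
            StatusID INTEGER NOT NULL DEFAULT 1
                     REFERENCES RecordStatus (StatusID)
        );

        -- Index that only contains the unapproved rows.
        CREATE INDEX IX_Items_Unapproved ON Items (ItemID) WHERE StatusID = 1;
    """)
    return conn


def approve(conn, item_id):
    """Approve one item with a single-row, single-column UPDATE."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE Items SET StatusID = 2 WHERE ItemID = ? AND StatusID = 1",
            (item_id,),
        )
    return cur.rowcount == 1  # False if the item was missing or already approved


def count_by_status(conn):
    """Return {status name: row count} for the statuses currently in use."""
    rows = conn.execute(
        "SELECT s.Name, COUNT(*) FROM Items AS i "
        "JOIN RecordStatus AS s ON s.StatusID = i.StatusID "
        "GROUP BY s.Name"
    ).fetchall()
    return dict(rows)
```

Adding a third state later is then one more row in RecordStatus (say, a hypothetical StatusID 3 for 'Archived') plus whatever code needs to know about it. The ItemID never changes.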

Two separate tables (one for "approved", one for "unapproved")

  • Slightly faster for queries of each status
  • Less flexible over time: a change such as a third state (e.g. "Archived") would most likely require another table, and definitely new and updated code.
  • More work (code, testing, etc.) and more room for error when moving records from the "Unapproved" table to the "Approved" table
  • More complicated = higher maintenance costs over time, longer training time for new employees to figure it out
  • (possibly) Greater impact on the Transaction Log, as each approval deletes a row from one table and inserts it into the other
  • No need to worry about "renewal of item's ID": the Unapproved table's ID column is an IDENTITY column, and the Approved table's ID column is not (it is not needed there). Hence ID values remain consistent as a record moves between tables.
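
For comparison, a minimal sketch of what "approving" looks like with two tables. Again, sqlite3 is only a runnable stand-in for SQL Server, and the names (UnapprovedItems, ApprovedItems, Payload, approve_by_moving) are invented for the example. AUTOINCREMENT plays the role of IDENTITY here, in that generated IDs are not reused after a row is deleted.

```python
import sqlite3


def build_two_table_demo():
    """Create separate tables for unapproved and approved items (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- IDs are generated here only; AUTOINCREMENT stands in for IDENTITY.
        CREATE TABLE UnapprovedItems (
            ItemID  INTEGER PRIMARY KEY AUTOINCREMENT,
            Payload TEXT NOT NULL
        );
        -- No ID generation here: rows arrive with the ID they already have.
        CREATE TABLE ApprovedItems (
            ItemID  INTEGER PRIMARY KEY,
            Payload TEXT NOT NULL
        );
    """)
    return conn


def approve_by_moving(conn, item_id):
    """Move one row from UnapprovedItems to ApprovedItems, keeping its ItemID."""
    with conn:  # one transaction: the INSERT and DELETE commit together or not at all
        cur = conn.execute(
            "INSERT INTO ApprovedItems (ItemID, Payload) "
            "SELECT ItemID, Payload FROM UnapprovedItems WHERE ItemID = ?",
            (item_id,),
        )
        if cur.rowcount != 1:
            return False  # nothing to move
        conn.execute("DELETE FROM UnapprovedItems WHERE ItemID = ?", (item_id,))
    return True
```

Note that every approval is now two writes that must stay in one transaction, which is where the extra code, testing, and log activity come from.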

Personally, I would lean towards the single table with a StatusID column to start with. Using two tables seems like an over-complicated, premature optimization. That type of optimization can be discussed if / when the number of records reaches several hundred million and indexing no longer provides sufficient performance gains.


You can have it both ways with partitioned views.

You create an underlying table for each status, each with a CHECK constraint so that the status values are mutually exclusive across the tables. Then you create a view which combines the underlying tables with UNION ALL. Either the view or each base table can be referenced explicitly. If a row's status is UPDATEd through the view, the DBMS will DELETE it from one base table and INSERT it into the one corresponding to the new status. Each base table can be indexed independently according to its usage pattern. The optimiser will resolve index references to a single corresponding base table if it can.
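
To make the shape of this concrete, here is a rough emulation using Python's sqlite3 module. This is an assumption-heavy sketch, not SQL Server syntax: SQLite has no native updatable partitioned views, so the INSTEAD OF trigger below does by hand the row movement that SQL Server would perform itself. All names (Items_Unapproved, Items_Approved, Items, Payload) and the status values 1 and 2 are hypothetical. The status column is part of each primary key because SQL Server requires the partitioning column to be part of the key for the view to be updatable.

```python
import sqlite3


def build_partitioned_view_demo():
    """One base table per status, a UNION ALL view over them, and a trigger
    that emulates moving a row when its status is updated through the view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Items_Unapproved (
            ItemID   INTEGER NOT NULL,
            StatusID INTEGER NOT NULL CHECK (StatusID = 1),
            Payload  TEXT NOT NULL,
            PRIMARY KEY (ItemID, StatusID)
        );
        CREATE TABLE Items_Approved (
            ItemID   INTEGER NOT NULL,
            StatusID INTEGER NOT NULL CHECK (StatusID = 2),
            Payload  TEXT NOT NULL,
            PRIMARY KEY (ItemID, StatusID)
        );

        -- The application reads (and updates) this view as if it were one table.
        CREATE VIEW Items AS
            SELECT ItemID, StatusID, Payload FROM Items_Unapproved
            UNION ALL
            SELECT ItemID, StatusID, Payload FROM Items_Approved;

        -- Emulation only: move the row between base tables on approval.
        CREATE TRIGGER Items_Approve
        INSTEAD OF UPDATE OF StatusID ON Items
        WHEN OLD.StatusID = 1 AND NEW.StatusID = 2
        BEGIN
            DELETE FROM Items_Unapproved WHERE ItemID = OLD.ItemID;
            INSERT INTO Items_Approved (ItemID, StatusID, Payload)
            VALUES (OLD.ItemID, 2, OLD.Payload);
        END;
    """)
    return conn


def approve_through_view(conn, item_id):
    """Approve an item by updating the view; the trigger moves the row."""
    with conn:
        conn.execute("UPDATE Items SET StatusID = 2 WHERE ItemID = ?", (item_id,))
```

The application only ever issues the UPDATE against Items. The DELETE and INSERT it causes underneath are the data movement that this approach pays for.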

The benefits are:
a) Shallower indexes. Do the math on the index fan-out, however: at that scale, and with that split between your status values, it is possible the indexes will be the same depth on the split tables as they would be on the combined table.
b) No application code has to change. The data continues to appear as a continuous whole.
c) Future status values can be included by adding a new base table, with its constraint, and re-creating the view.
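
The fan-out math in (a) can be sketched as below. The figures of 100 rows per leaf page and 400 entries per non-leaf page are purely illustrative assumptions; real values depend on row width, key width, and fill factor, so plug in your own.

```python
import math


def btree_levels(row_count, rows_per_leaf, fanout):
    """Estimate the number of B-tree levels (leaf level and root included).

    rows_per_leaf: rows that fit on one leaf page (assumed figure).
    fanout: child pointers that fit on one non-leaf page (assumed figure).
    """
    pages = max(1, math.ceil(row_count / rows_per_leaf))  # leaf-level pages
    levels = 1
    while pages > 1:
        pages = math.ceil(pages / fanout)  # pages needed one level up
        levels += 1
    return levels


if __name__ == "__main__":
    for total in (20_000_000, 50_000_000):
        combined = btree_levels(total, rows_per_leaf=100, fanout=400)
        split = btree_levels(total // 2, rows_per_leaf=100, fanout=400)
        print(f"{total:>11,} rows: combined={combined} levels, each half={split} levels")
```

With these assumed figures, a 50 million row table and each of its 25 million row halves both come out at 4 levels, while 20 million rows gives 4 levels against 3 for each 10 million row half. Whether splitting actually saves a level depends on where your numbers fall.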

The cost is all that data movement: two pages, plus the associated index pages, are written for each status update. That is a lot of IO to deal with. That much movement will cause fragmentation, too.