Should I add an auto-increment / IDENTITY field to a cross-reference table just for PK purposes?

One thing to consider is that a Primary Key and a Clustered Index are not the same thing. A Primary Key is a constraint and deals with the rules by which the data lives (i.e. data integrity); it has nothing to do with efficiency / performance. A Primary Key requires that the key column(s) be unique (in combination) and NOT NULL (individually). A PK is enforced via a Unique Index, though it can be either Clustered or Non-Clustered.
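To make the distinction concrete, here is a minimal sketch in which the Primary Key and the Clustered Index sit on different columns. The table and constraint names are hypothetical and exist only for illustration.

```sql
-- Hypothetical demo table; names and data types are assumptions.
CREATE TABLE dbo.PkVsClusteredDemo
(
    id         INT NOT NULL,
    company_id INT NOT NULL,
    -- The PK is a constraint, enforced here by a unique NONCLUSTERED index.
    CONSTRAINT PK_PkVsClusteredDemo PRIMARY KEY NONCLUSTERED (id)
);

-- The Clustered Index is created separately and is not required to be unique.
CREATE CLUSTERED INDEX CIX_PkVsClusteredDemo_company_id
    ON dbo.PkVsClusteredDemo (company_id);
```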

A Clustered Index is a means of physically (i.e. on disk) ordering the data in the table and deals with performance; it has nothing to do with data integrity. A Clustered Index can require that the key column(s) be unique (in combination), but it doesn't need to. However, since the Clustered Index is the physical order of the data, it needs to uniquely identify each row no matter what. So if you don't set it to require uniqueness, it will create its own uniqueness via a hidden 4-byte "uniquifier" column. That column is always there in non-Unique Clustered Indexes, but it takes up no space in rows whose key values are not duplicated (i.e. the first row for each combination of key values). To see firsthand how this "uniquifier" column works (both in the Clustered Index and the effect on Non-Clustered Indexes), please check out this test script I posted on PasteBin: T-SQL script to test Uniquifier size.
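As a much smaller stand-in for that script (this is not the PasteBin script itself), the following sketch builds a non-Unique Clustered Index and compares the smallest and largest leaf-level record sizes. The table name and the filler column are assumptions. If the "uniquifier" behaves as described, the rows that duplicate a key value should report a larger record size than the rows that don't.

```sql
-- Hypothetical demo table; only fixed-length columns, so any size
-- difference between rows comes from the hidden "uniquifier".
CREATE TABLE dbo.UniquifierDemo
(
    company_id INT      NOT NULL,
    filler     CHAR(10) NOT NULL
);

-- Non-unique Clustered Index: duplicates of company_id are allowed.
CREATE CLUSTERED INDEX CIX_UniquifierDemo
    ON dbo.UniquifierDemo (company_id);

-- company_id = 1 appears once; company_id = 2 appears three times.
INSERT INTO dbo.UniquifierDemo (company_id, filler)
VALUES (1, 'a'), (2, 'b'), (2, 'c'), (2, 'd');

-- Leaf level (index_level = 0) of the Clustered Index (index_id = 1).
-- DETAILED mode is needed to populate the record-size columns.
SELECT  ips.record_count,
        ips.min_record_size_in_bytes,
        ips.max_record_size_in_bytes
FROM    sys.dm_db_index_physical_stats(
            DB_ID(), OBJECT_ID(N'dbo.UniquifierDemo'), 1, NULL, 'DETAILED') AS ips
WHERE   ips.index_level = 0;
```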

Hence, the main question of:

would it be more efficient to add an auto-increment id field and use that in conjunction with company_id as the primary key, or would it add unnecessary overhead

is conflating those two concepts, so they need to be addressed separately, though there is definitely some overlap.

Should an IDENTITY column be added or would it be unnecessary overhead?

If you add an INT IDENTITY column and use it to create a PK, assuming it would be a Clustered PK, that adds 4 bytes to every row. This column is visible and usable in queries. It could be added to other tables as a Foreign Key, though in this particular case that won't happen.

If you don't add the INT IDENTITY column, then you can't create a PK on this table. However, you can still create a Clustered Index on the table as long as you don't use the UNIQUE option. In this case, SQL Server will add a hidden column called "uniquifier" which behaves as described above. Because the column is hidden, it cannot be used in queries or as a reference for Foreign Keys.

So as far as efficiency goes, these options are roughly the same. Yes, the non-Unique Clustered Index will take up slightly less space, because in some rows (the first row for each key value) the "uniquifier" takes up 0 bytes, while every row in the IDENTITY / PK option carries the 4 bytes. But there won't be enough of the 0-byte rows (especially with the small number of rows expected) to ever notice a difference, let alone outweigh the convenience of being able to use the ID column in queries.
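Side by side, the two options look roughly like this. The table names and the org_path data type are assumptions, since they depend on the actual schema.

```sql
-- Option 1: visible INT IDENTITY column used as a Clustered PK (4 bytes in every row).
CREATE TABLE dbo.CompanyOrgPath_WithIdentity
(
    id         INT IDENTITY(1, 1) NOT NULL,
    company_id INT                NOT NULL,
    org_path   NVARCHAR(1000)     NOT NULL,  -- assumed type / length
    CONSTRAINT PK_CompanyOrgPath_WithIdentity PRIMARY KEY CLUSTERED (id)
);

-- Option 2: no IDENTITY column and no PK; a non-unique Clustered Index instead.
-- SQL Server adds the hidden "uniquifier" to rows that duplicate a company_id.
CREATE TABLE dbo.CompanyOrgPath_NoIdentity
(
    company_id INT            NOT NULL,
    org_path   NVARCHAR(1000) NOT NULL       -- assumed type / length
);

CREATE CLUSTERED INDEX CIX_CompanyOrgPath_NoIdentity
    ON dbo.CompanyOrgPath_NoIdentity (company_id);
```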

INT IDENTITY Column or Hash of org_path Persisted Computed Column?

Given that you won't be looking up rows based on org_path values, it doesn't make sense to add the overhead of the Persisted Computed Column, plus the need to compute that hash in queries in order to match against the Computed Column. (That was my original suggestion, available in the revision history here, which was based on the initial wording / details of the Question.) In this particular case, the INT IDENTITY "ID" Column is probably best.

Key Column Order

Given that the ID Column will rarely, if ever, be used in queries, and given that the two main use-cases are to get either "all rows" or "all rows for a given company_id", I would create the PK on company_id, id. And because this means that rows are not inserted sequentially, I would specify a FILLFACTOR of 90. You will also need to make sure to do regular index maintenance to reduce the fragmentation.
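A sketch of that definition, plus the kind of maintenance statements involved, might look like the following. The table name, constraint name, and org_path data type are assumptions; how often to reorganize or rebuild is something to decide from the fragmentation you actually observe.

```sql
-- Hypothetical table; the Clustered PK leads with company_id so that
-- "all rows for a given company_id" is a single range seek.
CREATE TABLE dbo.CompanyOrgPath
(
    id         INT IDENTITY(1, 1) NOT NULL,
    company_id INT                NOT NULL,
    org_path   NVARCHAR(1000)     NOT NULL,  -- assumed type / length
    CONSTRAINT PK_CompanyOrgPath
        PRIMARY KEY CLUSTERED (company_id, id)
        WITH (FILLFACTOR = 90)
);

-- Check fragmentation of the Clustered Index (index_id = 1).
SELECT  ips.avg_fragmentation_in_percent,
        ips.page_count
FROM    sys.dm_db_index_physical_stats(
            DB_ID(), OBJECT_ID(N'dbo.CompanyOrgPath'), 1, NULL, 'LIMITED') AS ips;

-- Lighter-weight maintenance:
ALTER INDEX PK_CompanyOrgPath ON dbo.CompanyOrgPath REORGANIZE;

-- Full rebuild, re-applying the fill factor:
ALTER INDEX PK_CompanyOrgPath ON dbo.CompanyOrgPath
    REBUILD WITH (FILLFACTOR = 90);
```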

Second Question

does the fact that company_id is the primary key in another table have any effect here

No.

Trigger

Since org_path values within a company_id need to be unique, and a PK on company_id, id does not enforce that, you should still create a Trigger on INSERT, UPDATE to enforce it. In the Trigger, do an IF EXISTS with a query that does a GROUP BY company_id, org_path with HAVING COUNT(*) > 1. If anything is found, issue a ROLLBACK to cancel the DML operation and then a RAISERROR saying that there are duplicates.
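A minimal sketch of such a Trigger follows. The table, trigger name, and error message are hypothetical, and restricting the check to the company_id values present in the inserted pseudo-table is my own assumption to avoid scanning the whole table. GO is the batch separator used by SSMS / sqlcmd, needed because CREATE TRIGGER must be the first statement in its batch.

```sql
-- Hypothetical table matching the columns discussed above.
CREATE TABLE dbo.CompanyOrgPath
(
    id         INT IDENTITY(1, 1) NOT NULL,
    company_id INT                NOT NULL,
    org_path   NVARCHAR(1000)     NOT NULL,  -- assumed type / length
    CONSTRAINT PK_CompanyOrgPath PRIMARY KEY CLUSTERED (company_id, id)
);
GO

CREATE TRIGGER dbo.TR_CompanyOrgPath_NoDuplicates
ON dbo.CompanyOrgPath
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Look for any (company_id, org_path) pair that now occurs more than once,
    -- checking only the companies touched by this statement.
    IF EXISTS
    (
        SELECT  1
        FROM    dbo.CompanyOrgPath AS cop
        WHERE   EXISTS (SELECT 1
                        FROM   inserted AS ins
                        WHERE  ins.company_id = cop.company_id)
        GROUP BY cop.company_id, cop.org_path
        HAVING  COUNT(*) > 1
    )
    BEGIN
        ROLLBACK TRANSACTION;  -- cancel the DML operation
        RAISERROR(N'Duplicate org_path value(s) found within a company_id.', 16, 1);
        RETURN;
    END;
END;
GO
```

Note that the GROUP BY compares org_path values using the column's Collation, so the choice of Collation determines what counts as a duplicate.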

Collation

In my initial answer (based on original wording / sparse details of the question, and available in the revision history here), I had suggested possibly using a binary (i.e. _BIN2) Collation. Now that we have insight into what exactly org_path is, I would not recommend using a binary Collation. Since there will be diacritical marks, you do want to make use of linguistic equivalences.
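As an illustration of what is meant by linguistic equivalences, the sketch below compares two ways of encoding the same accented character: a single precomposed code point versus a base letter followed by a combining accent. The two Collation names are just examples I picked, not ones taken from the Question. The expectation is that a linguistic (non-binary) Collation can treat the two forms as the same character, whereas a binary Collation compares the underlying code points and cannot.

```sql
-- Two encodings of "e with acute accent":
DECLARE @precomposed NVARCHAR(10) = NCHAR(0x00E9);         -- single code point
DECLARE @decomposed  NVARCHAR(10) = N'e' + NCHAR(0x0301);  -- "e" + combining acute accent

SELECT
    CASE WHEN @precomposed = @decomposed COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'different' END AS linguistic_comparison,
    CASE WHEN @precomposed = @decomposed COLLATE Latin1_General_100_BIN2
         THEN 'equal' ELSE 'different' END AS binary_comparison;
```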

For a non-unique clustered index on company_id alone, SQL Server will automatically add a 4-byte integer uniqueifier to all duplicate (i.e. second and subsequent for a key value) clustered index keys to make them unique. This is not exposed to the user though.

The advantage of adding your own unique identifier as a secondary key column is that you can then still seek by company_id but also seek to individual rows more efficiently (using company_id, identitycol rather than company_id with a residual predicate on org_path). The clustered index would then be unique on company_id, identitycol, so no hidden uniqueifiers would be added.
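For illustration, here are the two access patterns against a throwaway table variable. The column types, sample rows, and the name identitycol mapped to a column called id are assumptions.

```sql
-- Table variable with a unique clustered key on (company_id, id).
DECLARE @CompanyOrgPath TABLE
(
    id         INT IDENTITY(1, 1) NOT NULL,
    company_id INT                NOT NULL,
    org_path   NVARCHAR(1000)     NOT NULL,  -- assumed type / length
    PRIMARY KEY CLUSTERED (company_id, id)
);

INSERT INTO @CompanyOrgPath (company_id, org_path)
VALUES (10, N'/sales'), (10, N'/sales/emea'), (20, N'/engineering');

-- All rows for one company: a seek on the leading key column.
SELECT  id, org_path
FROM    @CompanyOrgPath
WHERE   company_id = 10;

-- One specific row: a seek on both key columns,
-- with no residual predicate on org_path.
SELECT  org_path
FROM    @CompanyOrgPath
WHERE   company_id = 10
AND     id = 2;
```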

Also, if you end up with duplicates for (company_id,org_path), having the explicit identity column (a sort of "exposed uniqueifier") will make it easier to target just one of them for delete or update.
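A small sketch of that clean-up, again using a hypothetical table variable and made-up rows: every duplicate except the one with the lowest id is deleted, which is only possible because the id column gives each copy its own handle.

```sql
DECLARE @CompanyOrgPath TABLE
(
    id         INT IDENTITY(1, 1) NOT NULL,
    company_id INT                NOT NULL,
    org_path   NVARCHAR(1000)     NOT NULL,  -- assumed type / length
    PRIMARY KEY CLUSTERED (company_id, id)
);

-- The second and third rows duplicate (company_id, org_path).
INSERT INTO @CompanyOrgPath (company_id, org_path)
VALUES (10, N'/sales'), (10, N'/sales/emea'), (10, N'/sales/emea');

-- Delete every row for which an older copy (lower id) of the same
-- (company_id, org_path) pair exists.
DELETE  dup
FROM    @CompanyOrgPath AS dup
WHERE   EXISTS (SELECT 1
                FROM   @CompanyOrgPath AS keep
                WHERE  keep.company_id = dup.company_id
                AND    keep.org_path   = dup.org_path
                AND    keep.id         < dup.id);

SELECT id, company_id, org_path
FROM   @CompanyOrgPath
ORDER BY company_id, id;
```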
