Can I add a unique constraint that ignores existing violations?

The answer is "yes". You can do this with a filtered index (see here for documentation).

For instance, you can do:

create unique index t_col on t(col) where id > 1000;

This creates a unique index, only on new rows, rather than on the old rows. This particular formulation would allow duplicates with existing values.

If you have just a handful of duplicates, you could do something like:

create unique index t_col on t(col) where id not in (<list of ids for duplicate values here>);

Yes you can do that.

Here is a table with duplicates:

CREATE TABLE dbo.Party
  (
    ID INT NOT NULL
           IDENTITY ,
    CONSTRAINT PK_Party PRIMARY KEY ( ID ) ,
    Name VARCHAR(30) NOT NULL
  ) ;
GO

INSERT  INTO dbo.Party
        ( Name )
VALUES  ( 'Frodo Baggins' ),
        ( 'Luke Skywalker' ),
        ( 'Luke Skywalker' ),
        ( 'Harry Potter' ) ;
GO

Let us ignore existing ones, and ensure that no new duplicates can be added:

-- Add a new column to mark grandfathered duplicates.
ALTER TABLE dbo.Party ADD IgnoreThisDuplicate INT NULL ;
GO

-- The *first* instance will be left NULL.
-- *Secondary* instances will be set to their ID (a unique value).
UPDATE  dbo.Party
SET     IgnoreThisDuplicate = ID
FROM    dbo.Party AS my
WHERE   EXISTS ( SELECT *
                 FROM   dbo.Party AS other
                 WHERE  other.Name = my.Name
                        AND other.ID < my.ID ) ;
GO

-- This constraint is not strictly necessary.
-- It prevents granting further exemptions beyond the ones we made above.
ALTER TABLE dbo.Party WITH NOCHECK
ADD CONSTRAINT CHK_Party_NoNewExemptions 
CHECK(IgnoreThisDuplicate IS NULL);
GO

SELECT * FROM dbo.Party;
GO

-- **THIS** is our pseudo-unique constraint.
-- It works because the grandfathered duplicates have a unique value (== their ID).
-- Non-grandfathered records just have NULL, which is not unique.
CREATE UNIQUE INDEX UNQ_Party_UniqueNewNames ON dbo.Party(Name, IgnoreThisDuplicate);
GO

Let us test this solution:

-- cannot add a name that exists
INSERT  INTO dbo.Party
        ( Name )
VALUES  ( 'Frodo Baggins' );

Cannot insert duplicate key row in object 'dbo.Party' with unique index 'UNQ_Party_UniqueNewNames'.

-- cannot add a name that exists and has an ignored duplicate
INSERT  INTO dbo.Party
        ( Name )
VALUES  ( 'Luke Skywalker' );

Cannot insert duplicate key row in object 'dbo.Party' with unique index 'UNQ_Party_UniqueNewNames'.


-- can add a new name 
INSERT  INTO dbo.Party
        ( Name )
VALUES  ( 'Hamlet' );

-- but only once
INSERT  INTO dbo.Party
        ( Name )
VALUES  ( 'Hamlet' );

Cannot insert duplicate key row in object 'dbo.Party' with unique index 'UNQ_Party_UniqueNewNames'.

The filtered unique index is a brilliant idea but it has a minor disadvantage - no matter if you use the WHERE identity_column > <current value> condition or the WHERE identity_column NOT IN (<list of ids for duplicate values here>).

With the first approach , you will still be able to insert duplicate data in the future, duplicates of existing (now) data. For example, if you have (even only one) row now with CompanyName = 'Software Inc.', the index will not forbid the insertion of one more row with same company name. It will only forbid it if you try twice.

With the second approach there is an improvement, the above will not work (which is good.) However, you will still be able to insert more duplicates or existing duplicates. For example, if you have (two or more) rows now with CompanyName = 'DoubleData Co.', the index will not forbid the insertion of one more row with same company name. It will only forbid it if you try twice.

(Update) This can be corrected if for every duplicate name, you keep out of the exclusion list one id. If, like the above example, there are 4 rows with duplicate CompanyName = DoubleData Co. and IDs 4,6,8,9, the exclusion list should have only 3 of these IDs.

With the second approach another disadvantage is the cumbersome condition (how much cumbersome depends on how many duplicates there are in the first place), since SQL-Server seems to not support the NOT IN operator in the WHERE part of filtered indexes. See SQL-Fiddle. Instead of WHERE (CompanyID NOT IN (3,7,4,6,8,9)), you'll have to have something like WHERE (CompanyID <> 3 AND CompanyID <> 7 AND CompanyID <> 4 AND CompanyID <> 6 AND CompanyID <> 8 AND CompanyID <> 9) I'm not sure if there are efficiency implications with such a condition, if you have hundreds of duplicate names.

Another solution (similar to @Alex Kuznetsov's) is to add another column, populate it with rank numbers and add a unique index including this column:

ALTER TABLE Company
  ADD Rn TINYINT DEFAULT 1;

UPDATE x
SET Rn = Rnk
FROM
  ( SELECT 
      CompanyID,
      Rn,
      Rnk = ROW_NUMBER() OVER (PARTITION BY CompanyName 
                               ORDER BY CompanyID)
    FROM Company 
  ) x ;

CREATE UNIQUE INDEX CompanyName_UQ 
  ON Company (CompanyName, Rn) ;

Then, inserting a row with duplicate name will fail because of the DEFAULT 1 property and the unique index. This is still not 100% foolproof (while Alex's is). Duplicates will still slip in if the Rn is explicitly set in the INSERT statement or if the Rn values are maliciously updated.

SQL-Fiddle-2

Can I add a unique constraint that ignores existing violations?

Tags:

Sql Server

Database Design

Azure Sql Database

Related

Recent Posts