Solution to assigning unique values to rows with finite collaboration distance

This problem is about following links between items. This puts it in the realm of graphs and graph processing. Specifically, the whole dataset forms a graph and we are looking for components of that graph. This can be illustrated by a plot of the sample data from the question.

[Figure: the sample data drawn as a graph, showing the three clusters that form the components (SupergroupKeys)]

The question says we can follow GroupKey or RecordKey to find other rows that share that value, so we can treat both as vertices in a graph. The question goes on to explain how GroupKeys 1–3 have the same SupergroupKey; this can be seen in the cluster on the left, joined by thin lines. The picture also shows the two other components (SupergroupKeys) formed by the original data.
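Concretely, every row of the table is one edge joining a GroupKey vertex to a RecordKey vertex, so the edge list for the graph is nothing more than the table itself:

-- Each row is one undirected edge: GroupKey vertex <-> RecordKey vertex.
-- This is exactly the query handed to the graph library further down.
select GroupKey, RecordKey
from dbo.Example;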

SQL Server has some graph processing ability built into T-SQL. At this time it is quite meagre, however, and not helpful with this problem. SQL Server can also call out to R and Python, and to the rich and robust suite of packages available to them. One such package is igraph, which is written for "fast handling of large graphs, with millions of vertices and edges".
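If external script execution has not been used on the instance before, it needs to be enabled, and igraph must be present in the instance's R library. A minimal sketch, assuming Machine Learning Services with R is installed:

-- Enable external script execution (a service restart may be needed
-- before the run value takes effect).
exec sp_configure 'external scripts enabled', 1;
reconfigure;

-- Confirm the igraph package is visible to the instance's R runtime.
exec sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(igraph_installed = as.character("igraph" %in% rownames(installed.packages())))';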

Using R and igraph I was able to process one million rows in 2 minutes 22 seconds in local testing¹. This is how it compares against the current best solution:

Record Keys     Paul White    R
------------    ----------    -------------------------------
Per question    15ms          ~220ms
100             80ms          ~270ms
1,000           250ms         430ms
10,000          1.4s          1.7s
100,000         14s           14s
1M              2m 29s        2m 22s
1M              n/a           1m 40s (process only, no display)

The first column is the number of distinct RecordKey values. The number of rows in the table will be 8 × this number.

When processing 1M rows, 1m 40s was used to load and process the graph and to update the table; a further 42s was required to populate an SSMS results grid with the output.

Observation of Task Manager while the 1M rows were processed suggests about 3GB of working memory was required. This was available on this system without paging.

I can confirm Ypercube's assessment of the recursive CTE approach. With a few hundred record keys it consumed 100% of CPU and all available RAM. Eventually tempdb grew to over 80GB and the SPID crashed.

I used Paul's table with the SupergroupKey column so there's a fair comparison between the solutions.

For some reason R objected to the accent on Poincaré. Changing it to a plain "e" allowed it to run. I didn't investigate since it's not germane to the problem at hand. I'm sure there's a solution.

Here is the code:

-- This captures the output from R so the base table can be updated.
drop table if exists #Results;

create table #Results
(
    Component   int         not NULL,
    Vertex      varchar(12) not NULL primary key
);


truncate table #Results;    -- facilitates re-execution

declare @Start time = sysdatetimeoffset();  -- for a 'total elapsed' calculation.

insert #Results(Component, Vertex)
exec sp_execute_external_script   
    @language = N'R',
    @input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
    @script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';

-- Write SuperGroupKey to the base table, as other solutions do
update e
set
    SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
    on r.Vertex = e.RecordKey;

-- Return all rows, as other solutions do
select
    e.SupergroupKey,
    e.GroupKey,
    e.RecordKey
from dbo.Example as e;

-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);

This is what the R code does:

  • @input_data_1 defines the query whose result set SQL Server passes to the R script, translated into an R data frame called InputDataSet.

  • library(igraph) imports the library into the R execution environment.

  • df.g <- graph.data.frame(d = InputDataSet, directed = FALSE) loads the data into an igraph object. This is an undirected graph since we can follow links from group to record or record to group. InputDataSet is SQL Server's default name for the dataset sent to R.

  • cpts <- components(df.g, mode = c("weak")) processes the graph to find discrete sub-graphs (components) and other measures.

  • OutputDataSet <- data.frame(cpts$membership) SQL Server expects a data frame back from R. Its default name is OutputDataSet. The component assignments are stored in a vector called "membership"; this statement translates the vector to a data frame.

  • OutputDataSet$VertexName <- V(df.g)$name V() returns the vertices of the graph - a list of GroupKeys and RecordKeys. This copies their names into the output data frame, creating a new column called VertexName. This is the key used to match to the source table for updating SupergroupKey.
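Note that #Results holds a row for every vertex, both GroupKeys and RecordKeys; only the RecordKey rows drive the update. A quick way to eyeball the mapping before updating (a throwaway check, not part of the solution):

-- Record-vertex rows match dbo.Example; group-vertex rows show NULLs on the right.
select r.Component, r.Vertex, e.GroupKey, e.RecordKey
from #Results as r
left join dbo.Example as e
    on e.RecordKey = r.Vertex
order by r.Component, r.Vertex;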

I'm not an R expert. Likely this could be optimised.

Test Data

The OP's data was used for validation. For scale tests I used the following script.

drop table if exists Records;
drop table if exists Groups;

create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go

set nocount on;

-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount    int             = 1000000;

-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier     numeric(4, 2)   = 2.7;

-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount     int             = FLOOR(@RecordCount * @Multiplier);


-- This is a poor man's numbers table.
insert Groups(GroupKey)
select top(@GroupCount)
    ROW_NUMBER() over (order by (select NULL))
from sys.objects as a
cross join sys.objects as b
--cross join sys.objects as c  -- include if needed


declare @c int = 0
while @c < @RecordCount
begin
    -- Can't use a set-based method since RAND() gives the same value for all rows.
    -- There are better ways to do this, but it works well enough.
    -- RecordKeys will be 10 letters, a-z.
    insert Records(RecordKey)
    select
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND()));

    set @c += 1;
end
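
-- A set-based alternative (a sketch only): unlike RAND(), NEWID() is evaluated
-- per row, so CHECKSUM(NEWID()) gives an independent random value in every row
-- and the loop above could be replaced. Assumes Groups (populated above) has at
-- least @RecordCount rows to draw from; DISTINCT guards against the (very
-- unlikely) duplicate key.
--insert Records(RecordKey)
--select distinct top(@RecordCount)
--    CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) + CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) +
--    CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) + CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) +
--    CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) + CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) +
--    CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) + CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) +
--    CHAR(97 + ABS(CHECKSUM(NEWID())) % 26) + CHAR(97 + ABS(CHECKSUM(NEWID())) % 26)
--from Groups;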


-- Process each RecordKey in alphabetical order.
-- For each choose 8 GroupKeys to pair with it.
declare @RecordKey varchar(12) = '';
declare @Groups table (GroupKey int not null);

truncate table dbo.Example;

select top(1) @RecordKey = RecordKey 
from Records 
where RecordKey > @RecordKey 
order by RecordKey;

while @@ROWCOUNT > 0
begin
    print @Recordkey;

    delete @Groups;

    insert @Groups(GroupKey)
    select distinct C
    from
    (
        -- Hard-code 8 random choices, per the OP's statistics
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
    ) as T(C);

    insert dbo.Example(GroupKey, RecordKey)
    select
        GroupKey, @RecordKey
    from @Groups;

    select top(1) @RecordKey = RecordKey 
    from Records 
    where RecordKey > @RecordKey 
    order by RecordKey;
end

-- Rebuild the indexes to have a consistent environment
alter index iExample on dbo.Example rebuild partition = all 
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, 
      ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);


-- Check what we ended up with:
select COUNT(*) from dbo.Example;  -- Should be @RecordCount * 8
                                   -- Often a little less due to random clashes
select 
    ByGroup = AVG(C)
from
(
    select CONVERT(float, COUNT(1) over(partition by GroupKey)) 
    from dbo.Example
) as T(C);

select
    ByRecord = AVG(C)
from
(
    select CONVERT(float, COUNT(1) over(partition by RecordKey)) 
    from dbo.Example
) as T(C);

I've just now realised I got the ratios the wrong way around from the OP's definition. I don't believe this will affect the timings. Records and Groups are symmetrical in this process; to the algorithm they're all just nodes in a graph.

In testing, the data invariably formed a single component. I believe this is due to the uniform distribution of the data. If, instead of the static 1:8 ratio hard-coded into the generation routine, I had allowed the ratio to vary, there would more likely have been further components.
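The component count can be confirmed after the R solution has run by counting the distinct supergroup values written back to the table (a trivial check):

-- One row per component found by igraph.
select Components = COUNT(distinct SupergroupKey)
from dbo.Example;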



¹ Machine spec: Microsoft SQL Server 2017 (RTM-CU12), Developer Edition (64-bit), Windows 10 Home. 16GB RAM, SSD, 4 core hyperthreaded i7, 2.8GHz nominal. The tests were the only items running at the time, other than normal system activity (about 4% CPU).


This is an iterative T-SQL solution for performance comparison.

It assumes that an extra column can be added to the table to store the super group key, and the indexing can be changed:

Setup

DROP TABLE IF EXISTS 
    dbo.Example;

CREATE TABLE dbo.Example
(
    SupergroupKey integer NOT NULL
        DEFAULT 0, 
    GroupKey integer NOT NULL, 
    RecordKey varchar(12) NOT NULL,

    CONSTRAINT iExample 
    PRIMARY KEY CLUSTERED 
        (GroupKey ASC, RecordKey ASC),

    CONSTRAINT [IX dbo.Example RecordKey, GroupKey]
    UNIQUE NONCLUSTERED (RecordKey, GroupKey),

    INDEX [IX dbo.Example SupergroupKey, GroupKey]
        (SupergroupKey ASC, GroupKey ASC)
);

INSERT dbo.Example
    (GroupKey, RecordKey)
VALUES 
    (1, 'Archimedes'), 
    (1, 'Newton'),
    (1, 'Euler'),
    (2, 'Euler'),
    (2, 'Gauss'),
    (3, 'Gauss'),
    (3, 'Poincaré'),
    (4, 'Ramanujan'),
    (5, 'Neumann'),
    (5, 'Grothendieck'),
    (6, 'Grothendieck'),
    (6, 'Tao');

If you are able to reverse the key order of the present primary key, the extra unique index will not be required.
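As an illustration, that variation could look like the following (a sketch only; the timings below use the table exactly as defined above):

CREATE TABLE dbo.Example
(
    SupergroupKey integer NOT NULL
        DEFAULT 0, 
    GroupKey integer NOT NULL, 
    RecordKey varchar(12) NOT NULL,

    -- Reversing the primary key to lead on RecordKey makes the separate
    -- unique nonclustered index on (RecordKey, GroupKey) redundant.
    CONSTRAINT iExample 
    PRIMARY KEY CLUSTERED 
        (RecordKey ASC, GroupKey ASC),

    INDEX [IX dbo.Example SupergroupKey, GroupKey]
        (SupergroupKey ASC, GroupKey ASC)
);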

Outline

This solution's approach is:

  1. Set the super group id to 1
  2. Find the lowest-numbered unprocessed group key
  3. If none found, exit
  4. Set the super group for all rows with the current group key
  5. Set the super group for all rows related to rows in the current group
  6. Repeat step 5 until no rows are updated
  7. Increment the current super group id
  8. Go to step 2

Implementation

Comments inline:

-- No execution plans or rows affected messages
SET NOCOUNT ON;
SET STATISTICS XML OFF;

-- Reset all supergroups
UPDATE E
SET SupergroupKey = 0
FROM dbo.Example AS E
    WITH (TABLOCKX)
WHERE 
    SupergroupKey != 0;

DECLARE 
    @CurrentSupergroup integer = 0,
    @CurrentGroup integer = 0;

WHILE 1 = 1
BEGIN
    -- Next super group
    SET @CurrentSupergroup += 1;

    -- Find the lowest unprocessed group key
    SELECT 
        @CurrentGroup = MIN(E.GroupKey)
    FROM dbo.Example AS E
    WHERE 
        E.SupergroupKey = 0;

    -- Exit when no more unprocessed groups
    IF @CurrentGroup IS NULL BREAK;

    -- Set super group for all records in the current group
    UPDATE E
    SET E.SupergroupKey = @CurrentSupergroup
    FROM dbo.Example AS E 
    WHERE 
        E.GroupKey = @CurrentGroup;

    -- Iteratively find all groups for the super group
    WHILE 1 = 1
    BEGIN
        WITH 
            RecordKeys AS
            (
                SELECT DISTINCT
                    E.RecordKey
                FROM dbo.Example AS E
                WHERE
                    E.SupergroupKey = @CurrentSupergroup
            ),
            GroupKeys AS
            (
                SELECT DISTINCT
                    E.GroupKey
                FROM RecordKeys AS RK
                JOIN dbo.Example AS E
                    WITH (FORCESEEK)
                    ON E.RecordKey = RK.RecordKey
            )
        UPDATE E WITH (TABLOCKX)
        SET SupergroupKey = @CurrentSupergroup
        FROM GroupKeys AS GK
        JOIN dbo.Example AS E
            ON E.GroupKey = GK.GroupKey
        WHERE
            E.SupergroupKey = 0
        OPTION (RECOMPILE, QUERYTRACEON 9481); -- 9481 = original cardinality estimator; it does better here

        -- Break when no more related groups found
        IF @@ROWCOUNT = 0 BREAK;
    END;
END;

SELECT
    E.SupergroupKey,
    E.GroupKey,
    E.RecordKey
FROM dbo.Example AS E;

Execution plan

For the key update:

[Figure: execution plan for the key update]

Result

The final state of the table is:

╔═══════════════╦══════════╦══════════════╗
║ SupergroupKey ║ GroupKey ║  RecordKey   ║
╠═══════════════╬══════════╬══════════════╣
║             1 ║        1 ║ Archimedes   ║
║             1 ║        1 ║ Euler        ║
║             1 ║        1 ║ Newton       ║
║             1 ║        2 ║ Euler        ║
║             1 ║        2 ║ Gauss        ║
║             1 ║        3 ║ Gauss        ║
║             1 ║        3 ║ Poincaré     ║
║             2 ║        4 ║ Ramanujan    ║
║             3 ║        5 ║ Grothendieck ║
║             3 ║        5 ║ Neumann      ║
║             3 ║        6 ║ Grothendieck ║
║             3 ║        6 ║ Tao          ║
╚═══════════════╩══════════╩══════════════╝

Demo: db<>fiddle

Performance tests

Using the expanded test data set provided in Michael Green's answer, timings on my laptop* are:

╔═════════════╦════════╗
║ Record Keys ║  Time  ║
╠═════════════╬════════╣
║ 10k         ║ 2s     ║
║ 100k        ║ 12s    ║
║ 1M          ║ 2m 30s ║
╚═════════════╩════════╝

* Microsoft SQL Server 2017 (RTM-CU13), Developer Edition (64-bit), Windows 10 Pro, 16GB RAM, SSD, 4 core hyperthreaded i7, 2.4GHz nominal.


A recursive CTE method that is likely to be horribly inefficient on big tables:

WITH rCTE AS 
(
    -- Anchor
    SELECT 
        GroupKey, RecordKey, 
        CAST('|' + CAST(GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS GroupKeys,
        CAST('|' + CAST(RecordKey AS VARCHAR(12)) + '|' AS VARCHAR(100)) AS RecordKeys,
        1 AS lvl
    FROM Example

    UNION ALL

    -- Recursive
    SELECT
        e.GroupKey, e.RecordKey, 
        CASE WHEN r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
            THEN CAST(r.GroupKeys + CAST(e.GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
            ELSE r.GroupKeys
        END,
        CASE WHEN r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(12)) + '|%'
            THEN CAST(r.RecordKeys + CAST(e.RecordKey AS VARCHAR(12)) + '|' AS VARCHAR(100))
            ELSE r.RecordKeys
        END,
        r.lvl + 1
    FROM rCTE AS r
         JOIN Example AS e
         -- Follow a shared RecordKey to a GroupKey not yet visited...
         ON  e.RecordKey = r.RecordKey
         AND r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
         -- ...or a shared GroupKey to a RecordKey not yet visited
         OR e.GroupKey = r.GroupKey
         AND r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(12)) + '|%'
)
SELECT 
    ROW_NUMBER() OVER (ORDER BY GroupKeys) AS SuperGroupKey,
    GroupKeys, RecordKeys
FROM rCTE AS c
WHERE NOT EXISTS
      ( SELECT 1
        FROM rCTE AS m
        WHERE m.lvl > c.lvl
          AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
        OR    m.lvl = c.lvl
          AND ( m.GroupKey > c.GroupKey
             OR m.GroupKey = c.GroupKey
             AND m.RecordKeys > c.RecordKeys
              )
          AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
          AND c.GroupKeys LIKE '%|' + CAST(m.GroupKey AS VARCHAR(10)) + '|%'
      ) 
OPTION (MAXRECURSION 0) ;

Tested in dbfiddle.uk