Collect all Similar Persons to One Group

It will be better to make such calculations in iterative manner.

The "issue" of such solution is that, in case of large source table, the output table will contains all the rows. But, actually, this is the benefit also: in case of large source table, such code will work much more faster, that recursions.


if object_id('tempdb..#src') is not null drop table #src;
create table #src (id1 int not null, id2 int)

if object_id('tempdb..#rez') is not null drop table #rez;
create table #rez (id int not null, chainid int not null);

declare @chainId int;

insert #src
values (10, 20), (10, 30), (30, 30), (10, 40), (50, 70), (60, 50), (70, 70)
--values (4457745,255714),(4457745,2540222),(2540222,4457745),(255714,4457745)

set @chainId = 1

while 1 = 1
begin
    insert #rez(id, chainid)
    select top 1 id1, @chainId
    from #src
    where
        Id1 NOT IN (select Id from #rez)
        and id2 NOT IN (select Id from #rez)

    if @@rowcount = 0 break

    while 1 = 1
    begin
        insert #rez(id, chainid)

        select id1, @chainid
        from #src
        where
            id2 in (select id from #rez where chainid = @chainId)
            and id1 not in (select id from #rez where chainid = @chainId)

        union

        select id2, @chainId
        from #src
        where
            id1 in (select id from #rez where chainid = @chainId)
            and id2 not in (select id from #rez where chainid = @chainId)

        if @@rowcount = 0 break
    end

    select @chainId = @chainId + 1
end

select [NewId] = chainid, OldID = id
from #rez 
order by 1, 2;

In some details, the "logic" of the script is the following:

  • set new ID (chainid) initial value (ex. 1)

    • get the first entry which was not found yet (external cycle)

      • find and store all related to it (in both columns) (internal cycle)
    • increment the new ID

    • find next entry which is not in the result table

    • continue till have data to work

  • return sorted result


Here I adapted my old answer from a similar SO question How to find all connected subgraphs of an undirected graph.

This variant doesn't use a cursor or explicit loop, but uses a single recursive query.

Essentially, it treats the data as edges in a graph and traverses recursively all edges of the graph, stopping when the loop is detected. Then it puts all found loops in groups and gives each group a number.

See the detailed explanations of how it works below. I recommend you to run the query CTE-by-CTE and examine each intermediate result to understand what it does.

Sample data

DECLARE @T TABLE (ID1 int NOT NULL, ID2 int NOT NULL);
INSERT INTO @T (ID1, ID2) VALUES
(10, 20),
(10, 30),
(30, 30),
(10, 40),
(50, 70),
(60, 50),
(70, 70),
(4457745, 255714),
(4457745,2540222),
(2540222,4457745),
( 255714,4457745);

Query

WITH
CTE_Ids
AS
(
    SELECT ID1 AS ID
    FROM @T

    UNION

    SELECT ID2 AS ID
    FROM @T
)
,CTE_Pairs
AS
(
    SELECT ID1, ID2
    FROM @T
    WHERE ID1 <> ID2

    UNION

    SELECT ID2 AS ID1, ID1 AS ID2
    FROM @T
    WHERE ID1 <> ID2
)
,CTE_Recursive
AS
(
    SELECT
        CAST(CTE_Ids.ID AS varchar(8000)) AS AnchorID
        ,ID1
        ,ID2
        ,CAST(
            ',' + CAST(ID1 AS varchar(8000)) + 
            ',' + CAST(ID2 AS varchar(8000)) + 
            ',' AS varchar(8000)) AS IdPath
        ,1 AS Lvl
    FROM
        CTE_Pairs
        INNER JOIN CTE_Ids ON CTE_Ids.ID = CTE_Pairs.ID1

    UNION ALL

    SELECT
        CTE_Recursive.AnchorID
        ,CTE_Pairs.ID1
        ,CTE_Pairs.ID2
        ,CAST(
            CTE_Recursive.IdPath + 
            CAST(CTE_Pairs.ID2 AS varchar(8000)) +
            ',' AS varchar(8000)) AS IdPath
        ,CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        CTE_Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.ID2 = CTE_Pairs.ID1
    WHERE
        CTE_Recursive.IdPath NOT LIKE 
            CAST('%,' + CAST(CTE_Pairs.ID2 AS varchar(8000)) + ',%' AS varchar(8000))
)
,CTE_RecursionResult
AS
(
    SELECT AnchorID, ID1, ID2
    FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
    SELECT AnchorID, ID1 AS ID
    FROM CTE_RecursionResult

    UNION

    SELECT AnchorID, ID2 AS ID
    FROM CTE_RecursionResult
)
SELECT
    DENSE_RANK() OVER (ORDER BY CA_Data.XML_Value) AS [NewID]
    ,CTE_Ids.ID AS [OldID]
    ,CA_Data.XML_Value AS GroupMembers
FROM
    CTE_Ids
    CROSS APPLY
    (
        SELECT CAST(CTE_CleanResult.ID AS nvarchar(max)) + ','
        FROM CTE_CleanResult
        WHERE CTE_CleanResult.AnchorID = CTE_Ids.ID
        ORDER BY CTE_CleanResult.ID FOR XML PATH(''), TYPE
    ) AS CA_XML(XML_Value)
    CROSS APPLY
    (
        SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
    ) AS CA_Data(XML_Value)
ORDER BY [NewID], [OldID];

Result

+-------+---------+-------------------------+
| NewID |  OldID  |      GroupMembers       |
+-------+---------+-------------------------+
|     1 |      10 | 10,20,30,40,            |
|     1 |      20 | 10,20,30,40,            |
|     1 |      30 | 10,20,30,40,            |
|     1 |      40 | 10,20,30,40,            |
|     2 |  255714 | 255714,2540222,4457745, |
|     2 | 2540222 | 255714,2540222,4457745, |
|     2 | 4457745 | 255714,2540222,4457745, |
|     3 |      50 | 50,60,70,               |
|     3 |      60 | 50,60,70,               |
|     3 |      70 | 50,60,70,               |
+-------+---------+-------------------------+

How it works

CTE_Ids

CTE_Ids gives the list of all Identifiers that appear in both ID1 and ID2 columns. Since they can appear in any order we UNION both columns together. UNION also removes any duplicates.

+---------+
|   ID    |
+---------+
|      10 |
|      20 |
|      30 |
|      40 |
|      50 |
|      60 |
|      70 |
|  255714 |
| 2540222 |
| 4457745 |
+---------+

CTE_Pairs

CTE_Pairs gives the list of all edges of the graph in both directions. Again, UNION is used to remove any duplicates.

+---------+---------+
|   ID1   |   ID2   |
+---------+---------+
|      10 |      20 |
|      10 |      30 |
|      10 |      40 |
|      20 |      10 |
|      30 |      10 |
|      40 |      10 |
|      50 |      60 |
|      50 |      70 |
|      60 |      50 |
|      70 |      50 |
|  255714 | 4457745 |
| 2540222 | 4457745 |
| 4457745 |  255714 |
| 4457745 | 2540222 |
+---------+---------+

CTE_Recursive

CTE_Recursive is the main part of the query that recursively traverses the graph starting from each unique Identifier. These starting rows are produced by the first part of UNION ALL. The second part of UNION ALL recursively joins to itself linking ID2 to ID1. Since we pre-made CTE_Pairs with all edges written in both directions, we can always link only ID2 to ID1 and we'll get all paths in the graph. At the same time the query builds IdPath - a string of comma-delimited Identifiers that have been traversed so far. It is used in the WHERE filter:

CTE_Recursive.IdPath NOT LIKE 
    CAST('%,' + CAST(CTE_Pairs.ID2 AS varchar(8000)) + ',%' AS varchar(8000))

As soon as we come across the Identifier that had been included in the Path before, the recursion stops as the list of connected nodes is exhausted. AnchorID is the starting Identifier for the recursion, it will be used later to group results. Lvl is not really used, I included it for better understanding of what is going on.

+----------+---------+---------+--------------------------+-----+
| AnchorID |   ID1   |   ID2   |          IdPath          | Lvl |
+----------+---------+---------+--------------------------+-----+
|       10 |      10 |      20 | ,10,20,                  |   1 |
|       10 |      10 |      30 | ,10,30,                  |   1 |
|       10 |      10 |      40 | ,10,40,                  |   1 |
|       20 |      20 |      10 | ,20,10,                  |   1 |
|       30 |      30 |      10 | ,30,10,                  |   1 |
|       40 |      40 |      10 | ,40,10,                  |   1 |
|       50 |      50 |      60 | ,50,60,                  |   1 |
|       50 |      50 |      70 | ,50,70,                  |   1 |
|       60 |      60 |      50 | ,60,50,                  |   1 |
|       70 |      70 |      50 | ,70,50,                  |   1 |
|   255714 |  255714 | 4457745 | ,255714,4457745,         |   1 |
|  2540222 | 2540222 | 4457745 | ,2540222,4457745,        |   1 |
|  4457745 | 4457745 |  255714 | ,4457745,255714,         |   1 |
|  4457745 | 4457745 | 2540222 | ,4457745,2540222,        |   1 |
|  2540222 | 4457745 |  255714 | ,2540222,4457745,255714, |   2 |
|   255714 | 4457745 | 2540222 | ,255714,4457745,2540222, |   2 |
|       70 |      50 |      60 | ,70,50,60,               |   2 |
|       60 |      50 |      70 | ,60,50,70,               |   2 |
|       40 |      10 |      20 | ,40,10,20,               |   2 |
|       40 |      10 |      30 | ,40,10,30,               |   2 |
|       30 |      10 |      20 | ,30,10,20,               |   2 |
|       30 |      10 |      40 | ,30,10,40,               |   2 |
|       20 |      10 |      30 | ,20,10,30,               |   2 |
|       20 |      10 |      40 | ,20,10,40,               |   2 |
+----------+---------+---------+--------------------------+-----+

CTE_CleanResult

CTE_CleanResult leaves only relevant parts from CTE_Recursive and again merges both ID1 and ID2 using UNION.

+----------+---------+
| AnchorID |   ID    |
+----------+---------+
|       10 |      10 |
|       10 |      20 |
|       10 |      30 |
|       10 |      40 |
|       20 |      10 |
|       20 |      20 |
|       20 |      30 |
|       20 |      40 |
|  2540222 |  255714 |
|  2540222 | 2540222 |
|  2540222 | 4457745 |
|   255714 |  255714 |
|   255714 | 2540222 |
|   255714 | 4457745 |
|       30 |      10 |
|       30 |      20 |
|       30 |      30 |
|       30 |      40 |
|       40 |      10 |
|       40 |      20 |
|       40 |      30 |
|       40 |      40 |
|  4457745 |  255714 |
|  4457745 | 2540222 |
|  4457745 | 4457745 |
|       50 |      50 |
|       50 |      60 |
|       50 |      70 |
|       60 |      50 |
|       60 |      60 |
|       60 |      70 |
|       70 |      50 |
|       70 |      60 |
|       70 |      70 |
+----------+---------+

Final SELECT

Now we need to build a string of comma-separated ID values for each AnchorID. CROSS APPLY with FOR XML does it. DENSE_RANK() calculates the NewID numbers for each AnchorID.


This query builds a full path for each identifier, which is not really necessary.

It is possible to make the overall solution more efficient by running this query once for each group/subgraph, in a loop.

Pick a single starting Identifier (add WHERE ID = 10 to CTE_Ids and CTE_Pairs), then delete from the source table all identifiers that were appended to this group. Repeat until the big table is empty.