Locate the smallest missing element based on a specific formula

There are a few challenges with this question. Indexes in SQL Server can do the following very efficiently with just a few logical reads each:

  • check that a row exists
  • check that a row doesn't exist
  • find the next row starting at some point
  • find the previous row starting at some point

However, they cannot be used to find the Nth row in an index. Doing that requires you to roll your own index stored as a table or to scan the first N rows in the index. Your C# code heavily relies on the fact that you can efficiently find the Nth element of the array, but you can't do that here. I think that algorithm isn't usable for T-SQL without a data model change.
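To make the distinction concrete, here's a small sketch (using the dbo.BINARY_PROBLEMS test table defined further down in this answer): finding the next row after a starting point is a single index seek, while finding the Nth row forces SQL Server to read N rows from the index.

DECLARE @StartingPoint BINARY(64) = CAST(1000 AS BINARY(64));

-- "find the next row starting at some point": one seek, a few logical reads
SELECT TOP (1) KeyCol
FROM dbo.BINARY_PROBLEMS
WHERE KeyCol > @StartingPoint
ORDER BY KeyCol;

-- "find the Nth row": no equivalent seek; this has to read a million rows
SELECT KeyCol
FROM dbo.BINARY_PROBLEMS
ORDER BY KeyCol
OFFSET 999999 ROWS FETCH NEXT 1 ROWS ONLY;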

The second challenge relates to the restrictions on the BINARY data types. As far as I can tell you cannot perform addition, subtraction, or division in the usual ways. You can convert your BINARY(64) to a BIGINT and it won't throw conversion errors, but the behavior is not defined:

Conversions between any data type and the binary data types are not guaranteed to be the same between versions of SQL Server.

In addition, the lack of conversion errors is somewhat of a problem here. You can convert values larger than the largest possible BIGINT value, but you'll silently get the wrong results.
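Here's a quick sketch of the sort of thing I mean: a value one larger than the biggest BIGINT converts without any error, but the result silently wraps around to a negative number.

-- 0x8000000000000000 is 9223372036854775808, one more than the BIGINT maximum
SELECT CAST(0x8000000000000000 AS BIGINT) AS WrappedValue;
-- returns -9223372036854775808 with no error raised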

It's true that you have values right now that are bigger than 9223372036854775807. However, if you're always starting at 1 and searching for the smallest missing value then those large values cannot be relevant unless your table has more than 9223372036854775807 rows. This seems unlikely because your table at that point would be around 2000 exabytes, so for the purposes of answering your question I'm going to assume that the very large values do not need to be searched. I'm also going to do data type conversions because they seem to be unavoidable.

For the test data, I inserted the equivalent of 50 million sequential integers into a table along with 50 million more integers with a single value gap about every 20 values. I also inserted a single value that won't properly fit in a signed BIGINT:

CREATE TABLE dbo.BINARY_PROBLEMS (
    KeyCol BINARY(64) NOT NULL
);

INSERT INTO dbo.BINARY_PROBLEMS WITH (TABLOCK)
SELECT CAST(SUM(OFFSET) OVER (ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS BINARY(64))
FROM
(
    SELECT 1 + CASE WHEN t.RN > 50000000 THEN
        -- after the first 50 million values, skip a value about once every 20 rows
        CASE WHEN ABS(CHECKSUM(NewId()) % 20) = 10 THEN 1 ELSE 0 END
    ELSE 0 END OFFSET
    FROM
    (
        SELECT TOP (100000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
        FROM master..spt_values t1
        CROSS JOIN master..spt_values t2
        CROSS JOIN master..spt_values t3
    ) t
) tt
OPTION (MAXDOP 1);

CREATE UNIQUE CLUSTERED INDEX CI_BINARY_PROBLEMS ON dbo.BINARY_PROBLEMS (KeyCol);

-- add a value too large for BIGINT
INSERT INTO dbo.BINARY_PROBLEMS
SELECT CAST(0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008000000000000000 AS BINARY(64));

That code took a few minutes to run on my machine. I made the first half of the table not have any gaps to represent a sort of worst case for performance. The code that I used to solve the problem scans the index in order, so it will finish very quickly if the first gap is early on in the table. Before we get to that, let's verify that the data is as it should be:

SELECT TOP (2) KeyColBigInt
FROM
(
    SELECT KeyCol
    , CAST(KeyCol AS BIGINT) KeyColBigInt
    FROM dbo.BINARY_PROBLEMS
) t
ORDER BY KeyCol DESC;

The results show that the maximum value that converts correctly to BIGINT is 102500672:

╔══════════════════════╗
║     KeyColBigInt     ║
╠══════════════════════╣
║ -9223372036854775808 ║
║            102500672 ║
╚══════════════════════╝

There are 100 million rows with values that fit into BIGINT as expected:

SELECT COUNT(*) 
FROM dbo.BINARY_PROBLEMS
WHERE KeyCol < 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007FFFFFFFFFFFFFFF;

One approach to this problem is to scan the index in order and to quit as soon as a row's value doesn't match the expected ROW_NUMBER() value. The entire table does not need to be scanned to return a result: only the rows up to the first gap are read. Here's one way to write code that is likely to get that query plan:

SELECT TOP (1) KeyCol
FROM
(
    SELECT KeyCol
    , CAST(KeyCol AS BIGINT) KeyColBigInt
    , ROW_NUMBER() OVER (ORDER BY KeyCol) RN
    FROM dbo.BINARY_PROBLEMS
    WHERE KeyCol < 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007FFFFFFFFFFFFFFF
) t
WHERE KeyColBigInt <> RN
ORDER BY KeyCol;

For reasons beyond the scope of this answer, this query will often be run serially by SQL Server, and SQL Server will often underestimate the number of rows that need to be scanned before the first match is found. On my machine, SQL Server scans 50000022 rows from the index before finding the first match. The query takes 11 seconds to run. Note that this returns the first value past the gap. It's not clear which row you want exactly, but you should be able to change the query to fit your needs without a lot of trouble (one variation follows the plan below). Here's what the plan looks like:

serial plan
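As one example of such a variation: if what you actually want is the missing value itself rather than the first existing value after it, a sketch (assuming the sequence is expected to start at 1, as in the test data) is to return the expected ROW_NUMBER() value at the first mismatch:

SELECT TOP (1) CAST(RN AS BINARY(64)) MissingKeyCol
FROM
(
    SELECT KeyCol
    , CAST(KeyCol AS BIGINT) KeyColBigInt
    , ROW_NUMBER() OVER (ORDER BY KeyCol) RN
    FROM dbo.BINARY_PROBLEMS
    WHERE KeyCol < 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007FFFFFFFFFFFFFFF
) t
WHERE KeyColBigInt <> RN
ORDER BY KeyCol;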

My only other idea was to bully SQL Server into using parallelism for the query. I have four CPUs, so I'm going to split the data up into four ranges and do seeks on those ranges. Each CPU will be assigned a range. To calculate the ranges I just grabbed the max value and assumed that the data was evenly distributed. If you want to be smarter about it you could look at a sampled stats histogram for the column values and build your ranges that way. The code below relies on a lot of undocumented tricks that aren't safe for production, including trace flag 8649:

SELECT TOP 1 ca.KeyCol
FROM (
    SELECT 1 bucket_min_value, 25625168 bucket_max_value
    UNION ALL
    SELECT 25625169, 51250336
    UNION ALL
    SELECT 51250337, 76875504
    UNION ALL
    SELECT 76875505, 102500672
) buckets
CROSS APPLY (
    SELECT TOP 1 t.KeyCol
    FROM
    (
        SELECT KeyCol
        , CAST(KeyCol AS BIGINT) KeyColBigInt
        , buckets.bucket_min_value - 1 + ROW_NUMBER() OVER (ORDER BY KeyCol) RN
        FROM dbo.BINARY_PROBLEMS
        WHERE KeyCol >= CAST(buckets.bucket_min_value AS BINARY(64)) AND KeyCol <=  CAST(buckets.bucket_max_value AS BINARY(64))
    ) t
    WHERE t.KeyColBigInt <> t.RN
    ORDER BY t.KeyCol
) ca
ORDER BY ca.KeyCol
OPTION (QUERYTRACEON 8649);

Here is what the parallel nested loop pattern looks like:

parallel plan

Overall, the query does more work than before since it'll scan more rows in the table. However, it now runs in 7 seconds on my desktop. It might parallelize better on a real server. Here's a link to the actual plan.
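As an aside, the bucket boundaries above are hard-coded from the maximum value I observed. If you'd rather derive them on the fly (and don't want to go the stats histogram route), a rough sketch like this produces the same four ranges, which you could then feed into the CROSS APPLY in place of the UNION ALL derived table:

DECLARE @max_value BIGINT;

SELECT @max_value = MAX(CAST(KeyCol AS BIGINT))
FROM dbo.BINARY_PROBLEMS
WHERE KeyCol < 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007FFFFFFFFFFFFFFF;

-- assumes four buckets and a reasonably even distribution of values
SELECT v.n bucket_number
, 1 + (v.n - 1) * (@max_value / 4) bucket_min_value
, CASE WHEN v.n = 4 THEN @max_value ELSE v.n * (@max_value / 4) END bucket_max_value
FROM (VALUES (1), (2), (3), (4)) v(n);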

I really can't think of a good way to solve this problem. Doing the calculation outside of SQL or changing the data model may be your best bets.


Joe's already hit on most of the points I just spent an hour typing up, in summary:

  • highly doubtful you'll ever run out of KeyCol values < bigint max (9.2e18), so conversions (if necessary) to/from bigint should not be a problem as long as you limit searches to KeyCol <= 0x00..007FFFFFFFFFFFFFFF
  • I can't think of a query that's going to 'efficiently' find a gap all the time; you may get lucky and find a gap near the beginning of your search, or you could pay dearly to find the gap quite a ways into your search
  • while I briefly thought about how to parallelize the query, I quickly discarded that idea (as a DBA I would not want to find out that your process is routinely bogging down my dataserver with 100% cpu utilization ... especially if you could have multiple copies of this running at the same time); noooo ... parallelization is going to be out of the question

So, what to do?

Let's put the (repeated, cpu-intensive, brute force) search idea on hold for a minute and look at the bigger picture.

  • on an average basis one instance of this search is going to need to scan millions of index keys (and require a good bit of cpu, thrashing of db cache, and a user watching a spinning hour glass) just to locate a single value
  • multiply the cpu-usage/cache-thrashing/spinning-hour-glass by ... how many searches do you expect in a day?
  • keep in mind that, generally speaking, each instance of this search is going to need to scan the same set of (millions of) index keys; that's a LOT of repeated activity for such minimal benefit

What I'd like to propose is some additions to the data model ...

  • a new table that keeps track of a set of 'available to use' KeyCol values, eg: available_for_use(KeyCol binary(64) not null primary key) (a rough sketch of these objects follows this list)
  • how many records you maintain in this table is up to you to decide, eg, perhaps enough for a month's worth of activity?
  • the table can periodically (weekly?) be 'topped off' with a new batch of KeyCol values (perhaps create a 'top off' stored proc?) [eg, update Joe's select/top/row_number() query to do a top 100000]
  • you could setup a monitoring process to keep track of the number of available entries in available_for_use just in case you ever start to run low on values
  • a new (or modified) DELETE trigger on the >main_table< that places deleted KeyCol values into our new table available_for_use whenever a row is deleted from the main table
  • if you allow updates of the KeyCol column then a new/modified UPDATE trigger on the >main_table< to also keep our new table available_for_use updated
  • when it comes time to 'search' for a new KeyCol value you select min(KeyCol) from available_for_use (obviously there's a bit more to this since a) you'll need to code for concurrency issues - don't want 2 copies of your process grabbing the same min(KeyCol) and b) you'll need to delete min(KeyCol) from the table; this should be relatively easy to code, perhaps as a stored proc, and can be addressed in another Q&A if necessary)
  • in a worst case scenario, if your select min(KeyCol) process finds no available rows, you could kick off your 'top off' proc to generate a new batch of rows
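To make that a bit more concrete, here's a rough sketch of what the proposed objects could look like. All of the names are mine, dbo.main_table is a placeholder for your real >main_table<, and the details (batch sizes, error handling, exact locking strategy) would need to be tuned for your environment; treat this as a sketch rather than a finished implementation.

-- pool of pre-generated 'available to use' key values
CREATE TABLE dbo.available_for_use (
    KeyCol BINARY(64) NOT NULL PRIMARY KEY
);
GO

-- recycle keys back into the pool whenever rows are deleted from the main table
CREATE TRIGGER dbo.TR_main_table_Delete
ON dbo.main_table    -- placeholder name; substitute your actual table
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.available_for_use (KeyCol)
    SELECT d.KeyCol
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM dbo.available_for_use a WHERE a.KeyCol = d.KeyCol);
END;
GO

-- claim the smallest available key; deleting through a TOP (1) ... ORDER BY CTE
-- with UPDLOCK/READPAST keeps the read and the delete atomic, so two concurrent
-- callers can't grab the same value
WITH next_key AS (
    SELECT TOP (1) KeyCol
    FROM dbo.available_for_use WITH (UPDLOCK, READPAST, ROWLOCK)
    ORDER BY KeyCol
)
DELETE FROM next_key
OUTPUT deleted.KeyCol;

The 'top off' proc could simply reuse Joe's ROW_NUMBER() query above with a TOP (100000) (or whatever batch size suits you) to insert the next batch of unused values into available_for_use.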

With these proposed changes to the data model:

  • you eliminate a LOT of excessive cpu cycles [your DBA will thank you]
  • you eliminate ALL of those repetitive index scans and cache thrashing [your DBA will thank you]
  • your users no longer have to watch the spinning hour glass (though they may not like the loss of an excuse to step away from their desk)
  • there are plenty of ways to monitor the size of the available_for_use table to make sure you never run out of new values

Yes, the proposed available_for_use table is just a table of pre-generated 'next key' values; and yes, there's a potential for some contention when grabbing the 'next' value, but any contention a) is easily addressed through proper table/index/query design and b) is going to be minor/short-lived compared to the overhead/delays with the current idea of repeated, brute force, index searches.