How can I convert the first 100 million positive integers to strings?

Your solution runs for 35 seconds on my laptop. The following code takes 26 seconds (including creating and populating the temporary tables):

Temporary tables

DROP TABLE IF EXISTS #T1, #T2, #T3, #T4;

CREATE TABLE #T1 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T2 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T3 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T4 (string varchar(6) NOT NULL PRIMARY KEY);

INSERT #T1 (string)
VALUES
    ('A'), ('B'), ('C'), ('D'), ('E'), ('F'), ('G'),
    ('H'), ('I'), ('J'), ('K'), ('L'), ('M'), ('N'),
    ('O'), ('P'), ('Q'), ('R'), ('S'), ('T'), ('U'),
    ('V'), ('W'), ('X'), ('Y'), ('Z');

INSERT #T2 (string)
SELECT T1a.string + T1b.string
FROM #T1 AS T1a, #T1 AS T1b;

INSERT #T3 (string)
SELECT #T2.string + #T1.string
FROM #T2, #T1;

INSERT #T4 (string)
SELECT #T3.string + #T1.string
FROM #T3, #T1;

The idea there is to pre-populate ordered combinations of up to four characters.

Main code

SELECT TOP (100000000)
    UA.string + UA.string2
FROM
(
    SELECT U.Size, U.string, string2 = '' FROM 
    (
        SELECT Size = 1, string FROM #T1
        UNION ALL
        SELECT Size = 2, string FROM #T2
        UNION ALL
        SELECT Size = 3, string FROM #T3
        UNION ALL
        SELECT Size = 4, string FROM #T4
    ) AS U
    UNION ALL
    SELECT Size = 5, #T1.string, string2 = #T4.string
    FROM #T1, #T4
    UNION ALL
    SELECT Size = 6, #T2.string, #T4.string
    FROM #T2, #T4
) AS UA
ORDER BY 
    UA.Size, 
    UA.string, 
    UA.string2
OPTION (NO_PERFORMANCE_SPOOL, MAXDOP 1);

That is a simple order-preserving union* of the four precalculated tables, with 5-character and 6-character strings derived as needed. Separating the prefix from the suffix avoids sorting.

Execution plan

100 million rows

* There's nothing in the SQL above that specifies an order-preserving union directly. The optimizer chooses physical operators with properties that match the SQL query specification, including top-level order by. Here, it chooses concatenation implemented by the merge join physical operator to avoid sorting.

The guarantee is that the execution plan delivers the query semantic and top-level order by specification. Knowing that merge join concat preserves order allows the query writer to anticipate an execution plan, but the optimizer will only deliver if the expectation is valid.

I'll post an answer to get started. My first thought was that it should be possible to take advantage of the order-preserving nature of a nested loop join along with a few helper tables that have one row for each letter. The tricky part was going to be looping in such a way that the results were ordered by length as well as avoiding duplicates. For example, when cross joining a CTE that includes all 26 capital letters along with '', you can end up generating 'A' + '' + 'A' and '' + 'A' + 'A' which is of course the same string.

The first decision was where to store the helper data. I tried using a temp table but this had a surprisingly negative impact on performance, even though the data fit into a single page. The temp table contained the below data:

SELECT 'A'
UNION ALL SELECT 'B'
...
UNION ALL SELECT 'Y'
UNION ALL SELECT 'Z'

Compared to using a CTE, the query took 3X longer with a clustered table and 4X longer with a heap. I don't believe the problem is that the data is on disk. It should be read into memory as a single page and processed in memory for the entire plan. Perhaps SQL Server can work with data from a Constant Scan operator more efficiently than it can with data stored in typical rowstore pages.

Interestingly, SQL Server chooses to put the ordered results from a single page tempdb table with ordered data into a table spool:

bad spoool

SQL Server often puts results for the inner table of a cross join into a table spool, even if it seems nonsensical to do so. I think that the optimizer needs a little bit of work in this area. I ran the query with the NO_PERFORMANCE_SPOOL to avoid the performance hit.

One problem with using a CTE to store the helper data is that the data isn't guaranteed to be ordered. I can't think of why the optimizer would choose not to order it and in all of my tests the data was processed in the order that I wrote the CTE:

constant scan order

However, best not to take any chances, especially if there's a way to do it without a large performance overhead. It's possible to order the data in a derived table by adding a superfluous TOP operator. For example:

(SELECT TOP (26) CHR FROM FIRST_CHAR ORDER BY CHR)

That addition to the query should guarantee that results will be returned in the correct order. I expected all of the sorts to have a large negative performance impact. The query optimizer expected this as well based on the estimated costs:

expensive sorts

Very surprisingly, I could not observe any statistically significant difference in cpu time or runtime with or without explicit ordering. If anything, the query seemed to run faster with the ORDER BY! I have no explanation for this behavior.

The tricky part of the problem was to figure out how to insert blank characters into the right places. As mentioned before a simple CROSS JOIN would result in duplicate data. We know that the 100000000th string will have a length of six characters because:

26 + 26 ^2 + 26^3 + 26^4 + 26^5 = 914654 < 100000000

but

26 + 26 ^2 + 26^3 + 26^4 + 26^5 + 26 ^ 6 = 321272406 > 100000000

Therefore we only need to join to the letter CTE six times. Suppose that we join to the CTE six times, grab one letter from each CTE, and concatenate them all together. Suppose the leftmost letter is not blank. If any of the subsequent letters are blank that means that the string is less than six characters long so it is a duplicate. Therefore, we can prevent duplicates by finding the first non-blank character and requiring all characters after it also not be blank. I chose to track this by assigning a FLAG column to one of the CTEs and by adding a check to the WHERE clause. This should be more clear after looking at the query. The final query is as follows:

WITH FIRST_CHAR (CHR) AS
(
    SELECT 'A'
    UNION ALL SELECT 'B'
    UNION ALL SELECT 'C'
    UNION ALL SELECT 'D'
    UNION ALL SELECT 'E'
    UNION ALL SELECT 'F'
    UNION ALL SELECT 'G'
    UNION ALL SELECT 'H'
    UNION ALL SELECT 'I'
    UNION ALL SELECT 'J'
    UNION ALL SELECT 'K'
    UNION ALL SELECT 'L'
    UNION ALL SELECT 'M'
    UNION ALL SELECT 'N'
    UNION ALL SELECT 'O'
    UNION ALL SELECT 'P'
    UNION ALL SELECT 'Q'
    UNION ALL SELECT 'R'
    UNION ALL SELECT 'S'
    UNION ALL SELECT 'T'
    UNION ALL SELECT 'U'
    UNION ALL SELECT 'V'
    UNION ALL SELECT 'W'
    UNION ALL SELECT 'X'
    UNION ALL SELECT 'Y'
    UNION ALL SELECT 'Z'
)
, ALL_CHAR (CHR, FLAG) AS
(
    SELECT '', 0 CHR
    UNION ALL SELECT 'A', 1
    UNION ALL SELECT 'B', 1
    UNION ALL SELECT 'C', 1
    UNION ALL SELECT 'D', 1
    UNION ALL SELECT 'E', 1
    UNION ALL SELECT 'F', 1
    UNION ALL SELECT 'G', 1
    UNION ALL SELECT 'H', 1
    UNION ALL SELECT 'I', 1
    UNION ALL SELECT 'J', 1
    UNION ALL SELECT 'K', 1
    UNION ALL SELECT 'L', 1
    UNION ALL SELECT 'M', 1
    UNION ALL SELECT 'N', 1
    UNION ALL SELECT 'O', 1
    UNION ALL SELECT 'P', 1
    UNION ALL SELECT 'Q', 1
    UNION ALL SELECT 'R', 1
    UNION ALL SELECT 'S', 1
    UNION ALL SELECT 'T', 1
    UNION ALL SELECT 'U', 1
    UNION ALL SELECT 'V', 1
    UNION ALL SELECT 'W', 1
    UNION ALL SELECT 'X', 1
    UNION ALL SELECT 'Y', 1
    UNION ALL SELECT 'Z', 1
)
SELECT TOP (100000000)
d6.CHR + d5.CHR + d4.CHR + d3.CHR + d2.CHR + d1.CHR
FROM (SELECT TOP (27) FLAG, CHR FROM ALL_CHAR ORDER BY CHR) d6
CROSS JOIN (SELECT TOP (27) FLAG, CHR FROM ALL_CHAR ORDER BY CHR) d5
CROSS JOIN (SELECT TOP (27) FLAG, CHR FROM ALL_CHAR ORDER BY CHR) d4
CROSS JOIN (SELECT TOP (27) FLAG, CHR FROM ALL_CHAR ORDER BY CHR) d3
CROSS JOIN (SELECT TOP (27) FLAG, CHR FROM ALL_CHAR ORDER BY CHR) d2
CROSS JOIN (SELECT TOP (26) CHR FROM FIRST_CHAR ORDER BY CHR) d1
WHERE (d2.FLAG + d3.FLAG + d4.FLAG + d5.FLAG + d6.FLAG) =
    CASE 
    WHEN d6.FLAG = 1 THEN 5
    WHEN d5.FLAG = 1 THEN 4
    WHEN d4.FLAG = 1 THEN 3
    WHEN d3.FLAG = 1 THEN 2
    WHEN d2.FLAG = 1 THEN 1
    ELSE 0 END
OPTION (MAXDOP 1, FORCE ORDER, LOOP JOIN, NO_PERFORMANCE_SPOOL);

The CTEs are as described above. ALL_CHAR is joined to five times because it includes a row for a blank character. The final character in the string should never be blank so a separate CTE is defined for it, FIRST_CHAR. The extra flag column in ALL_CHAR is used to prevent duplicates as described above. There may be a more efficient way to do this check but there are definitely more inefficient ways to do it. One attempt by me with LEN() and POWER() made the query run six times slower than the current version.

The MAXDOP 1 and FORCE ORDER hints are essential to make sure that the order is preserved in the query. An annotated estimated plan might be helpful to see why the joins are in their current order:

annotated estimated

Query plans are often read right to left but row requests happen from left to right. Ideally, SQL Server will request exactly 100 million rows from the d1 constant scan operator. As you move from left to right I expect fewer rows to be requested from each operator. We can see this in the actual execution plan. Additionally, below is a screenshot from SQL Sentry Plan Explorer:

explorer

We got exactly 100 million rows from d1 which is a good thing. Note that the ratio of rows between d2 and d3 is almost exactly 27:1 (165336 * 27 = 4464072) which makes sense if you think about how the cross join will work. The ratio of rows between d1 and d2 is 22.4 which represents some wasted work. I believe the extra rows are from duplicates (due to the blank characters in the middle of the strings) which do not make it past the nested loop join operator that does the filtering.

The LOOP JOIN hint is technically unnecessary because a CROSS JOIN can only be implemented as a loop join in SQL Server. The NO_PERFORMANCE_SPOOL is to prevent the unnecessary table spooling. Omitting the spool hint made the query take 3X longer on my machine.

The final query has a cpu time of around 17 seconds and a total elapsed time of 18 seconds. That was when running the query through SSMS and discarding the result set. I am very interested in seeing other methods of generating the data.

I have a solution optimized to obtain the string code for any specific number up to 217,180,147,158 (8 chars). But I cannot beat your time:

On my machine, with SQL Server 2014, your query takes 18 seconds, while mine takes 3m 46s. Both queries use undocumented trace flag 8690 because 2014 does not support the NO_PERFORMANCE_SPOOL hint.

Here's the code:

/* precompute offsets and powers to simplify final query */
CREATE TABLE #ExponentsLookup (
    offset          BIGINT NOT NULL,
    offset_end      BIGINT NOT NULL,
    position        INTEGER NOT NULL,
    divisor         BIGINT NOT NULL,
    shifts          BIGINT NOT NULL,
    chars           INTEGER NOT NULL,
    PRIMARY KEY(offset, offset_end, position)
);

WITH base_26_multiples AS ( 
    SELECT  number  AS exponent,
            CAST(POWER(26.0, number) AS BIGINT) AS multiple
    FROM    master.dbo.spt_values
    WHERE   [type] = 'P'
            AND number < 8
),
num_offsets AS (
    SELECT  *,
            -- The maximum posible value is 217180147159 - 1
            LEAD(offset, 1, 217180147159) OVER(
                ORDER BY exponent
            ) AS offset_end
    FROM    (
                SELECT  exponent,
                        SUM(multiple) OVER(
                            ORDER BY exponent
                        ) AS offset
                FROM    base_26_multiples
            ) x
)
INSERT INTO #ExponentsLookup(offset, offset_end, position, divisor, shifts, chars)
SELECT  ofst.offset, ofst.offset_end,
        dgt.number AS position,
        CAST(POWER(26.0, dgt.number) AS BIGINT)     AS divisor,
        CAST(POWER(256.0, dgt.number) AS BIGINT)    AS shifts,
        ofst.exponent + 1                           AS chars
FROM    num_offsets ofst
        LEFT JOIN master.dbo.spt_values dgt --> as many rows as resulting chars in string
            ON [type] = 'P'
            AND dgt.number <= ofst.exponent;

/*  Test the cases in table example */
SELECT  /*  1.- Get the base 26 digit and then shift it to align it to 8 bit boundaries
            2.- Sum the resulting values
            3.- Bias the value with a reference that represent the string 'AAAAAAAA'
            4.- Take the required chars */
        ref.[row_number],
        REVERSE(SUBSTRING(REVERSE(CAST(SUM((((ref.[row_number] - ofst.offset) / ofst.divisor) % 26) * ofst.shifts) +
            CAST(CAST('AAAAAAAA' AS BINARY(8)) AS BIGINT) AS BINARY(8))),
            1, MAX(ofst.chars))) AS string
FROM    (
            VALUES(1),(2),(25),(26),(27),(28),(51),(52),(53),(54),
            (18278),(18279),(475253),(475254),(475255),
            (100000000), (CAST(217180147158 AS BIGINT))
        ) ref([row_number])
        LEFT JOIN #ExponentsLookup ofst
            ON ofst.offset <= ref.[row_number]
            AND ofst.offset_end > ref.[row_number]
GROUP BY
        ref.[row_number]
ORDER BY
        ref.[row_number];

/*  Test with huge set  */
WITH numbers AS (
    SELECT  TOP(100000000)
            ROW_NUMBER() OVER(
                ORDER BY x1.number
            ) AS [row_number]
    FROM    master.dbo.spt_values x1
            CROSS JOIN (SELECT number FROM master.dbo.spt_values WHERE [type] = 'P' AND number < 676) x2
            CROSS JOIN (SELECT number FROM master.dbo.spt_values WHERE [type] = 'P' AND number < 676) x3
    WHERE   x1.number < 219
)
SELECT  /*  1.- Get the base 26 digit and then shift it to align it to 8 bit boundaries
            2.- Sum the resulting values
            3.- Bias the value with a reference that represent the string 'AAAAAAAA'
            4.- Take the required chars */
        ref.[row_number],
        REVERSE(SUBSTRING(REVERSE(CAST(SUM((((ref.[row_number] - ofst.offset) / ofst.divisor) % 26) * ofst.shifts) +
            CAST(CAST('AAAAAAAA' AS BINARY(8)) AS BIGINT) AS BINARY(8))),
            1, MAX(ofst.chars))) AS string
FROM    numbers ref
        LEFT JOIN #ExponentsLookup ofst
            ON ofst.offset <= ref.[row_number]
            AND ofst.offset_end > ref.[row_number]
GROUP BY
        ref.[row_number]
ORDER BY
        ref.[row_number]
OPTION (QUERYTRACEON 8690);

The trick here is to precompute where the diferent permutations start:

When you have to output a single char, you have 26^1 permutations that start at 26^0.
When you have to output 2 chars you have 26^2 permutations that start at 26^0 + 26^1
When you have to output 3 chars you have 26^3 permutations that start at 26^0 + 26^1 + 26^2
repeat for n chars

The other trick used is to simply use sum to get to the right value instead of trying to concat. To achieve this I simply offset the digits from base 26 to base 256 and add the ascii value of 'A' for each digit. So we obtain the binary representation of the string we're looking for. After that some string manipulations complete the process.

How can I convert the first 100 million positive integers to strings?

Temporary tables

Main code

Execution plan

Tags:

Performance

Sql Server

Query Performance

Related

Recent Posts