What's the fastest way to generate a 1 GB text file containing random digits?

This:

 LC_ALL=C tr '\0-\377' \
             '[0*25][1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][x*]' \
    < /dev/urandom |
    tr -d x |
    fold -w 1 |
    paste -sd "$(printf '%99s\\n')" - |
    head -c1G

(assuming a head implementation that supports -c) appears to be reasonably fast on my system.

tr translates the whole byte range (0 to 255, 0 to 0377 in octal): the 25 first bytes as 0, the 25 next ones as 1... then 25 9 the rest (250 to 255) to "x" which we then discard (with tr -d x) as we want a uniform distribution (assuming /dev/urandom has a uniform distribution itself) and so not give a bias to some digits.

That produces one digit for 97% of the bytes of /dev/urandom. fold -w 1 makes it one digit per line. paste -s is called with a list of separators that consists on 99 space characters and one newline character, so to have 100 space separated digits on each line.

head -c1G will get the first GiB (230) of that. Note that the last line will be truncated and undelimited. You could truncate to 230-1 and add the missing newline by hand, or truncate to 109 bytes instead which is 50 million of those 200 byte lines (head -n 50000000 would also make it a standard/portable command).

These timings (obtained by zsh on a quad-core system), give an indication of where the CPU time is spent:

LC_ALL=C tr '\0-\377'  < /dev/urandom  0.61s user 31.28s system 99% cpu 31.904 total
tr -d x  1.00s user 0.27s system 3% cpu 31.903 total
fold -w 1  14.93s user 0.48s system 48% cpu 31.902 total
paste -sd "$(printf '%99s\\n')" -  7.23s user 0.08s system 22% cpu 31.899 total
head -c1G > /dev/null  0.49s user 1.21s system 5% cpu 31.898 total

The first tr is the bottle neck, most of the time spent in the kernel (I suppose for the random number generation). The timing is roughly in line with the rate I can get bytes from /dev/uramdom (about 19MiB/s and here we produce 2 bytes for each 0.97 byte of /dev/urandom at a rate of 32MiB/s). fold seems to be spending an unreasonable amount of CPU time (15s) just to insert a newline character after every byte but that doesn't affect the overall time as it works on a different CPU in my case (adding the -b option makes it very slightly more efficient, dd cbs=1 conv=unblock seems like a better alternative).

You can do away with the head -c1G and shave off a few seconds by setting a limit on the file size (limit filesize 1024m with zsh or ulimit -f "$((1024*1024))" with most other shells (including zsh)) instead in a subshell.

That could be improved if we extracted 2 digits for each byte, but we would need a different approach for that. The above is very efficient because tr just looks up each byte in a 256 byte array. It can't do that for 2 bytes at a time, and using things like hexdump -e '1/1 "%02u"' that computes the text representation of a byte using more complex algorithms would be more expensive than the random number generation itself. Still, if like in my case, you have CPU cores whose time to spare, it may still manage to shave off a few seconds:

With:

< /dev/urandom LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -n250000000 -ve '500/1 "%02u" "\n"' |
  fold -w1 |
  paste -sd "$(printf '%99s\\n')" - > /dev/null

I get (note however that here it's 1,000,000,000 bytes as opposed to 1,073,741,824):

LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' < /dev/urandom  0.32s user 18.83s system 70% cpu 27.001 total
tr -d x  2.17s user 0.09s system 8% cpu 27.000 total
hexdump -n250000000 -ve '500/1 "%02u" "\n"'  26.79s user 0.17s system 99% cpu 27.000 total
fold -w1  14.42s user 0.67s system 55% cpu 27.000 total
paste -sd "$(printf '%99s\\n')" - > /dev/null  8.00s user 0.23s system 30% cpu 26.998 total

More CPU time overall, but better distributed between my 4 CPU cores, so it ends up taking less wall-clock time. The bottleneck is now hexdump.

If we use dd instead of the line-based fold, we can actually reduce the amount of work hexdump needs to do and improve the balance of work between CPUs:

< /dev/urandom LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -ve '"%02u"' |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

(here assuming GNU dd for its iflag=fullblock and status=none) which gives:

LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' < /dev/urandom  0.32s user 15.58s system 99% cpu 15.915 total
tr -d x  1.62s user 0.16s system 11% cpu 15.914 total
hexdump -ve '"%02u"'  10.90s user 0.32s system 70% cpu 15.911 total
dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock  5.44s user 0.19s system 35% cpu 15.909 total
paste -sd "$(printf '%99s\\n')" - > /dev/null  5.50s user 0.30s system 36% cpu 15.905 total

Back to the random-number generation being the bottleneck.

Now, as pointed out by @OleTange, if you have the openssl utility, you could use it to get a faster (especially on processors that have AES instructions) pseudo-random generator of bytes.

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom

on my system spews 15 times as many bytes per second than /dev/urandom. (I can't comment on how it compares in terms of cryptographically secure source of randomness if that applies to your use case).

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2> /dev/null | 
  LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -ve '"%02u"' |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

Now gives:

openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom < /dev/zero 2>   1.13s user 0.16s system 12% cpu 10.174 total
LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]'  0.56s user 0.20s system 7% cpu 10.173 total
tr -d x  2.50s user 0.10s system 25% cpu 10.172 total
hexdump -ve '"%02u"'  9.96s user 0.19s system 99% cpu 10.172 total
dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock  4.38s user 0.20s system 45% cpu 10.171 total
paste -sd "$(printf '%99s\\n')" - > /dev/null

back to hexdump being the bottleneck.

As I still have CPUs to spare, I can run 3 of those hexdump in parallel.

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2> /dev/null | 
  LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  (hexdump -ve '"%02u"' <&3 & hexdump -ve '"%02u"' <&3 & hexdump -ve '"%02u"') 3<&0 |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

(the <&3 is needed for shells other than zsh that close commands' stdin on /dev/null when run in background).

Now down to 6.2 seconds and my CPUs almost fully utilised.


This is partially a tongue-in-cheek answer, because of the title of the question.

When you look for "the fastest way to ...", the answer is almost always some specialized tool. This "answers" shows one such tool, just so you can experiment.

This is not a serious answer, because you should not look into specialized tools for jobs you only do once, or very rarely. You see, you'll end up spending more time looking for tools and learning about them, than actually doing stuff. Shells and utilities like bash and awk are not the fastest, but you can usually write a one-liner to achieve the job, spending only seconds. Better scripting languages like perl can also be used, although the learning curve for perl is steep, and I hesitate to recommend it for such purposes, because I've been traumatized by awful perl projects. python on the other hand is slightly handicapped by its rather slow I/O; it is only an issue when you filter or generate gigabytes of data, though.

In any case, the following C89 example program (which uses POSIX.1 for higher accuracy clock only if available) should achieve about 100 MB/s generation rate (tested in Linux on a laptop with an Intel i5-4200U processor, piping the output to /dev/null), using a pretty good pseudo-random number generator. (The output should pass all the BigCrunch tests, except the MatrixRank test, as the code uses xorshift64* and the exclusion method to avoid biasing the digits.)

decimal-digits.c:

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

/* This program is licensed under the CC0 license,
       https://creativecommons.org/publicdomain/zero/1.0/
   In other words, this is dedicated to the public domain.
   There are no warranties either, so if something breaks,
   you only have yourself to blame.
*/

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    for (line = 0UL; line < lines; line++) {
        fputc('0' + digit(), stdout);
        for (col = 1UL; col < cols; col++) {
            fputc(' ', stdout);
            fputc('0' + digit(), stdout);
        }
        fputc('\n', stdout);

        /* Check for write errors. */
        if (ferror(stdout))
            return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

We can make it a lot faster, if we switch to a line buffer, and fwrite() it once instead of outputting each digit at a time. Note that we still keep the stream fully buffered, to avoid partial (non-power-of-two) writes if the output is a block device.

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    char         *oneline;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    /* Allocate memory for a full line. */
    oneline = malloc((size_t)(2 * cols + 1));
    if (!oneline) {
        fprintf(stderr, "Not enough memory for %lu column buffer.\n", cols);
        return EXIT_FAILURE;
    }

    /* Set spaces and terminating newline. */
    for (col = 0; col < cols; col++)
        oneline[2*col + 1] = ' ';
    oneline[2*cols-1] = '\n';

    /* Not needed, but in case a code modification treats it as a string. */
    oneline[2*cols] = '\0';

    for (line = 0UL; line < lines; line++) {
        for (col = 0UL; col < cols; col++)
            oneline[2*col] = digit();

        if (fwrite(oneline, 2*cols, 1, stdout) != 1)
            return EXIT_FAILURE; 
    }

    /* Check for write errors. */
    if (ferror(stdout))
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

Note: both examples edited on 2016-11-18 to ensure uniform distribution of digits (zero is excluded; see e.g. here for comparison and details on various pseudo-random number generators).

Compile using for example

gcc -Wall -O2 decimal-digits.c -o decimal-digits

and optionally install system-wide to /usr/bin using

sudo install -o root -g root -m 0755 decimal-digits /usr/bin

It takes the number of digits per line, and the number of lines. Because 1000000000 / 100 / 2 = 5000000 (five million; total bytes divided by columns divided by 2), you can use

./decimal-digits 100 5000000 > digits.txt

to generate the gigabyte-sized digits.txt as desired by the OP.

Note that the program itself is written more with readability than efficiency in mind. My intent here is not to showcase the efficiency of the code — I'd use POSIX.1 and low-level I/O anyway, rather than generic C interfaces — but to let you easily see what kind of balance there is with effort spent in developing dedicated tools versus their performance, compared to one-liners or short shell or awk scriptlets.

Using the GNU C library, calling the fputc() function for every character output incurs a very small overhead (of an indirect function call, or conditionals -- the FILE interface is actually pretty complex and versatile, you see). On this particular Intel Core i5-4200U laptop, redirecting the output to /dev/null, the first (fputc) version takes about 11 seconds, whereas the line-at-a-time version takes just 1.3 seconds.

I happen to often write such programs and generators only because I like to play with huge datasets. I'm weird that way. For example, I once wrote a program to print all finite positive IEEE-754 floating-point values into a text file, with sufficient precision to yield the exact same value when parsed. The file was a few gigabytes in size (perhaps 4G or so); there are not that many finite positive floats as one might think. I used this to compare implementations that read and parse such data.

For normal use cases, like the OP is having, shell scripts and scriptlets and one-liners are the better approach. Less time spent to accomplish the overall task. (Except if they need a different file every day or so, or there are many people who need a different file, in which — rare — case, a dedicated tool like above, might warrant the effort spent.)


If you have shuf available (recent GNU coreutils does) you can do this:

time shuf -r -n $((512*1024*1024)) -i 0-9 | paste -sd "$(printf '%99s\\n')" -

On my VM, this is now a bit slower than Stéphane's answer by about a 3:4 factor.

Tags:

Random

Files