Generate an integer that is not among four billion given ones

Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.

If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.

If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.

Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have be hit less than 2¹⁶ times. Do a second pass to find of which of the possible numbers in that bucket are used already.

If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.

If "integer" means mathematical integer: Read through the input once and keep track of the ~~largest number~~ length of the longest number you've ever seen. When you're done, output ~~the maximum plus one~~ a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).

A detailed discussion on this problem has been discussed in Jon Bentley "Column 1. Cracking the Oyster" Programming Pearls Addison-Wesley pp.3-10

Bentley discusses several approaches, including external sort, Merge Sort using several external files etc., But the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :) Coming to the problem, 4 billion numbers can be represented in :

4 billion bits = (4000000000 / 8) bytes = about 0.466 GB

The code to implement the bitset is simple: (taken from solutions page )

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];

void set(int i) {        a[i>>SHIFT] |=  (1<<(i & MASK)); }
void clr(int i) {        a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int  test(int i){ return a[i>>SHIFT] &   (1<<(i & MASK)); }

Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.

If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e memory to handle 8 numbers ) was available and the range was from 0 to 31, we divide this into ranges of 0 to 7, 8-15, 16-22 and so on and handle this range in each of 32/8 = 4 passes.

HTH.

Generate an integer that is not among four billion given ones

Tags:

Algorithm

File

Search

Out Of Memory

Memory Limit

Related

Recent Posts