Search for bitstring most unlike a set of bitstrings

This question is in part algorithmic (what's the best algorithm to get to the solution) and in part a Python question (on what parts of Python to use to then efficiently implement that best algorithm).

On the algorithm: you define the max distance for a bit string to a set of (same size) bit strings to be the largest number of bits the target bit string differs on with any of the strings in the set. The goal of the algorithm is to find a new bit string with the same length as the strings in the set that has the lowest max distance.

It is assumed all the starting bit strings are different (as it is defined as a set, not a list). The distance you're computing is known as the Hamming distance, so you're looking for a new bit string with minimum Hamming distance to a starting set of strings.

Generating all possible bit strings of the right length and calculating the max distance to each starting string is brute forcing the problem, which can be optimised(*) using backtracking (discarding a result as soon as the lowest current max is exceeded for a candidate bit string).

(*: for people looking to correct my spelling, please consider the fact that I'm using UK English, not US English - feel free to propose improvements with that in mind)

However, the problem can also be viewed as follows.

For bit strings of length 1, the entire space of strings has only two options, {'0', '1'}. This can be visualised as '0' and '1' sitting at either end of a line segment of length 1, both a distance of 1 away from each other.

For bit strings of length 2, the entire space of strings has 4 options, namely the bit representations of 0 through 3 {'00', '01', '10', '11'} 0 is distance 1 away from 1 and 2, both of which are distance 1 away from 3. When visualised, they form the four corners of a square, none of them ever more than 2 steps from any other.

For bit strings of length 3, the entire space has 8 options, namely the bit representations of 0 through 7. When visualised, the form the 8 corners of a cube, none of them ever more than 3 steps from any other.

This pattern continues (into a 4D hypercube, 5D, etc.) and finding the answer to the problem effectively translates into: given a set of corners on one of these graphs, find the point with the lowest maximum distance to any of them.

An algorithm to find such a point, given a graph like that would be to:

Start with a list of the points in a set by themselves; if there's only one, that's the trivial answer.
Set the current distance to 1.
For all of the sets, add to it any point only one step away from points already in the set.
Intersect all of the resulting sets; if the intersection is not empty, these are all of the points that are the current distance (or less) away from the starting set of points; otherwise, increase the current distance by 1 and go to step 3.

This could probably be optimised further by keeping track of points visited as they are added to sets (for long bit strings), to avoid adding the same points over and over, quickly slowing down the given algorithm. I.e. instead of turning {'001'} into {'001', '101', '011', '000'}, you could go from [{'001'}] to [{'001'}, {'101', '011', '000'}] - the union of the sets still gets you all of the points reachable within 1 or fewer steps, but the next step in the series would be easier to compute, by finding all points that are one step farther out, but excluding points in the previous direction.

Finding points one step out is actually quite simple, if you turn the strings into the numbers the represent and calculate the exclusive bit-wise or of the numbers with all of the single '1'-bit numbers with the same bit string length, i.e. to find all points one step away from '001', you can xor 1 with 4, 2 and 1, yielding {5, 3, 0}, matching the correct points.

Putting all of that together in a tight bit of Python (without the optimisation for longer strings):

def closest(strings):
    if len(strings) == 1:
        return next(iter(strings))

    size = len(next(iter(strings)))
    points = [{int(s, 2)} for s in strings]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        points = [{n ^ p for p in powers for n in nums} | nums for nums in points]
        intersection = set.intersection(*points)
        if len(intersection) > 0:
            return d, {f"{n:b}".zfill(size) for n in intersection}


print(closest({'1000', '0001', '0011'}))

Note that closest returns the actual distance and all of the optimal answers, not just one. Output:

(2, {'0000', '0010', '1001', '0001', '1011'})

Adding the discussed optimisation to closest:

def closest_optimised(strings):
    if len(strings) == 1:
        return next(iter(strings))

    size = len(next(iter(strings)))
    points = [({int(s, 2)}, {int(s, 2)}) for s in strings]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        new_points = [{n ^ p for p in powers for n in rp} - ap for ap, rp in points]
        points = [(ap | np, np) for (ap, _), np in zip(points, new_points)]
        intersection = set.intersection(*[ap for ap, _ in points])
        if len(intersection) > 0:
            return d, {f"{n:b}".zfill(size) for n in intersection}

Note that running this through a profiler has the optimised code running in about half the time on average for these settings:

from random import randint

s = 10
x = 500
numbers = [randint(0, 2**s-1) for _ in range(x)]
number_strings = {f"{n:b}".zfill(s) for n in numbers}
print(number_strings)
print(closest_optimised(number_strings))
print(closest(number_strings))

Edit: Out of curiosity, I ran my example against the original given in the question and found that it frequently returns a far from optimal result. I haven't tried to find out why, but I figured it bears mentioning.

Someone pointed out that the OP may actually want the point with the maximum Hamming distance to all provided bit strings. With a similar approach:

def farthest(strings):
    s = next(iter(strings))
    size = len(s)
    if len(strings) == 1:
        return ''.join(['0' if c == '1' else '1' for c in s])

    all_visited = {int(s, 2) for s in strings}
    visited = [set(), all_visited]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        visited.append({n ^ p for p in powers for n in visited[-1]} - all_visited)
        all_visited = all_visited | visited[-1]
        if len(all_visited) == 2**size:
            return d, {f"{n:b}".zfill(size) for n in visited[-1]}

Search for bitstring most unlike a set of bitstrings

Tags:

Python

Search

Python 3.X

Bit

Hamming Distance

Related

Recent Posts