Given two sequences, find the maximal overlap between ending of one and beginning of the other

You can utilize the z algorithm, a linear time (O(n)) algorithm that:

Given a string S of length n, the Z Algorithm produces an array Z where Z[i] is the length of the longest substring starting from S[i] which is also a prefix of S

You need to concatenate your arrays (b+a) and run the algorithm on the resulting constructed array till the first i such that Z[i]+i == m+n.

For example, for a = [1, 2, 3, 6, 2, 3] & b = [2, 3, 6, 2, 1, 0], the concatenation would be [2, 3, 6, 2, 1, 0, 1, 2, 3, 6, 2, 3] which would yield Z[10] = 2 fulfilling Z[i] + i = 12 = m + n.

For O(n) time/space complexity, the trick is to evaluate hashes for each subsequence. Consider the array b:

[b1 b2 b3 ... bn]

Using Horner's method, you can evaluate all the possible hashes for each subsequence. Pick a base value B (bigger than any value in both of your arrays):

from b1 to b1 = b1 * B^1
from b1 to b2 = b1 * B^1 + b2 * B^2
from b1 to b3 = b1 * B^1 + b2 * B^2 + b3 * B^3
...
from b1 to bn = b1 * B^1 + b2 * B^2 + b3 * B^3 + ... + bn * B^n

Note that you can evaluate each sequence in O(1) time, using the result of the previous sequence, hence all the job costs O(n).

Now you have an array Hb = [h(b1), h(b2), ... , h(bn)], where Hb[i] is the hash from b1 until bi.

Do the same thing for the array a, but with a little trick:

from an to an   =  (an   * B^1)
from an-1 to an =  (an-1 * B^1) + (an * B^2)
from an-2 to an =  (an-2 * B^1) + (an-1 * B^2) + (an * B^3)
...
from a1 to an   =  (a1   * B^1) + (a2 * B^2)   + (a3 * B^3) + ... + (an * B^n)

You must note that, when you step from one sequence to another, you multiply the whole previous sequence by B and add the new value multiplied by B. For example:

from an to an =    (an   * B^1)

for the next sequence, multiply the previous by B: (an * B^1) * B = (an * B^2)
now sum with the new value multiplied by B: (an-1 * B^1) + (an * B^2) 
hence:

from an-1 to an =  (an-1 * B^1) + (an * B^2)

Now you have an array Ha = [h(an), h(an-1), ... , h(a1)], where Ha[i] is the hash from ai until an.

Now, you can compare Ha[d] == Hb[d] for all d values from n to 1, if they match, you have your answer.

ATTENTION: this is a hash method, the values can be large and you may have to use a fast exponentiation method and modular arithmetics, which may (hardly) give you collisions, making this method not totally safe. A good practice is to pick a base B as a really big prime number (at least bigger than the biggest value in your arrays). You should also be careful as the limits of the numbers may overflow at each step, so you'll have to use (modulo K) in each operation (where K can be a prime bigger than B).

This means that two different sequences might have the same hash, but two equal sequences will always have the same hash.

This can indeed be done in linear time, O(n), and O(n) extra space. I will assume the input arrays are character strings, but this is not essential.

A naive method would -- after matching k characters that are equal -- find a character that does not match, and go back k-1 units in a, reset the index in b, and then start the matching process from there. This clearly represents a O(n²) worst case.

To avoid this backtracking process, we can observe that going back is not useful if we have not encountered the b[0] character while scanning the last k-1 characters. If we did find that character, then backtracking to that position would only be useful, if in that k sized substring we had a periodic repetition.

For instance, if we look at substring "abcabc" somewhere in a, and b is "abcabd", and we find that the final character of b does not match, we must consider that a successful match might start at the second "a" in the substring, and we should move our current index in b back accordingly before continuing the comparison.

The idea is then to do some preprocessing based on string b to log back-references in b that are useful to check when there is a mismatch. So for instance, if b is "acaacaacd", we could identify these 0-based backreferences (put below each character):

index: 0 1 2 3 4 5 6 7 8
b:     a c a a c a a c d
ref:   0 0 0 1 0 0 1 0 5

For example, if we have a equal to "acaacaaca" the first mismatch happens on the final character. The above information then tells the algorithm to go back in b to index 5, since "acaac" is common. And then with only changing the current index in b we can continue the matching at the current index of a. In this example the match of the final character then succeeds.

With this we can optimise the search and make sure that the index in a can always progress forwards.

Here is an implementation of that idea in JavaScript, using the most basic syntax of that language only:

function overlapCount(a, b) {
    // Deal with cases where the strings differ in length
    let startA = 0;
    if (a.length > b.length) startA = a.length - b.length;
    let endB = b.length;
    if (a.length < b.length) endB = a.length;
    // Create a back-reference for each index
    //   that should be followed in case of a mismatch.
    //   We only need B to make these references:
    let map = Array(endB);
    let k = 0; // Index that lags behind j
    map[0] = 0;
    for (let j = 1; j < endB; j++) {
        if (b[j] == b[k]) {
            map[j] = map[k]; // skip over the same character (optional optimisation)
        } else {
            map[j] = k;
        }
        while (k > 0 && b[j] != b[k]) k = map[k]; 
        if (b[j] == b[k]) k++;
    }
    // Phase 2: use these references while iterating over A
    k = 0;
    for (let i = startA; i < a.length; i++) {
        while (k > 0 && a[i] != b[k]) k = map[k];
        if (a[i] == b[k]) k++;
    }
    return k;
}

console.log(overlapCount("ababaaaabaabab", "abaababaaz")); // 7

Although there are nested while loops, these do not have more iterations in total than n. This is because the value of k strictly decreases in the while body, and cannot become negative. This can only happen when k++ was executed that many times to give enough room for such decreases. So all in all, there cannot be more executions of the while body than there are k++ executions, and the latter is clearly O(n).

To complete, here you can find the same code as above, but in an interactive snippet: you can input your own strings and see the result interactively:

function overlapCount(a, b) {
    // Deal with cases where the strings differ in length
    let startA = 0;
    if (a.length > b.length) startA = a.length - b.length;
    let endB = b.length;
    if (a.length < b.length) endB = a.length;
    // Create a back-reference for each index
    //   that should be followed in case of a mismatch.
    //   We only need B to make these references:
    let map = Array(endB);
    let k = 0; // Index that lags behind j
    map[0] = 0;
    for (let j = 1; j < endB; j++) {
        if (b[j] == b[k]) {
            map[j] = map[k]; // skip over the same character (optional optimisation)
        } else {
            map[j] = k;
        }
        while (k > 0 && b[j] != b[k]) k = map[k]; 
        if (b[j] == b[k]) k++;
    }
    // Phase 2: use these references while iterating over A
    k = 0;
    for (let i = startA; i < a.length; i++) {
        while (k > 0 && a[i] != b[k]) k = map[k];
        if (a[i] == b[k]) k++;
    }
    return k;
}

// I/O handling

let [inputA, inputB] = document.querySelectorAll("input");
let output = document.querySelector("pre");

function refresh() {
    let a = inputA.value;
    let b = inputB.value;
    let count = overlapCount(a, b);
    let padding = a.length - count;
    // Apply some HTML formatting to highlight the overlap:
    if (count) {
        a = a.slice(0, -count) + "<b>" + a.slice(-count) + "</b>";
        b = "<b>" + b.slice(0, count) + "</b>" + b.slice(count);
    }
    output.innerHTML = count + " overlapping characters:\n" + 
                         a + "\n" + 
                         " ".repeat(padding) + b;
}

document.addEventListener("input", refresh);
refresh();

body { font-family: monospace }
b { background:yellow }
input { width: 90% }

a: <input value="acacaacaa"><br>
b: <input value="acaacaacd"><br>
<pre></pre>

Given two sequences, find the maximal overlap between ending of one and beginning of the other

Tags:

Algorithm

Sequence

Related

Recent Posts