Max number of unique substrings from a partition

This is known as the collision-aware string partition problem and was shown to be NP-complete by a reduction from 3-SAT by Anne Condon, Ján Maňuch and Chris Thachuk in "Complexity of a collision-aware string partition problem and its relation to oligo design for gene synthesis" (International Computing and Combinatorics Conference, pp. 265-275, 2008).

You can use a recursive function with a set as a second parameter to keep track of the unique strings seen along the current path. In each call, iterate over all split indices from 1 to the length of the string to obtain a candidate prefix; if that candidate is not yet in the set, make a recursive call with the remaining string and the candidate added to the set, add 1 to the result, and return the maximum over all iterations. Return 0 if the given string is empty or every candidate prefix is already in the set:

def max_unique_substrings(s, seen=()):
    # seen holds the factors already used on the current path
    maximum = 0
    for i in range(1, len(s) + 1):
        candidate = s[:i]  # prefix of length i as the next factor
        if candidate not in seen:
            # use the candidate and recurse on the rest of the string
            maximum = max(maximum, 1 + max_unique_substrings(s[i:], {candidate, *seen}))
    return maximum

Demo: https://repl.it/@blhsing/PriceyScalySphere
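
For example, one maximal split of "ababccc" is "a", "b", "ab", "c", "cc":

print(max_unique_substrings("ababccc"))  # 5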

In Python 3.8+, the above logic can also be written as a single call to the max function over a generator expression, using an assignment expression to filter out candidates that have already been "seen":

def max_unique_substrings(s, seen=()):
    return max(
        (1 + max_unique_substrings(s[i:], {candidate, *seen})
         for i in range(1, len(s) + 1)
         if (candidate := s[:i]) not in seen),
        default=0,
    )

(Many thanks to Gilad Barkan (גלעד ברקן) for making me aware of this discussion.)

Let me share my thoughts about this problem from a purely theoretical point of view (note that I also use "factor" instead of "subword").

I think a sufficiently formal definition of the problem (or problems) considered here is the following:

Given a word w, find words u_1, u_2, ..., u_k such that

  • u_i != u_j for every i, j with 1 <= i < j <= k and
  • u_1 u_2 ... u_k = w

Maximisation variant (we want many u_i): maximise k

Minimisation variant (we want short u_i): minimise max{|u_i| : 1 <= i <= k}

These problems become decision problems by additionally giving a bound B, which, depending on whether we are talking about the "many factors" variant or the "short factors" variant, is a lower bound on k (we want at least B factors) or an upper bound on max{|u_i| : 1 <= i <= k} (we want factors of length at most B), respectively. For talking about NP-hardness, we need to talk about decision problems.
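
To make the two decision variants concrete, here is a minimal brute-force sketch (exponential time, purely for illustration; the function names are mine and not from the literature):

def factorisations(w):
    # yield every way of splitting w into a sequence of non-empty factors
    if not w:
        yield ()
        return
    for i in range(1, len(w) + 1):
        for rest in factorisations(w[i:]):
            yield (w[:i],) + rest

def mf_decision(w, B):
    # "many factors": is there an equality-free factorisation with at least B factors?
    return any(len(set(f)) == len(f) and len(f) >= B for f in factorisations(w))

def sf_decision(w, B):
    # "short factors": is there an equality-free factorisation with factors of length at most B?
    return any(len(set(f)) == len(f) and max(map(len, f), default=0) <= B
               for f in factorisations(w))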

Let's use the terms SF for the "short factors" variant and MF for the "many factors" variant. In particular, and this is a really crucial point, the problems are defined in such a way that we get a word over some alphabet that is not in any way restricted. The problem version where we know a priori that we only get input words over, say, the alphabet {a, b, c, d} is a different problem! NP-hardness does not automatically carry over from the "unrestricted" to the "fixed alphabet" variant (the latter might be simpler).

Both SF and MF are NP-complete problems. This has been shown in [1, 1b] and [2], respectively (as Gilad has pointed out already). If I understand the (maybe too informal) problem definition at the beginning of this discussion correctly, then the problem of this discussion is exactly the problem MF. It is initially not mentioned that the words are restricted to come from some fixed alphabet; later it is said that we can assume that only lower-case letters are used. If this means that we only consider words over the fixed alphabet {a, b, c, ..., z}, then this would actually change a lot in terms of NP-hardness.

A closer look reveals some differences in complexity of SF and MF:

  1. Paper [1, 1b] shows that SF remains NP-complete if we fix the alphabet to a binary one (more precisely: given a word w over the letters a and b and a bound B, can we factorise it into distinct factors of length at most B?).
  2. Paper [1, 1b] shows that SF remains NP-complete if we fix the bound B = 2 (more precisely: given a word w, can we factorise it into distinct factors of length at most 2?).
  3. Paper [3] shows that if both the alphabet and the bound B are fixed, then SF can be solved in polynomial time.
  4. Paper [2] shows that MF is NP-complete, but only if the alphabet is not restricted or fixed a priori! In particular, it does not answer the question whether the problem is NP-complete if we only consider input words over some fixed alphabet (as is usually the case in practical settings).
  5. Paper [3] shows that MF can be solved in polynomial time if the input bound B is upper-bounded by some constant, i.e., the problem input is a word and a bound B from {1, 2, ..., K}, where K is some fixed constant.

Some comments on these results: W.r.t. (1) and (2), it is intuitively clear that if the alphabet is binary, then the bound B cannot be fixed as well if the problem SF is to be difficult. Conversely, fixing B = 2 means that the alphabet size must get rather large in order to produce difficult instances. As a consequence, (3) is rather trivial (in fact, [3] says slightly more: we can then solve it in running time that is not only polynomial, but |w|^2 times a factor that only depends on the alphabet size and the bound B). (5) is not difficult either: if our word is long in comparison to B, then we can get the desired factorisation by simply splitting it into factors of different lengths. If not, then we can brute-force all possibilities, which is exponential only in B, which in this case is assumed to be a constant.
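
The "long word" case of (5) can be made concrete with a small sketch of mine (assuming |w| >= B(B+1)/2): factors of pairwise different lengths are automatically pairwise distinct, so cutting off prefixes of lengths 1, 2, ..., B-1 leaves a remainder of length at least B, and the B resulting factors are all different:

def split_into_distinct_lengths(w, B):
    # assumes len(w) >= B * (B + 1) // 2, so the remainder has length >= B
    assert len(w) >= B * (B + 1) // 2
    factors, pos = [], 0
    for length in range(1, B):  # prefixes of lengths 1, 2, ..., B - 1
        factors.append(w[pos:pos + length])
        pos += length
    factors.append(w[pos:])  # remainder: longer than every previous factor
    return factors  # B factors of pairwise different lengths, hence pairwise distinct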

So the picture we have is the following: SF seems the more difficult problem, because we have hardness even for fixed alphabets or for a fixed bound B. The problem MF, on the other hand, becomes poly-time solvable if the bound is fixed (in this regard it is easier than SF), while the corresponding question w.r.t. the alphabet size is open. So MF is slightly less complex than SF, even if it turns out that MF for fixed alphabets is also NP-complete. However, if it can be shown that MF can be solved for fixed alphabets in poly-time, then MF would be much easier than SF, because the one case for which it is hard is somewhat artificial (unbounded alphabet!).

I did put some effort into trying to resolve the case of MF with bounded alphabet, but I was not able to settle it and have since stopped working on it. I do not believe that other researchers have tried very hard to solve it (so this is not one of those very hard open problems that many people have already tried and failed to solve; I consider it somewhat doable). My guess would be that it is also NP-hard for fixed alphabets, but maybe the reduction would be so complicated that you would get something like "MF is hard for alphabets of size 35 or larger", which would not be super nice either.

Regarding further literature, I know of the paper [4], which considers the problem of splitting a word w into distinct factors u_1, u_2, ..., u_k that are all palindromes; this is also NP-complete.
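
For illustration, the brute-force sketch from above extends to the problem of [4] by additionally checking the palindrome property (again my own sketch, reusing factorisations from the earlier snippet):

def diverse_palindromic(w):
    # is there a factorisation of w into pairwise distinct palindromes?
    return any(len(set(f)) == len(f) and all(u == u[::-1] for u in f)
               for f in factorisations(w))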

I had a quick look at paper [5], pointed out by Gilad. It seems to consider a different setting, though. In this paper, the authors are interested in the combinatorial question of how many distinct subsequences or subwords can be contained in a given word, but these can overlap. For example, aaabaab contains 19 different non-empty subwords: a, b, aa, ab, ba, aaa, aab, aba, baa, aaab, aaba, abaa, baab, aaaba, aabaa, abaab, aaabaa, aabaab, aaabaab. Some of them have only one occurrence, like baa, some of them several, like aa. In any case, the question there is not how we can somehow split the word in order to get many distinct factors, since in a factorisation each individual symbol contributes to exactly one factor.
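
The count is easy to double-check mechanically (a throwaway snippet of mine):

subwords = {"aaabaab"[i:j] for i in range(7) for j in range(i + 1, 8)}
print(len(subwords))  # 19 distinct non-empty subwords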

Regarding practical solutions to this kind of problem (keep in mind that I am a theoretician, so take this with a grain of salt):

  • To my knowledge, there are no theoretical lower bounds (like NP-hardness) that would rule out solving MF in polynomial time if we consider only input words over a fixed alphabet. There is one caveat, though: if you get a poly-time algorithm, then its running time should be exponential in the number of symbols of the fixed alphabet (or exponential in some function of that)! Otherwise it would also be a polynomial-time algorithm for the case of unbounded alphabets. So, being a theoretician, I would be looking for algorithmic tasks that can be computed in time exponential only in the number of symbols and that somehow help to devise an algorithm for MF. On the other hand, it is likely that such an algorithm does not exist and MF is also NP-hard in the fixed-alphabet case.

  • If you are interested in practical solutions, it might be helpful to approximate the solution. Getting factorisations that, in the worst case, are guaranteed to be at least half as large as the optimum would not be too bad.

  • Heuristics that do not give a provable approximation ratio but work well in a practical setting would also be interesting, I guess.

  • Transforming the problem instances into SAT or ILP instances should not be too difficult, and then you could run a SAT or ILP solver to obtain even optimal solutions (see the ILP sketch after this list).

  • My personal opinion is that even though it is not known whether the fixed-alphabet case of MF is NP-hard, there are enough theoretical insights suggesting that the problem is hard enough to justify looking for heuristic solutions etc. that work well in a practical setting.
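
To sketch the ILP route for MF mentioned in the list above (my own formulation; it uses the third-party PuLP library merely as an example solver interface, and all identifiers are mine): introduce a binary variable for every possible factor w[i:j], force the chosen factors to form a chain of cuts from position 0 to position n via flow constraints, allow every distinct word to be used at most once, and maximise the number of chosen factors:

import pulp  # third-party; pip install pulp

def max_unique_factors_ilp(w):
    n = len(w)
    prob = pulp.LpProblem("many_factors", pulp.LpMaximize)
    # x[i, j] == 1 iff w[i:j] is used as a factor
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in range(n) for j in range(i + 1, n + 1)}
    prob += pulp.lpSum(x.values())  # maximise the number of factors
    # exactly one factor starts at position 0 ...
    prob += pulp.lpSum(x[0, j] for j in range(1, n + 1)) == 1
    # ... and whenever a factor ends at an inner position k, another one starts there
    for k in range(1, n):
        prob += (pulp.lpSum(x[i, k] for i in range(k)) ==
                 pulp.lpSum(x[k, j] for j in range(k + 1, n + 1)))
    # equality-freeness: every distinct word may be used at most once
    occurrences = {}
    for (i, j), var in x.items():
        occurrences.setdefault(w[i:j], []).append(var)
    for vars_of_word in occurrences.values():
        prob += pulp.lpSum(vars_of_word) <= 1
    prob.solve()
    return int(pulp.value(prob.objective))

On "ababccc" this should agree with the recursive solution from the beginning of this discussion (value 5).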


Bibliography:

[1] Anne Condon, Ján Maňuch, Chris Thachuk: The complexity of string partitioning. J. Discrete Algorithms 32: 24-43 (2015)

[1b] Anne Condon, Ján Maňuch, Chris Thachuk: Complexity of a Collision-Aware String Partition Problem and Its Relation to Oligo Design for Gene Synthesis. COCOON 2008: 265-275

[2] Henning Fernau, Florin Manea, Robert Mercas, Markus L. Schmid: Pattern Matching with Variables: Fast Algorithms and New Hardness Results. STACS 2015: 302-315

[3] Markus L. Schmid: Computing equality-free and repetitive string factorisations. Theor. Comput. Sci. 618: 42-51 (2016)

[4] Hideo Bannai, Travis Gagie, Shunsuke Inenaga, Juha Kärkkäinen, Dominik Kempa, Marcin Piatkowski, Shiho Sugimoto: Diverse Palindromic Factorization is NP-Complete. Int. J. Found. Comput. Sci. 29(2): 143-164 (2018)

[5] Abraham Flaxman, Aram Wettroth Harrow, Gregory B. Sorkin: Strings with Maximally Many Distinct Subsequences and Substrings. Electr. J. Comb. 11(1) (2004)