Does the fraction of distinct substrings in prefixes of the Thue–Morse sequence of length $2^n$ tend to $73/96$?

The conjecture should follow easily from the results of [1]. In particular, Proposition 4.2 of Brlek's paper gives the exact number $P(n,m)$ of factors of length $m$ of $p_n$ (for non-empty factors only). More usefully, he also gives a table of the small values of $P(n,m)$. Here it is, with a column $m = 0$ added for the empty word:

$$\begin{array}{c|ccccccccccccccccccccccc}
n \backslash m & 0& 1 & 2 & 3 & 4 & 5 &6 & 7 & 8 & 9 & 10 & 11 & 12 &13 &14 &15 &16 &17 &18 &19 &20 &21 \\ \hline
1&1&2&1\\
2&1&2&\mathbf{3}&2&1\\
3&1&2&4&\mathbf{6}&5&4&3&2&1\\
4&1&2&4&6&10&\mathbf{12}&11&10&9&8&7&6&5&4&3&2&1\\
5&1&2&4&6&10&12&16&20&22&\mathbf{24}&23&22&21&20&19&18&17&16&15&14&13&12& \dotsm\\
6&1&2&4&6&10&12&16&20&22&24&28&32&36&40&42&44&46&\mathbf{48}&47&46&45&44& \dotsm
\end{array}$$

As you can see, there are two types of coefficients in this table. Starting from the coefficient in bold in position $(k, 2^{k-2} + 1)$ for $k \geqslant 2$ (that is, $\mathbf{3}$, $\mathbf{6}$, $\mathbf{12}$, $\mathbf{24}$, $\mathbf{48}$, etc.), the coefficients decrease by $1$ at each step along the row. Thus it is easy to sum these coefficients.

The other coefficients, apart from those for the first few values of $m$, also follow a regular pattern. One has $P(n,m) = P(n-1,m)$ for $m \leqslant 2^{n-3}$. Then the coefficients from $P(n, 2^{n-3} + 1)$ to $P(n, 2^{n-3} + 2^{n-4} + 1)$ form an arithmetic progression with common difference $4$ (see $24, 28, 32, 36, 40$ in row $n = 6$), and the coefficients from $P(n, 2^{n-3} + 2^{n-4} + 1)$ to $P(n, 2^{n-2} + 1)$ form an arithmetic progression with common difference $2$ (see $40,42,44,46,48$ in row $n = 6$).
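The rows of the table can be reproduced by brute force, which makes the two patterns easy to inspect for further values of $n$. This is only a minimal sketch: the function names `thue_morse_prefix` and `factor_counts` are illustrative choices, not taken from [1], and $p_n$ is assumed to be built by starting from `0` and appending the bitwise complement $n$ times.

```python
def thue_morse_prefix(n):
    """Return p_n, the prefix of length 2**n of the Thue-Morse word, as a string.

    Assumption: p_0 = "0" and p_{k+1} is p_k followed by its bitwise complement.
    """
    p = "0"
    for _ in range(n):
        p += p.translate(str.maketrans("01", "10"))
    return p


def factor_counts(word):
    """Return a list whose entry m is the number of distinct factors of length m.

    Entry 0 is 1, because the empty word is counted once.
    """
    length = len(word)
    return [
        len({word[i:i + m] for i in range(length - m + 1)})
        for m in range(length + 1)
    ]


# Print the first 22 columns (m = 0 .. 21) of rows n = 1 .. 6.
for n in range(1, 7):
    print(n, factor_counts(thue_morse_prefix(n))[:22])
```

Raising the upper bound of the loop shows more rows; the cost grows roughly like $4^n$ substrings per row, so this is only practical for small $n$.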

I am too lazy to carry out the complete computation but, with these observations in hand, it should not be too difficult to sum the coefficients in each row to get the value of ${\cal D}_n$.

[1] S. Brlek, Enumeration of factors in the Thue-Morse word, Discrete Applied Math. 24 (1989), 83-96.


J.-E. Pin's answer describes the following facts in detail; they come from Proposition 4.2 of Enumeration of factors in the Thue-Morse word by Srećko Brlek.

Formulas for $P(n,m)$. Let $P(n,m)$, also written $P_n(m)$, be the number of distinct substrings of length $m$ of $p_n$, $0\le m\le2^n$; in particular $P_n(0)=1$. For $n\le3$ and $m\ge1$ the values are

$$\begin{array}{c|cccccccc}
P_n(m)& m=1 & m=2 & m=3 & m=4 & m=5 &m=6 &m=7 &m=8\\ \hline
n=1&2&1\\
n=2&2&3&2&1\\
n=3&2&4&6&5&4&3&2&1\\
\end{array}$$

If $n\ge4$, then

$$P_n(m)=\begin{cases}
P_{n-1}(m)\quad &\text{ for } m\le2^{n-3}+1,\\
4(m-1)-2^{n-3}\quad &\text{ for } 2^{n-3}+1\le m\le 2^{n-3} + 2^{n-4}+1,\\
2^{n-2}+2(m-1)\quad &\text{ for } 2^{n-3} + 2^{n-4}+1\le m\le 2^{n-2}+1,\\
2^{n} -(m-1)\quad &\text{ for } 2^{n-2}+1\le m.\\
\end{cases}$$
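As a worked check of the case formula, take $n=5$, where the three thresholds are $2^{2}+1=5$, $2^{2}+2^{1}+1=7$ and $2^{3}+1=9$. The values below are computed directly from the cases and can be compared with row $n=5$ of Brlek's table:

$$\begin{array}{l|l|l}
\text{range of } m & \text{formula} & \text{values}\\ \hline
1\le m\le 5 & P_4(m) & 2,\ 4,\ 6,\ 10,\ 12\\
5\le m\le 7 & 4(m-1)-4 & 12,\ 16,\ 20\\
7\le m\le 9 & 8+2(m-1) & 20,\ 22,\ 24\\
9\le m\le 32 & 32-(m-1) & 24,\ 23,\ \dots,\ 1
\end{array}$$

The ranges overlap at $m=5$, $7$ and $9$, and the two applicable formulas give the same value there ($12$, $20$ and $24$ respectively), so the cases are consistent.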

As defined in the question, $\mathscr D_{n} = \sum_{m=0}^{2^n}P(n,m)$.

Proposition: $\mathscr D_{n} = \dfrac{73\cdot 4^{n-3} + 11}{3}$ for $n\ge3$.
Proof: Let $\mathscr C_{n}=\sum_{m=0}^{2^{n-2}}P(n,m)$. Let us prove $\mathscr C_n=\dfrac{38\cdot4^{n-3}-9\cdot2^{n-2}+22}6$ for $n\ge3$ by induction on $n$.

The base case, $\mathscr C_3=1+2+4=7$, can be verified directly.

Suppose the formula holds for some $n\ge3$. Since $n+1\ge4$, the formulas for $P_{n+1}(m)$ above apply, and we get

$$\begin{align}
\mathscr C_{n+1} &= \sum_{m=0}^{2^{n-1}}P(n+1,m) \\
&= \sum_{m=0}^{2^{n-2}}P(n+1,m)\ +\sum_{m=2^{n-2}+1}^{2^{n-2}+2^{n-3}}P(n+1,m) \ +\sum_{m=2^{n-2}+2^{n-3}+1}^{2^{n-1}}P(n+1,m) \\
&= \sum_{m=0}^{2^{n-2}}P(n,m)\ +\sum_{m=2^{n-2}+1}^{2^{n-2}+2^{n-3}}\left(4(m-1)-2^{n-2}\right)\ +\sum_{m=2^{n-2}+2^{n-3}+1}^{2^{n-1}} \left(2^{n-1}+2(m-1)\right) \\
&=\mathscr C_n+2^{n-3}(-2^{n-2}) +2^{n-3}\cdot2^{n-1}\ +\sum_{m=2^{n-2}+1}^{2^{n-2}+2^{n-3}}4(m-1)\ +\sum_{m=2^{n-2}+2^{n-3}+1}^{2^{n-1}} 2(m-1) \\
&= \mathscr C_n+2^{2n-5} +4\cdot2^{n-3}(2^{n-1}+2^{n-3}-1)/2+2\cdot2^{n-3}(2^{n-1}+2^{n-2}+2^{n-3}-1)/2\\
&= \dfrac{38\cdot4^{n-3}-9\cdot2^{n-2}+22}6+19\cdot4^{n-3} -3\cdot2^{n-3}\\
&= \dfrac{38\cdot4^{n-2}-9\cdot2^{n-1}+22}6.\\
\end{align}$$

This proves the formula for $\mathscr C_n$. Therefore

$$\begin{align}
\mathscr D_{n} &=\mathscr C_{n} +\sum_{m=2^{n-2}+1}^{2^{n}}P_{n}(m) \\
&= \dfrac{38\cdot4^{n-3}-9\cdot2^{n-2}+22}6 + \sum_{m=2^{n-2}+1}^{2^n}\left(2^n-(m-1)\right)\\
&= \dfrac{38\cdot4^{n-3}-9\cdot2^{n-2}+22}6 + (2^n-2^{n-2})(2^{n+1}-2^{n-2}-(2^n-1))/2\\
&= \frac{73\cdot 4^{n-3} + 11}{3}. \quad \blacksquare
\end{align}$$
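Both closed forms can be sanity-checked numerically for small $n$. The sketch below is not part of the proof. It assumes $p_n$ is built by repeated complement-and-append starting from `0`, and the helper names (`thue_morse_prefix`, `factor_counts`, `check_formulas`) are made up for this illustration. The comparison is done in integer arithmetic to avoid rounding.

```python
def thue_morse_prefix(n):
    """Return p_n (length 2**n): start from "0" and append the complement n times."""
    flip = str.maketrans("01", "10")
    p = "0"
    for _ in range(n):
        p += p.translate(flip)
    return p


def factor_counts(word):
    """counts[m] = number of distinct substrings of length m (counts[0] == 1)."""
    length = len(word)
    return [
        len({word[i:i + m] for i in range(length - m + 1)})
        for m in range(length + 1)
    ]


def check_formulas(n):
    """Return (C_n, D_n, C_n matches its closed form, D_n matches its closed form)."""
    counts = factor_counts(thue_morse_prefix(n))
    c_n = sum(counts[: 2 ** (n - 2) + 1])  # lengths m = 0 .. 2^(n-2)
    d_n = sum(counts)                      # lengths m = 0 .. 2^n
    c_ok = 6 * c_n == 38 * 4 ** (n - 3) - 9 * 2 ** (n - 2) + 22
    d_ok = 3 * d_n == 73 * 4 ** (n - 3) + 11
    return c_n, d_n, c_ok, d_ok


for n in range(3, 9):
    print(n, check_formulas(n))
```

For example, `check_formulas(4)` returns `(23, 101, True, True)`, matching $1+2+4+6+10=23$ and the row sum $101$ of row $n=4$ in the table.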


As user125932 points out in a comment, the formula for $\mathscr D_n$ appears as Theorem 14 of On the structure of compacted subword graphs of Thue-Morse words and their applications by Jakub Radoszewski and Wojciech Rytter.

Theorem 14. The number of different factors of $p_n$ for $n\ge4$ equals $\frac{73}{192} |p_n|^2 + \frac83$.

Here "factors" means non-empty substrings, whereas the empty string is counted in $\mathscr D_n$. Note that $|p_n|=2^n$ and $\frac{704}{192}=\frac83+1$, so the two formulas agree.
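Spelled out, removing the empty string from $\mathscr D_n$ and using $4^{n-3}=4^n/64$ and $|p_n|^2=4^n$ gives

$$\mathscr D_n-1=\frac{73\cdot4^{n-3}+11}{3}-1=\frac{73\cdot4^{n-3}+8}{3}=\frac{73}{192}\cdot4^{n}+\frac83=\frac{73}{192}\,|p_n|^2+\frac83 ,$$

which is the expression in Theorem 14.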


This setup can be generalized. Given a string $w$ of $0$s and $1$s, define the sequence ${}_wP$ by ${}_wp_0=w$ and by letting ${}_wp_{n+1}$ be ${}_wp_{n}$ followed by its bitwise complement.

  • The Thue-Morse sequence $p_0, p_1, p_2,\cdots$ is just the sequence ${}_{0}P$.
  • For example, the sequence ${}_{00}P$ is $00, 00\underline{11}, 00\,\underline{11}\,\underline{1100}, \cdots$.
  • As another example, the sequence ${}_{01011}P$ is $01011, 01011\,\underline{10100}, 01011\,\underline{10100}\,\underline{1010001011}, \cdots$.

Let ${}_w\mathscr D_n$ be the number of distinct substrings of ${}_wp_n$. This question and its answers give the formula for ${}_0\mathscr D_n$. The following formulas also appear to hold; it might be interesting to prove them and to generalize them further.

$$\begin{align} {}_{00}\mathscr D_{n}&=\frac{73\cdot4^{n-2}+11}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{000}\mathscr D_{n}&=219\cdot4^{n-3}+1\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{001}\mathscr D_{n}&=219\cdot4^{n-3}+9\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{010}\mathscr D_{n}&=219\cdot4^{n-3}-23\color{#d0d0d0}{,\ \text{for}\,\,n\ge4}\\ {}_{0001}\mathscr D_{n}&=\frac{73\cdot4^{n-1}+41}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0100}\mathscr D_{n}&=\frac{73\cdot4^{n-1}+41}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0101}\mathscr D_{n}&=\frac{73\cdot4^{n-1}-13}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{01000}\mathscr D_{n}&=\frac{1825\cdot4^{n-3}+59}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{01011}\mathscr D_{n}&=\frac{1825\cdot4^{n-3}+59}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{010001}\mathscr D_{n}&=219\cdot4^{n-2}+35\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0000001}\mathscr D_{n}&=\frac{3577\cdot4^{n-3}+107}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{01010101}\mathscr D_{n}&=\frac{73\cdot4^{n}-157}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{011001111}\mathscr D_{n}&=1971\cdot4^{n-3}+81\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0010011100}\mathscr D_{n}&=\frac{1825\cdot4^{n-2}+323}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{01011010000}\mathscr D_{n}&=\frac{8833\cdot4^{n-3}+371}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{011111100000}\mathscr D_{n}&=219\cdot4^{n-1}+27\color{#d0d0d0}{,\ \text{for}\,\,n\ge2}\\ {}_{0101010101010}\mathscr D_{n}&=\frac{12337\cdot4^{n-3}-2389}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge4}\\ {}_{01010101010111}\mathscr D_{n}&=\frac{3577\cdot4^{n-2}+401}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{010101000101111}\mathscr D_{n}&=5475\cdot4^{n-3}+231\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0000010000001111}\mathscr D_{n}&=\frac{73\cdot4^{n+1}+791}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{010110011101010001}\mathscr 
D_{n}&=1971\cdot4^{n-2}+381\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0101010101010101010}\mathscr D_{n}&=\frac{26353\cdot4^{n-3}-5317}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge4}\\ {}_{0101010101010101111}\mathscr D_{n}&=\frac{26353\cdot4^{n-3}+731}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{001001001001001001001}\mathscr D_{n}&=10731\cdot4^{n-3}-351\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ {}_{0001011000101100010110001011}\mathscr D_{n}&=\frac{3577\cdot4^{n-1}-1021}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge2}\\ {}_{0101010101010101010101010101010101010101010101010}\mathscr D_{n}&=\frac{175273\cdot4^{n-3}-37237}{3}\color{#d0d0d0}{,\ \text{for}\,\,n\ge4}\\ {}_{000000000000000000000000000000000000000000000000000000001}\mathscr D_{n}&=79059\cdot4^{n-3}+2169\color{#d0d0d0}{,\ \text{for}\,\,n\ge3}\\ \end{align}$$
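Any of the formulas above can be spot-checked by brute force for small $n$. The following is a minimal sketch under two assumptions: the empty string is counted in ${}_w\mathscr D_n$ (so that ${}_0\mathscr D_n=\mathscr D_n$), and the names `seeded_prefix`, `count_distinct_substrings` and `conjectured` are illustrative only. The two entries of `conjectured` are copied from the list above and are printed next to the computed counts for comparison, without asserting that they match.

```python
def complement(word):
    """Bitwise complement of a string of 0s and 1s."""
    return word.translate(str.maketrans("01", "10"))


def seeded_prefix(w, n):
    """Return the n-th term of the sequence that starts at w and doubles by complement."""
    p = w
    for _ in range(n):
        p += complement(p)
    return p


def count_distinct_substrings(word):
    """Number of distinct substrings of word, with the empty string counted once."""
    seen = {""}
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            seen.add(word[i:j])
    return len(seen)


# Conjectured closed forms copied from the list above (both stated for n >= 3).
conjectured = {
    "00": lambda n: (73 * 4 ** (n - 2) + 11) // 3,
    "000": lambda n: 219 * 4 ** (n - 3) + 1,
}

for w, formula in conjectured.items():
    for n in range(3, 6):
        computed = count_distinct_substrings(seeded_prefix(w, n))
        print(w, n, "computed:", computed, "conjectured:", formula(n))
```

The set-based count stores every substring, so time and memory grow roughly like the cube of the word length; for long seeds or larger $n$ a suffix automaton or suffix array would be the usual replacement.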