Here is how to find the expected value in the given problem.

Note that $E[X+Y]=E[X]+E[Y]$ holds in full generality, even if $X$ and $Y$ are not independent: proofs of linearity of expectation nowhere assume independence of $X$ and $Y$. In other words, you do not need to impose that restriction.
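To see this concretely, here is a minimal sketch (the two variables are my own toy example, not from the question): $X$ is an indicator of heads on a fair coin and $Y = 1 - X$, so the two are as dependent as possible, yet linearity still holds exactly.

```python
from fractions import Fraction

# X = 1 on heads, Y = 1 - X: maximally dependent indicators on one
# fair coin flip.  Linearity of expectation holds regardless.
outcomes = [0, 1]          # equally likely values of X
p = Fraction(1, 2)         # probability of each outcome

E_X = sum(p * x for x in outcomes)
E_Y = sum(p * (1 - x) for x in outcomes)
E_sum = sum(p * (x + (1 - x)) for x in outcomes)

assert E_sum == E_X + E_Y   # 1 == 1/2 + 1/2
```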


To simplify the subscripts, I'm going to consider just the first few letters the monkey types. Add a variable $k$ to all my subscripts if you want to apply the reasoning at an arbitrary point in the string.

It is true that $E(X_1) = 26^{-5}$ and also that $E(X_2) = 26^{-5}$. We just have to cycle through the $26^6$ equally-likely possibilities for the first six letters the monkey types. Of these, $26$ are strings of the form "proof_" and another $26$ are of the form "_proof", where the blank is filled in with some letter from a to z.
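The same counting argument can be checked by exact enumeration on a toy instance. The small parameters here (alphabet $\{a,b,c\}$, target word "ab") are my own choice purely to keep the enumeration feasible; the computation for 26 letters and "proof" is identical in structure.

```python
from itertools import product
from fractions import Fraction

# Toy version of the counting argument: alphabet {a, b, c} and target
# word "ab" in place of 26 letters and "proof".
alphabet = "abc"
word = "ab"
L = len(word)

# All equally likely strings of length L + 1 (the analogue of the
# 26^6 possibilities for the first six letters).
strings = ["".join(t) for t in product(alphabet, repeat=L + 1)]
p = Fraction(1, len(strings))

E_X1 = sum(p for s in strings if s[:L] == word)        # word at position 1
E_X2 = sum(p for s in strings if s[1:L + 1] == word)   # word at position 2

# Both equal the unconditional probability |alphabet|^(-L).
assert E_X1 == E_X2 == Fraction(1, len(alphabet) ** L)
```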

What you've observed is that when we condition the expectation of $X_2$ on the value of $X_1$, we get a result that is different from the ordinary (not conditioned) expectation. Specifically, $E(X_2 \mid X_1 = 1) = 0$, which is less than $26^{-5}$.

Since the method you used applies a theorem of probability whose hypotheses are satisfied by the assumptions of your question (in particular, the expectation of each $X_i$ exists), there must be something else going on that somehow "balances out" the fact that the observation $X_1 = 1$ lowers the expected value of $X_2$.

And in fact there is something else going on.

We only observe $X_1 = 1$ once, on average, for every $26^5$ times we let a monkey type a million letters. The other $26^5 - 1$ times (on average) that we do this, we observe $X_1 = 0$. In just $26$ of those $26^5 - 1$ times (on average), the first five letters typed by the monkey will be a string of the form "_proo", where the blank is filled by a letter from a to z. In those cases, there is a $1/26$ probability (conditioned on the observed data) that the sixth letter will be f and that $X_2$ will be $1$; that is, $$P(X_2 = 1 \mid \text{letters 2 through 5 are "proo"}) = 1/26.$$ In the other $26^5 - 27$ cases, there is zero probability (conditioned on the observed data) that $X_2 = 1$.

Let $A$ be the event that the first five letters have the form "_proo", and $B$ the event that the first five letters are neither "proof" nor anything of the form "_proo". Then the expectation of $X_2$ conditioned on the observation $X_1 = 0$ is
\begin{align}
E(X_2 \mid X_1 = 0) &= 0 \cdot P(X_2 = 0 \mid A) \, P(A \mid X_1 = 0) \\
& \qquad {} + 1 \cdot P(X_2 = 1 \mid A) \, P(A \mid X_1 = 0) \\
& \qquad {} + 0 \cdot P(X_2 = 0 \mid B) \, P(B \mid X_1 = 0) \\
& \qquad {} + 1 \cdot P(X_2 = 1 \mid B) \, P(B \mid X_1 = 0) \\
&= P(X_2 = 1 \mid A) \, P(A \mid X_1 = 0) + P(X_2 = 1 \mid B) \, P(B \mid X_1 = 0) \\
&= \frac{1}{26} \left( \frac{26}{26^5 - 1} \right) + 0 \cdot P(B \mid X_1 = 0) \\
&= \frac{1}{26^5 - 1}
\end{align}
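The identity $E(X_2 \mid X_1 = 0) = 1/(26^5 - 1)$ can be verified exactly on a scaled-down instance. Again the alphabet $\{a,b,c\}$ and word "ab" are my own toy parameters; the conditional expectation should then come out to $1/(3^2 - 1) = 1/8$.

```python
from itertools import product
from fractions import Fraction

# Check E(X_2 | X_1 = 0) = 1 / (s^L - 1) on a toy instance:
# alphabet of size s = 3, word "ab" of length L = 2.
alphabet = "abc"
word = "ab"
L = len(word)

strings = ["".join(t) for t in product(alphabet, repeat=L + 1)]
cond = [s for s in strings if s[:L] != word]         # X_1 = 0
hits = [s for s in cond if s[1:L + 1] == word]       # X_2 = 1 as well

E_cond = Fraction(len(hits), len(cond))
assert E_cond == Fraction(1, len(alphabet) ** L - 1)  # = 1/8
```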

This is ever so slightly greater than the unconditional expectation, $26^{-5}$. In fact it is just large enough so that \begin{align} E(X_2) &= E(X_2 \mid X_1 = 0)\, P(X_1 = 0) + E(X_2 \mid X_1 = 1)\, P(X_1 = 1) \\ &= \frac{1}{26^5 - 1} \left( \frac{26^5 - 1}{26^5} \right) + 0 \cdot P(X_1 = 1) \\ &= 26^{-5}. \end{align}
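This balancing computation can be reproduced in exact rational arithmetic with the actual parameters from the question (26 letters, word length 5):

```python
from fractions import Fraction

# Law of total expectation for X_2, with N = 26^5.
N = 26 ** 5

E_given_0 = Fraction(1, N - 1)    # E(X_2 | X_1 = 0), computed above
E_given_1 = Fraction(0)           # X_1 = 1 forces X_2 = 0
P_0 = Fraction(N - 1, N)          # P(X_1 = 0)
P_1 = Fraction(1, N)              # P(X_1 = 1)

E_X2 = E_given_0 * P_0 + E_given_1 * P_1
assert E_X2 == Fraction(1, N)     # recovers the unconditional 26^-5
```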

In summary, the fact that $X_1 = 1$ forces $X_2 = 0$ is balanced by the fact that $X_1 = 0$ gives a tiny boost to the probability that $X_2 = 1$.

You could do a similar analysis for the effect of $X_1$ on $X_3$, $X_4$, and $X_5$.

The total effect is that while an occurrence of "proof" at one position rules out occurrences at several nearby positions, each place where "proof" does not occur makes "proof" a little more likely than usual to occur in nearby positions. For example, "proof" at position $1$ rules out "proof" at position $5$, but it raises the probability from $26^{-5}$ to $26^{-4}$ that the string at position $5$ will be "fproo", which in turn gives a comparatively high probability ($1/26$) that "proof" will occur at position $6$.
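A toy analogue of this last effect can also be enumerated exactly (alphabet $\{a,b,c\}$ and word "ab" are again my own small parameters): "ab" at position $1$ rules out "ab" at position $2$, but it triples the chance that positions $2$-$3$ read "ba", which then leaves only one more letter needed for "ab" at position $3$.

```python
from itertools import product
from fractions import Fraction

# "ab" at position 1 boosts the chance that positions 2-3 read "ba",
# the analogue of "proof" at position 1 boosting "fproo" at position 5.
alphabet = "abc"
strings = ["".join(t) for t in product(alphabet, repeat=3)]

given = [s for s in strings if s[:2] == "ab"]   # word at position 1
ba = [s for s in given if s[1:3] == "ba"]       # positions 2-3 are "ba"

# Unconditionally P(positions 2-3 = "ba") = 3^-2 = 1/9; conditioned on
# the word at position 1, it rises to 3^-1 = 1/3.
assert Fraction(len(ba), len(given)) == Fraction(1, 3)
```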