Expected number of times random substring occurs inside of larger random string

You miscounted by $1$: A string of length $m$ fits into a string of length $n$ once, not $0$ times, so it should be $(n-m+1)$.

The problem is essentially the same if the letter probabilities differ. By linearity of expectation, you still just multiply $n-m+1$ by the probability of a match at a single position. If you have $k$ letters $a_1$ through $a_k$ that occur with probabilities $p_1$ through $p_k$ (independently at each position), the expected number of matches is

$$ (n-m+1)\left(\sum_{i=1}^kp_i^2\right)^m\;. $$
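A quick Monte Carlo check of this formula (a sketch; the probability vector `p = [0.5, 0.3, 0.2]` is just an illustrative choice, not from the question):

```python
import random

def expected_matches(n, m, p):
    """Closed-form expectation: (n - m + 1) * (sum_i p_i^2)^m."""
    return (n - m + 1) * sum(q * q for q in p) ** m

def simulate(n, m, p, trials=200_000, seed=0):
    """Monte Carlo estimate of the expected number of occurrences
    of a random length-m word inside an independent random length-n word,
    both drawn letter-by-letter from the distribution p."""
    rng = random.Random(seed)
    letters = range(len(p))
    total = 0
    for _ in range(trials):
        x = tuple(rng.choices(letters, weights=p, k=m))
        y = rng.choices(letters, weights=p, k=n)
        total += sum(tuple(y[i:i + m]) == x for i in range(n - m + 1))
    return total / trials

p = [0.5, 0.3, 0.2]                # hypothetical letter probabilities
print(expected_matches(10, 3, p))  # → 0.438976  (= 8 * 0.38**3)
print(simulate(10, 3, p))          # empirical estimate, close to the above
```

The simulated average should agree with the closed form to within the usual $O(1/\sqrt{\text{trials}})$ sampling noise.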


This is too long for a comment to the answer by @joriki, but it's intended to fill in some details for those commenters who questioned whether the sum of squared probabilities is correct (it is).

For an alphabet of $k$ letters $\{a_1\ldots a_k\}$, we're given two random words on that alphabet: $X_1\ldots X_m$ and $Y_1\ldots Y_n,\ \ m<n$, and all of these random variables are i.i.d. Then the required expectation is found as follows, where $[...]$ are Iverson brackets:

$$E\bigg(\text{number of occurrences of }X_1\ldots X_m\text{ in }Y_1\ldots Y_n\bigg)\\ \begin{align} &= E\bigg(\sum_{i=1}^{n-m+1}[Y_i\ldots Y_{i+m-1}=X_1\ldots X_m]\bigg)\\ &= \sum_{i=1}^{n-m+1}P(Y_i\ldots Y_{i+m-1}=X_1\ldots X_m)\\ &=(n-m+1)\ P(Y_1\ldots Y_m=X_1\ldots X_m)\\ &=(n-m+1)\ P\bigg(\bigcap_{i=1}^m \{Y_i=X_i\}\bigg)\\ &=(n-m+1)\prod_{i=1}^m P(Y_i=X_i)\\ &=(n-m+1)\big( P(Y_1=X_1)\big)^m\\ &=(n-m+1)\bigg( \sum_{i=1}^k P\big(\{Y_1=a_i\}\cap\{X_1=a_i\}\big)\bigg)^m\\ &=(n-m+1)\bigg( \sum_{i=1}^k P(Y_1=a_i)P(X_1=a_i)\bigg)^m\\ &=(n-m+1)\bigg( \sum_{i=1}^k p_i^2\bigg)^m \end{align} $$
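The derivation can also be confirmed exactly, with no sampling error, by summing over all pairs of words for small $n$, $m$ (a sketch; the distribution $p = (1/2, 1/3, 1/6)$ and the sizes $n=4$, $m=2$ are arbitrary illustrative choices):

```python
from itertools import product
from fractions import Fraction

def exact_expectation(n, m, p):
    """Brute-force E[#occurrences] by enumerating every pair (X, Y)
    and weighting the occurrence count by the pair's probability."""
    k = len(p)
    total = Fraction(0)
    for x in product(range(k), repeat=m):
        px = Fraction(1)
        for c in x:
            px *= p[c]
        for y in product(range(k), repeat=n):
            py = Fraction(1)
            for c in y:
                py *= p[c]
            count = sum(y[i:i + m] == x for i in range(n - m + 1))
            total += px * py * count
    return total

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]  # hypothetical distribution
n, m = 4, 2
formula = (n - m + 1) * sum(q * q for q in p) ** m
print(exact_expectation(n, m, p) == formula)  # → True  (both equal 49/108)
```

Exact rational arithmetic via `Fraction` makes the agreement an identity rather than an approximation.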

If the shorter word were fixed instead of random, say $x_1\ldots x_m$, then a similar derivation gives

$$E\bigg(\text{number of occurrences of }x_1\ldots x_m\text{ in }Y_1\ldots Y_n\bigg)\\ \begin{align} &=\qquad ...\\ &=(n-m+1)\prod_{i=1}^m P(Y_i=x_i)\\ &=(n-m+1)\prod_{i=1}^m p(x_i)\\ \end{align} $$

where $p()$ is the probability mass function for the given distribution on the finite alphabet.
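The fixed-word case can be checked the same way (a sketch; the alphabet, the word `'ab'`, and the probabilities are illustrative assumptions):

```python
import random

def expected_fixed(n, x, p):
    """Closed form for a fixed word x: (n - m + 1) * prod_i p(x_i)."""
    m = len(x)
    prob = 1.0
    for c in x:
        prob *= p[c]
    return (n - m + 1) * prob

def simulate_fixed(n, x, p, trials=200_000, seed=1):
    """Monte Carlo estimate: count occurrences of the fixed word x
    in a random length-n word drawn letter-by-letter from p."""
    rng = random.Random(seed)
    letters = list(p)
    weights = [p[c] for c in letters]
    m = len(x)
    target = tuple(x)
    total = 0
    for _ in range(trials):
        y = rng.choices(letters, weights=weights, k=n)
        total += sum(tuple(y[i:i + m]) == target for i in range(n - m + 1))
    return total / trials

p = {'a': 0.5, 'b': 0.3, 'c': 0.2}  # hypothetical p() on the alphabet
print(expected_fixed(10, 'ab', p))  # → 1.35  (= 9 * 0.5 * 0.3)
print(simulate_fixed(10, 'ab', p))  # empirical estimate, close to the above
```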
