Birthday problem: expected number of collisions

The probability person $B$ shares person $A$'s birthday is $1/N$, where $N$ is the number of equally possible birthdays,

so the probability $B$ does not share person $A$'s birthday is $1-1/N$,

so the probability $n-1$ other people do not share $A$'s birthday is $(1-1/N)^{n-1}$,

so the expected number of people who do not have others sharing their birthday is $n(1-1/N)^{n-1}$,

so the expected number of people who share birthdays with somebody is $n\left(1-(1-1/N)^{n-1}\right)$.
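As a sanity check on this chain of reasoning, here is a minimal simulation sketch in Python. The function names, the trial count and the seed are my own choices for illustration, and the simulation assumes birthdays are independent and uniform over the $N$ days, as in the argument above.

```python
import random
from collections import Counter

def expected_sharers(n, N):
    """Closed form from the argument above: n * (1 - (1 - 1/N)**(n - 1))."""
    return n * (1 - (1 - 1 / N) ** (n - 1))

def simulate_sharers(n, N, trials=20000, seed=0):
    """Monte Carlo estimate of the expected number of people who share
    a birthday with at least one other person (uniform birthdays assumed)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    for _ in range(trials):
        # Count how many of the n people land on each of the N days.
        counts = Counter(rng.randrange(N) for _ in range(n))
        # Every person on a day holding 2 or more people is a "sharer".
        total += sum(c for c in counts.values() if c >= 2)
    return total / trials

if __name__ == "__main__":
    n, N = 23, 365
    print("closed form:", expected_sharers(n, N))
    print("simulation: ", simulate_sharers(n, N))
```

For $n=23$ and $N=365$ the closed form gives about $1.35$, and the simulated average should land close to that, up to sampling noise.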


I will try to get control of the most standard interpretation of our question by using (at first) very informal language. Let us call someone unhappy if one or more people share his/her "birthday." We want to find the "expected number" of unhappy people.

Define the random variable $X$ by saying that $X$ is the number of unhappy people. We want to find $\text{E}(X)$. Let $p_i$ be the probability that $X=i$. Then $$\text{E}(X)=\sum_{i=0}^{n} i\,p_i$$ That is roughly the approach that you took. That approach is correct, and a very reasonable thing to try. Indeed, you have probably been trained to use this approach, since that's exactly how you solved the exercises that followed the definition of expectation.

Unfortunately, in this problem, finding the $p_i$ is very difficult. One could, as you did, decide that for a good approximation, only the first few $p_i$ really matter. That is sometimes true, but it depends quite a bit on the number $N$ of "days in the year" and the number $n$ of people.

Fortunately, in this problem, and many others like it, there is an alternative, very effective approach. It involves a bit of theory, but the payoff is considerable.

Line the people up in a row. Define the random variables $U_1,U_2,U_3,\dots,U_n$ by saying that $U_k=1$ if the $k$-th person is unhappy, and $U_k=0$ if the $k$-th person is not unhappy. The crucial observation is that $$X=U_1+U_2+U_3+\cdots + U_n$$

One way to interpret this is that you, the observer, go down the line of people, making a tick mark on your tally sheet if the person is unhappy, and making no mark if the person is not unhappy. The number of tick marks is $X$, the number of unhappy people. It is also the sum of the $U_k$.

We next use the following very important theorem: The expectation of a sum is the sum of the expectations. This theorem holds "always." The random variables you are summing need not be independent. In our situation, the $U_k$ are not independent, but, for expectation of a sum, that does not matter. So we have $$\text{E}(X)=\text{E}(U_1) + \text{E}(U_2)+ \text{E}(U_3)+\cdots +\text{E}(U_n)$$

Finally, note that the probability that $U_k=1$ is, as carefully explained by @Henry, equal to $p$, where $$p=1-(1-1/N)^{n-1}$$ It follows that $\text{E}(U_k)=p$ for any $k$, and therefore $\text{E}(X)=np$.
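If you would like to see the indicator argument confirmed on a case small enough to check by brute force, the following Python sketch enumerates every possible assignment of birthdays and compares the exact value of $\text{E}(X)$ with $np$. The helper names are hypothetical, and exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

def exact_expected_unhappy(n, N):
    """E(X) computed by enumerating all N**n equally likely birthday lists."""
    total_unhappy = 0
    for birthdays in product(range(N), repeat=n):
        # Person k is unhappy (U_k = 1) if their birthday appears at least twice.
        total_unhappy += sum(1 for b in birthdays if birthdays.count(b) >= 2)
    return Fraction(total_unhappy, N ** n)

def n_times_p(n, N):
    """n * p with p = 1 - (1 - 1/N)**(n - 1), kept as an exact fraction."""
    p = 1 - (1 - Fraction(1, N)) ** (n - 1)
    return n * p

if __name__ == "__main__":
    for n, N in [(3, 3), (4, 5), (5, 4)]:
        print(n, N, exact_expected_unhappy(n, N), n_times_p(n, N))
```

The two columns agree exactly, even though the $U_k$ are dependent. This is only feasible for small $n$ and $N$, since the enumeration visits $N^n$ cases.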


The following approximation may be useful.

If there are $k$ people and $N$ possible birthdays (or in case of a hash table, $k$ items being hashed into $N$ buckets), then the expected number of people/items that collide with at least one of the others is exactly (see Henry's answer or André Nicolas's answer) $$ \begin{align} & k \left(1 - \left(1-\frac1N\right)^{k-1}\right) \\ & = \frac{k(k-1)}{N} - \frac{k(k-1)(k-2)}{2N^2} + O\left(\frac1{N^3}\right) \\ & \approx \frac{k^2}{N}. \end{align}$$
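To get a feel for how good these approximations are, here is a small Python sketch that evaluates the exact expression, the two-term series, and the leading term $k^2/N$ side by side. The function names are mine, chosen for illustration. The series is an expansion in powers of $1/N$ with $k$ held fixed, so expect it to be useful mainly when $k$ is much smaller than $N$.

```python
def exact_colliding_items(k, N):
    """Exact expected number of items that collide with at least one other."""
    return k * (1 - (1 - 1 / N) ** (k - 1))

def two_term_series(k, N):
    """First two terms of the expansion in powers of 1/N."""
    return k * (k - 1) / N - k * (k - 1) * (k - 2) / (2 * N ** 2)

def leading_term(k, N):
    """Crudest approximation, k**2 / N."""
    return k * k / N

if __name__ == "__main__":
    for k, N in [(23, 365), (100, 10**6), (1000, 10**6)]:
        print(k, N,
              exact_colliding_items(k, N),
              two_term_series(k, N),
              leading_term(k, N))
```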


The above is one possible definition of "expected number of collisions". If there are $r$ birthdays/buckets each with two people/items in them, the above expression gives count $2r$, as it counts each member of each pair. If instead you want to count the number of buckets/birthdays that have multiple people in them, then the answer is approximately $$\frac{k^2}{2N}.$$

This result can be derived either

  • from the previous analysis, by noting that to the first order the most common type of collision is to have 2 in a bucket (3-way and higher collisions will be statistically rare), so you just halve the count;

  • or, by doing a similar analysis focusing on birthdays/buckets: for any particular birthday/bucket, the probability that either $0$ or $1$ of the $k$ people/items land on it is $$ \left(1 - \frac1N\right)^k + k\frac1N\left(1 - \frac1N\right)^{k-1}$$ So the expected number of buckets with multiple values in them is $$ \begin{align} & N \left(1 - \left(1 - \frac1N\right)^k - k\frac1N\left(1 - \frac1N\right)^{k-1}\right) \\ & = \frac{k(k-1)}{2N} - \frac{k(k-1)(k-2)}{3N^2} + O\left(\frac1{N^3}\right) \\ & \approx \frac{k^2}{2N}. \end{align}$$
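The bucket-counting version can be checked the same way. Below is a minimal Python sketch, assuming items fall into buckets independently and uniformly at random; the function names, trial count and seed are illustrative choices, not part of the derivation.

```python
import random
from collections import Counter

def expected_crowded_buckets(k, N):
    """Exact expected number of buckets holding 2 or more of the k items."""
    q = 1 - 1 / N
    # 1 - P(bucket holds 0 items) - P(bucket holds exactly 1 item), times N buckets.
    return N * (1 - q ** k - (k / N) * q ** (k - 1))

def simulate_crowded_buckets(k, N, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity (uniform buckets assumed)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(N) for _ in range(k))
        # Count buckets, not items: each crowded bucket contributes 1.
        total += sum(1 for c in counts.values() if c >= 2)
    return total / trials

if __name__ == "__main__":
    k, N = 23, 365
    print("exact:      ", expected_crowded_buckets(k, N))
    print("simulation: ", simulate_crowded_buckets(k, N))
    print("k^2 / (2N): ", k * k / (2 * N))
```

For $k=23$ and $N=365$ the exact value is about $0.67$, roughly half of the per-person count, as the halving argument suggests.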