Determine the Size of a Test Bank

This can be solved by 'capture-recapture' (or 'mark-recapture') methods of estimating population size. One person's set of 30 questions plays the role of the 'capture' sample and the other person's set is the 'recapture' sample; the 7 questions they share are the 'recaptured' items. The 'Chapman' estimator (see Wikipedia on 'mark and recapture') in this case is $\hat N_C = (30 + 1)(30 + 1)/(7 + 1) -1 \approx 119.$ Based on a hypergeometric model, this estimator is nearly unbiased. The Wikipedia article gives two methods for finding a corresponding confidence interval.
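The claim of near-unbiasedness can be checked informally by simulation. The sketch below assumes a hypothetical true pool size of 120 (the names `true_N` and `reps` and the seed are illustrative choices, not part of the question) and draws the overlap from the hypergeometric model with `rhyper`.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
true_N <- 120        # hypothetical pool size (an assumption, not from the question)
reps   <- 1e5        # number of simulated pairs of test-takers

# Overlap between two independent random sets of 30 questions out of true_N
overlap <- rhyper(reps, m = 30, n = true_N - 30, k = 30)

# Chapman estimate for each simulated overlap
chapman <- (30 + 1) * (30 + 1) / (overlap + 1) - 1

mean(chapman)        # should land close to true_N if the estimator is nearly unbiased
```

Changing `true_N` shows how the average estimate tracks other pool sizes; this is only a sanity check, not a proof.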

The older and simpler 'Lincoln-Petersen' estimator is simply $\hat N = 30^2/7 \approx 128.6.$ It gives an infinite value if there happen to be no repeated questions, an event of positive probability. Thus $E(\hat N)$ is not finite, and one cannot meaningfully discuss the unbiasedness of this estimator.
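For reference, both point estimates can be reproduced with a few lines of R. This is a minimal sketch; the variable names `n1`, `n2`, and `m` are illustrative labels for the two sample sizes and the overlap.

```r
n1 <- 30   # questions seen by the first person ("capture")
n2 <- 30   # questions seen by the second person ("recapture")
m  <- 7    # questions seen by both

# Chapman estimator: stays finite even when m = 0
chapman <- (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Lincoln-Petersen estimator
lincoln_petersen <- n1 * n2 / m

# With no shared questions the Lincoln-Petersen value is infinite (Inf in R)
lp_no_overlap <- n1 * n2 / 0

c(chapman = chapman, lincoln_petersen = lincoln_petersen)
```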

Addendum: The comments and the answer by @GregoryGrant use the Lincoln-Petersen estimator, which (rounded down to an integer) is the maximum likelihood estimator given that there are 7 coincidences. Here is some relevant R code and a figure:

    N = 100:150         # candidate pool sizes
    # hypergeometric likelihood of exactly 7 shared questions
    like = choose(30,7)*choose(N-30, 30-7)/choose(N, 30)
    N[like==max(like)]  # value of N that maximizes 'like'
    ## 128
    plot(N, like, pch=20);  abline(v=128, lty="dotted")

[Figure: likelihood plotted against N for N = 100 to 150, with a dotted vertical line at N = 128]

Note: Here is one method to get an analytic solution for the maximum: Let $f(N|7) = {30 \choose 7}{N-30 \choose 23}/{N \choose 30}.$ Then look at $f(N|7)/f(N-1|7),$ simplifying it with lots of cancellation. Then notice where the ratio crosses 1: the likelihood increases while the ratio exceeds 1 and decreases once it falls below 1.
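One way the cancellation might go (a sketch, valid for $N \ge 54$, and worth re-deriving by hand):

$$\frac{f(N|7)}{f(N-1|7)} = \frac{{N-30 \choose 23}}{{N-31 \choose 23}}\cdot\frac{{N-1 \choose 30}}{{N \choose 30}} = \frac{N-30}{N-53}\cdot\frac{N-30}{N} = \frac{(N-30)^2}{N(N-53)}.$$

The ratio is at least $1$ exactly when $(N-30)^2 \ge N(N-53),$ that is, when $900 \ge 7N,$ or $N \le 900/7 \approx 128.6.$ So $f(N|7)$ increases up to $N = 128$ and decreases from $N = 129$ onward, which agrees with the numerical maximum at $128.$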


The minimum number in the pool must be $53$: the two sets of $30$ questions share $7$, so $30 + 30 - 7 = 53$ distinct questions have been seen. Suppose there are $n$ in total.

Think of an urn with $n$ balls, of which $30$ are white (the questions the first person saw) and $n-30$ are red. You draw $30$ balls at random without replacement (the second person's questions) and ask how many of them are white. More specifically, you want the probability that exactly $7$ of the $30$ you draw are white.

Let $A$ be the number of white balls drawn. Then $P(A=k)$ is hypergeometric and equal to

$\frac{{{30}\choose{k}}{ {n-30}\choose{30-k}}}{{n}\choose{30}}$

So in your case:

$\frac{{{30}\choose{7}}{{n-30}\choose{23}}}{{n}\choose{30}}$

This is the probability of an overlap of exactly $7$.
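As a quick check, the formula can be coded directly and compared with R's built-in hypergeometric density `dhyper`. The pool size of 100 below is a hypothetical value chosen only for illustration, and `overlap_prob` is an illustrative helper name.

```r
# Overlap probability written exactly as in the formula above
overlap_prob <- function(k, pool) {
  choose(30, k) * choose(pool - 30, 30 - k) / choose(pool, 30)
}

pool    <- 100     # hypothetical pool size (an assumption)
overlap <- 0:30    # every possible number of shared questions

p_formula <- overlap_prob(overlap, pool)
# dhyper(x, m, n, k): m white balls, n red balls, k balls drawn
p_builtin <- dhyper(overlap, m = 30, n = pool - 30, k = 30)

all.equal(p_formula, p_builtin)   # the two versions should agree
overlap_prob(7, pool)             # probability of exactly 7 shared questions
```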

You now need to find the $n$ that maximizes that probability.

If you start plugging in numbers (using a calculator or software) starting at $n=53$, you'll see that the probability goes up and then starts to go back down. Choose the $n$ at the peak, just before it starts going back down. The peak turns out to be at $n = 128$, the largest integer not exceeding $30^2/7 \approx 128.6$.
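The search described above can be sketched in R using `dhyper`. The upper limit of 300 is an arbitrary assumption, chosen to be comfortably past the peak.

```r
pool  <- 53:300    # candidate pool sizes; 300 is an arbitrary upper limit
# Probability of exactly 7 shared questions for each candidate pool size
prob7 <- dhyper(7, m = 30, n = pool - 30, k = 30)

best <- pool[which.max(prob7)]   # pool size with the highest probability
best
```

Plotting `prob7` against `pool` shows the rise and fall described above.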