Confidence interval interpretation difficulty

For an analogy, consider the following game. Alice pays Bob five dollars to flip a fair coin. If the coin lands heads, Alice wins ten dollars; if the coin lands tails, Alice wins nothing. Let $W$ be the random variable representing Alice's net winnings (her payout minus the five dollars she paid). Consider the question, "Did Alice win five dollars?" (i.e. "Is $W = +5$?")

Now:

  • before Bob flips the coin, we have: $$P(W = +5) = P(W = -5) = 0.5.$$ So the answer is Yes with probability $0.5$.

But,

  • after Bob flips it, the coin either came up heads, or it came up tails. So $W$ is now either equal to $+5$, or not. The answer is now Yes either with probability $1$, or probability $0$.

This is the case generally: the act of performing an experiment changes probabilities into certainties. Whatever likelihood we assign beforehand to an event happening or not happening ceases to matter after the experiment has been performed: the event either did actually happen, or did not actually happen.
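To see the distinction in frequentist terms, here is a minimal simulation sketch (Python; the number of replays is an arbitrary choice of mine): the statement $P(W = +5) = 0.5$ describes what happens over many repetitions of the game, whereas any single played game produces a $W$ that simply is or is not $+5$.

```python
import random

random.seed(1)

# Before the flip: "P(W = +5) = 0.5" is a statement about the long-run
# frequency of heads over many (hypothetical) plays of the game.
n_games = 100_000
heads = sum(random.random() < 0.5 for _ in range(n_games))
print(f"Fraction of games with W = +5: {heads / n_games:.3f}")   # about 0.5

# After one particular flip: W is just a number, either +5 or -5,
# and "W = +5" is now simply true or false.
w = 5 if random.random() < 0.5 else -5
print(f"This game's W is {w}, so the answer is {'Yes' if w == 5 else 'No'}.")
```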

Similarly for your question about 95% confidence intervals. When we ask the question, "Does the 95% confidence interval $(L, U)$ contain the true population parameter?" where $L, U$ are the random variables representing the lower and upper endpoints of the interval, then before we take our sample, the answer is Yes with probability $0.95$.

But after we take our sample, $L$ and $U$ are no longer random variables, but have taken specific numerical values. Once the sample is taken and the endpoints are calculated, either $(L, U)$ actually contains the true population parameter, or does not actually contain the true population parameter. So the probability of the answer being Yes is now either $1$ (if it does contain the true parameter) or $0$ (if it does not).
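The same point can be illustrated with a short simulation sketch. The setting below (a normal population with known standard deviation and a textbook $z$-interval for the mean) is my own illustrative choice, not something specified in the question: across repeated samples, roughly $95\%$ of the computed intervals $(L, U)$ cover the true parameter, yet any single realized interval either does or does not.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 10.0, 2.0, 25        # true mean, known sd, sample size
z = 1.959964                           # 97.5th percentile of the standard normal

covered = 0
n_reps = 10_000
for _ in range(n_reps):
    x = rng.normal(theta, sigma, n)            # take a sample
    L = x.mean() - z * sigma / np.sqrt(n)      # lower endpoint (a statistic)
    U = x.mean() + z * sigma / np.sqrt(n)      # upper endpoint (a statistic)
    covered += (L <= theta <= U)               # for THIS sample: true or false

# Before sampling: probability ~0.95 that (L, U) will cover theta.
print(f"Coverage over {n_reps} repetitions: {covered / n_reps:.3f}")   # ~0.95

# After one particular sample: the interval is just two numbers.
x = rng.normal(theta, sigma, n)
L, U = x.mean() - z * sigma / np.sqrt(n), x.mean() + z * sigma / np.sqrt(n)
print(f"One realized interval: ({L:.2f}, {U:.2f}); "
      f"it {'does' if L <= theta <= U else 'does not'} contain theta.")
```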


I think a better way to conceptualize confidence intervals (in the frequentist sense) is to first go back to point estimates.

Suppose we calculate a point estimate $W$ for a fixed but unknown parameter $\theta$. Here $W$ is a statistic: a random variable whose value is calculated from the sample $(X_1, \ldots, X_n)$ and does not depend on any unknown parameters. It is random in the sense that it inherits the randomness of the sample, not in the sense that the calculation of $W$ from the sample is itself random. For example, we can write a specific formula or rule that calculates $W$ once we have observed $X_1, \ldots, X_n$, but the resulting value of $W$ may vary from sample to sample.

As such, we do not have any difficulty understanding that $W$ is an estimate of $\theta$, rather than its true value, which remains unknown to us. We could collect many samples and calculate many different estimates. If we plotted a histogram of these estimates, we would see what is called the sampling distribution of $W$; if the estimator is a "good" one, most outcomes will tend to "cluster" around the true value of $\theta$.
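As a sketch of that picture (Python, using the sample mean as the estimator $W$ and a normal population purely as an illustrative assumption), repeating the sampling many times and plotting the resulting estimates makes the sampling distribution visible:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
theta, sigma, n = 4.2, 1.0, 30        # true parameter and illustrative settings

# Collect many samples; each yields one value of the estimator W (here, the mean).
W = np.array([rng.normal(theta, sigma, n).mean() for _ in range(5_000)])

plt.hist(W, bins=50)
plt.axvline(theta, color="red", label="true theta")
plt.xlabel("W (sample mean)")
plt.title("Sampling distribution of W, clustering around theta")
plt.legend()
plt.show()
```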

Now, when we calculate a confidence interval, the idea is to move away from point estimation and talk about pairs of random variables that enclose a range of values that estimate $\theta$. For instance, if we collected ten point estimates and they were $$\{4.2, 4.9, 3.9, 3.75, 4.1, 4.3, 4.45, 3.95, 4.05, 4.5\},$$ this gives us some idea of $\theta$. But ten confidence intervals might look like $$\{(3.7, 4.5), (3.85, 4.65), (4.0, 4.8), (3.9, 4.7), (4.1, 4.9), \\ (4.3, 5.1), (3.35, 4.1), (3.6, 4.4), (4.2, 5.0), (4.25, 5.05)\}.$$ Each time we collect a sample, we calculate two statistics--one for the lower endpoint and one for the upper, with the understanding that their difference incorporates in some sense the underlying variability observed in the sample. But how do we interpret this interval? What does "$95\%$ confidence" mean?
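Before turning to that question, here is one way such a pair of endpoint statistics might be computed from a single sample (a sketch using the usual $t$-interval for a mean; the simulated data and the choice of interval are illustrative assumptions on my part):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(4.2, 0.8, size=20)           # one illustrative sample

xbar = x.mean()                              # point estimate of theta
se = x.std(ddof=1) / np.sqrt(len(x))         # variability observed in the sample
t_crit = stats.t.ppf(0.975, df=len(x) - 1)   # 95% confidence -> 2.5% in each tail

L = xbar - t_crit * se                       # lower endpoint statistic
U = xbar + t_crit * se                       # upper endpoint statistic
print(f"One 95% confidence interval: ({L:.2f}, {U:.2f})")
```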

The idea is that in order to calculate an interval estimate, we not only need to quantify the variability in the sample, but we also need to set a criterion, called "confidence," that expresses how tolerant we are of the possibility that our interval might fail to enclose the true value of $\theta$. For example, if we want $99.9\%$ confidence, this means that we want the chance that the interval we calculate encloses $\theta$ to be at least this high. Such an interval will therefore be at least as wide as (and generally speaking much wider than) an interval with only $90\%$ confidence.
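For a normal-based interval, this width comparison can be read directly off the critical values, since the half-width of the interval scales with the critical value (a small sketch; the normal-interval form is an assumption for illustration):

```python
from scipy import stats

# Half-width of a normal-based interval is z * (sigma / sqrt(n)),
# so the width grows with the critical value z for the chosen confidence.
for conf in (0.90, 0.95, 0.999):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    print(f"{conf:.1%} confidence -> critical value {z:.3f}")

# 90.0% confidence -> critical value 1.645
# 95.0% confidence -> critical value 1.960
# 99.9% confidence -> critical value 3.291
```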

So why don't we ask for $100\%$ confidence intervals? Because except in trivial cases, to be $100\%$ confident you "caught" $\theta$ in your interval, you'd need an infinitely large interval, thus negating the value of computing an estimate at all.

For American audiences, there is a game show called "The Price is Right." One of the games played with contestants is called the Range Game. The host shows the contestant a car, and the contestant needs to guess the price of the car in order to win it. But the guess isn't a point estimate--the contestant doesn't have to guess the exact value. Instead, they watch a transparent red slider of fixed width move steadily upward over a vertical chart of prices, and they press a button to stop the slider when they think it is covering the true price of the car. Once the slider is stopped, its edges indicate the range of prices within which the contestant believes the true price is contained.

This game is exactly analogous to how confidence intervals work. The true price is fixed but unknown to the contestant. The act of pressing the button is equivalent to calculating the confidence interval. If you were to play this game with the same car but for many different contestants, each one would stop the slider at a slightly different point. Not all of the contestants would win; some would miss. The proportion of contestants who win is the confidence level. If the game show made the slider huge, then it would be easier to win, but the precision of the price estimate suffers. Conversely, if the slider were made tiny, it would be hard to win but the estimate has better precision.