Is there a mathematical basis for the idea that this interpretation of confidence intervals is incorrect, or is it just frequentist philosophy?

We need to distinguish between two claims here:

  1. Population parameters cannot be random, only the data we obtain about them can be random.
  2. Interpreting confidence intervals as containing a parameter with a certain probability is wrong.

The first is a sweeping statement that you correctly describe as frequentist philosophy (in some cases “dogma” would seem more appropriate) and that you don't need to subscribe to if you find a subjectivist interpretation of probabilities to be interesting, useful or perhaps even true. (I certainly find it at least useful and interesting.)

The second statement, however, is true. Confidence intervals are inherently frequentist animals: they're constructed so that, whatever the fixed value of the unknown parameter happens to be, the procedure has the same prescribed probability of producing an interval that contains that “true” value. You can't construct them according to this frequentist prescription and then reinterpret them in a subjectivist way; the resulting statement is false not because it fails to follow frequentist dogma, but because the interval was never derived to carry a subjective probability. A Bayesian approach leads to a different interval, which is rightly given a different name, a credible interval.

An instructive example is afforded by the confidence intervals for the unknown rate parameter of a Poisson process with known background noise rate. In this case, there are values of the data for which it is certain that the confidence interval does not contain the “true” parameter. This is not an error in the construction of the intervals; they have to be constructed like that to allow them to be interpreted in a frequentist manner. Interpreting such a confidence interval in a subjectivist manner would result in nonsense. The Bayesian credible intervals, on the other hand, always have a certain probability of containing the “random” parameter.
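To make this concrete, here is a minimal numerical sketch (my own illustration, not taken from the paper referenced below): using the standard exact (Garwood) interval for a Poisson mean and subtracting a known background rate, an unlucky observation can produce an empty interval for the signal rate, which therefore certainly does not contain the true value. The background rate of 5 and the observed count of 0 are assumed purely for illustration.

```python
# A minimal sketch: the exact (Garwood) central confidence interval for a
# Poisson mean mu = s + b, with the background rate b known, can have an
# upper limit below b. The implied interval for the signal rate s >= 0 is
# then empty, so it certainly does not contain the true value of s, even
# though the procedure has the advertised coverage over repeated sampling.
from scipy.stats import chi2

def poisson_ci(n, alpha=0.05):
    """Exact central (Garwood) confidence interval for a Poisson mean."""
    lower = 0.0 if n == 0 else 0.5 * chi2.ppf(alpha / 2, 2 * n)
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (n + 1))
    return lower, upper

background = 5.0   # known background rate (chosen purely for illustration)
n_observed = 0     # an unlucky but perfectly possible observation

lo, hi = poisson_ci(n_observed)
print(f"95% CI for mu = s + b: [{lo:.3f}, {hi:.3f}]")       # [0.000, 3.689]
print(f"implied upper limit for s: {hi - background:.3f}")   # -1.311 < 0: empty interval for s
```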

I read a nice exposition of this example recently but I can't find it right now – I'll post it if I find it again, but for now I think this paper is also a useful introduction. (Example 11 on p. 20 is particularly amusing.)


Here is how it appears in Larry Wasserman's All of Statistics:

Warning! There is much confusion about how to interpret a confidence interval. A confidence interval is not a probability statement about $\theta$ (the parameter of the problem), since $\theta$ is a fixed quantity, not a random variable. Some texts interpret confidence intervals as follows: If I repeat the experiment over and over, the interval will contain the parameter 95 percent of the time. This is correct but useless, since you rarely repeat the same experiment over and over. A better interpretation is this: On day 1 you collect data and construct a 95 percent confidence interval for a parameter $\theta_1$. On day 2, you collect new data and construct a 95 percent confidence interval for an unrelated parameter $\theta_2$. [...] You continue this way constructing confidence intervals for a sequence of unrelated parameters $\theta_1, \theta_2, \dots$. Then 95 percent of your intervals will trap the true parameter value. There is no need to introduce the idea of repeating the same experiment over and over.
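A quick simulation in the spirit of Wasserman's "different day, unrelated parameter" reading (my own sketch; the normal model, known unit standard deviation, and parameter ranges are arbitrary illustrative choices):

```python
# Each "day" has its own unrelated true mean; we build a standard 95% CI for
# that day's mean from that day's data. About 95% of the intervals trap their
# own parameter, with no repetition of any single experiment.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
days, n, alpha = 10_000, 25, 0.05
z = norm.ppf(1 - alpha / 2)

hits = 0
for _ in range(days):
    theta = rng.uniform(-100, 100)        # a different, unrelated parameter each day
    x = rng.normal(theta, 1.0, size=n)    # today's data (sd = 1 assumed known)
    half = z / np.sqrt(n)
    if x.mean() - half <= theta <= x.mean() + half:
        hits += 1

print(hits / days)   # close to 0.95
```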


Confidence intervals within the frequentist paradigm: You are correct that these assertions (warning against interpreting the confidence interval as a probability interval for the parameter) come from the fact that confidence intervals arise in the classical frequentist method, and in that context, the parameter is considered a fixed "unknown constant", not a random variable. There is a relevant probability statement pertaining to the confidence interval, which is:

$$\mathbb{P}(L(\mathbf{X}) \leqslant \mu \leqslant U(\mathbf{X}) \mid \mu) = 1-\alpha,$$

where $L(\mathbf{X})$ and $U(\mathbf{X})$ are bounds formed as functions of the sample data $\mathbf{X}$ (usually by rearranging a probability statement about a pivotal quantity). Importantly, the data vector $\mathbf{X}$ is the random variable in this probability statement, and the parameter $\mu$ is treated as a fixed "unknown constant". (I have indicated this by putting it as a conditioning variable, but within the frequentist paradigm you wouldn't even specify this; it would just be implicit.) The confidence interval is derived from this probability statement by substituting the observed sample data $\mathbf{x}$ to yield the fixed interval $\text{CI}(1-\alpha) = [ L(\mathbf{x}), U(\mathbf{x}) ]$.
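For concreteness (this is just the standard textbook case, not anything specific to the present problem): for normally distributed data with known variance $\sigma^2$, the pivotal quantity is $\sqrt{n}(\bar{X}-\mu)/\sigma \sim \text{N}(0,1)$, and rearranging the probability statement about it gives the familiar bounds:

$$\mathbb{P}\left(-z_{\alpha/2} \leqslant \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} \leqslant z_{\alpha/2} \,\Big|\, \mu \right) = 1-\alpha \quad\Longrightarrow\quad L(\mathbf{X}) = \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \qquad U(\mathbf{X}) = \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$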

The reason for the assertions you are reading is that once you replace the random sample data $\mathbf{X}$ with the observed sample data $\mathbf{x}$, you can no longer make the probability statement analogous to the above. Since the data and parameters are both constants, you now have the trivial statement:

$$\mathbb{P}(L(\mathbf{x}) \leqslant \mu \leqslant U(\mathbf{x})) = \begin{cases} 0 & \text{if } \mu \notin \text{CI}(1-\alpha), \\[6pt] 1 & \text{if } \mu \in \text{CI}(1-\alpha). \end{cases}$$


Confidence intervals within the Bayesian paradigm: If you would prefer to interpret the unknown parameter $\mu$ as a random variable, you are now undertaking a Bayesian treatment of the problem. Although the confidence interval is a procedure formulated within the classical paradigm, it is possible to interpret it within the context of Bayesian analysis.

However, even within the Bayesian context, it is still not valid to assert a posteriori that the CI contains the true parameter with the specified probability. In fact, this posterior probability depends on the prior distribution for the parameter. To see this, we observe that:

$$\mathbb{P}(L(\mathbf{x}) \leqslant \mu \leqslant U(\mathbf{x}) \mid \mathbf{x}) = \int \limits_{L(\mathbf{x})}^{U(\mathbf{x})} \pi(\mu | \mathbf{x}) d\mu = \frac{\int_{L(\mathbf{x})}^{U(\mathbf{x})} L_\mathbf{x}(\mu) \pi(\mu)d\mu}{\int L_\mathbf{x}(\mu) \pi(\mu) d\mu}.$$

This posterior probability depends on the prior, and is not generally equal to $1-\alpha$ (though it may be in some special cases). The initial probability statement used in the confidence interval imposes a restriction on the sampling distribution, which constrains the likelihood function, but it still allows us freedom to choose different priors, yielding different posterior probabilities for the correctness of the interval.
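As a quick numerical illustration of this prior-dependence (my own sketch, using a conjugate normal model; the sample mean, sample size, and prior settings are made up for illustration), the posterior probability that the classical 95% interval contains $\mu$ moves around as the prior changes:

```python
# Posterior probability that the classical 95% CI contains mu, under a
# normal likelihood with known sd = 1 and a conjugate N(m0, tau^2) prior.
import numpy as np
from scipy.stats import norm

n, sigma, alpha = 10, 1.0, 0.05
xbar = 2.0                                   # observed sample mean (assumed)
z = norm.ppf(1 - alpha / 2)
L, U = xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n)

def posterior_prob(m0, tau):
    """P(L <= mu <= U | data) under a N(m0, tau^2) prior on mu."""
    prec = 1 / tau**2 + n / sigma**2         # posterior precision
    v = 1 / prec
    m = v * (m0 / tau**2 + n * xbar / sigma**2)
    return norm.cdf(U, m, np.sqrt(v)) - norm.cdf(L, m, np.sqrt(v))

print(posterior_prob(0.0, 100.0))   # diffuse prior: close to 0.95
print(posterior_prob(0.0, 0.5))     # informative prior centred at 0: about 0.57
```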

(Note: It is easy to show that $\mathbb{P}(L(\mathbf{X}) \leqslant \mu \leqslant U(\mathbf{X})) = 1-\alpha$ using the law of total probability, but this is a prior probability, not a posterior probability, since it does not condition on the data. Thus, within the Bayesian paradigm, we may say a priori that the confidence interval will contain the parameter with the specified probability, but we cannot generally say this a posteriori.)
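Explicitly, the step referred to in the note is just:

$$\mathbb{P}(L(\mathbf{X}) \leqslant \mu \leqslant U(\mathbf{X})) = \int \mathbb{P}(L(\mathbf{X}) \leqslant \mu \leqslant U(\mathbf{X}) \mid \mu) \, \pi(\mu) \, d\mu = \int (1-\alpha) \, \pi(\mu) \, d\mu = 1-\alpha,$$

where the inner conditional probability equals $1-\alpha$ for every $\mu$ by the frequentist coverage statement above.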