Is Bayes' Theorem really that interesting?

You are mistaken in thinking that what you perceive as "the massive importance that is afforded to Bayes' theorem in undergraduate courses in probability and popular science" really is importance afforded to Bayes' theorem itself. But it's probably not your fault: this usually doesn't get explained very well.

What is the probability of a Caucasian American having brown eyes? What does that question mean? By one interpretation, commonly called the frequentist interpretation of probability, it asks merely for the proportion of persons having brown eyes among Caucasian Americans.

What is the probability that there was life on Mars two billion years ago? What does that question mean? It has no answer according to the frequentist interpretation. "The probability of life on Mars two billion years ago is $0.54$" is taken to be meaningless because one cannot say it happened in $54\%$ of all instances. But the Bayesian, as opposed to frequentist, interpretation of probability works with this sort of thing.

The Bayesian interpretation, applied to statistical inference, also avoids various pathologies that afflict that field.

Possibly you have seen that some people attach massive importance to the Bayesian interpretation of probability and mistakenly thought it was merely massive importance attached to Bayes' theorem. People who do consider Bayesianism important seldom explain this very clearly, primarily because that sort of exposition is not what they care about.


While I agree with Michael Hardy's answer, there is a sense in which Bayes' theorem is more important than just any identity in basic probability. Write Bayes' theorem as

$$P(\text{Hypothesis} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Hypothesis})\, P(\text{Hypothesis})}{P(\text{Data})}$$

The left hand side is what we usually want to know: given what we've observed, what should our beliefs about the world be? But the main thing that probability theory gives us is in the numerator on the right side: the frequency with which any given hypothesis will generate particular kinds of data. Probabilistic models in some sense answer the wrong question, and Bayes' theorem tells us how to combine this with our prior knowledge to generate the answer to the right question.
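To make the formula concrete, here is a minimal Python sketch of how a prior and a set of likelihoods combine into a posterior over a discrete collection of hypotheses; the hypothesis names and all of the numbers are invented for the example.

```python
# Minimal sketch: combining a prior with likelihoods via Bayes' theorem.
# The two hypotheses and all numbers below are invented for illustration.

priors = {"H1": 0.7, "H2": 0.3}          # P(Hypothesis)
likelihoods = {"H1": 0.02, "H2": 0.10}   # P(Data | Hypothesis)

# P(Data) = sum over H of P(Data | H) * P(H)  -- the normalizing constant
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# P(Hypothesis | Data) = P(Data | Hypothesis) * P(Hypothesis) / P(Data)
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}

print(posteriors)  # {'H1': 0.318..., 'H2': 0.681...}
```

Even though "H1" starts out more probable, the data are five times likelier under "H2", so the posterior tilts the other way; that reweighting of prior by likelihood is the whole content of the theorem.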

Frequentist methods that try not to use the prior have to reason about the left-hand quantity by indirect means, or else claim that in many applications the left-hand side is meaningless. They work, but they frequently confuse even professional scientists. For example, the common misconceptions about $p$-values come from people assuming that they are a left-side quantity when they are in fact a right-side quantity.


You might know only $\Pr[A\mid B]$ and not $\Pr[B\mid A]$, not because someone "adversarially told you the wrong one", but because one of those is a natural quantity to compute, and the other is a natural quantity to want to know.

I am about to teach Bayes' theorem in an undergraduate course in probability. The general setting I want to consider is when:

  • We have several competing hypotheses about the world. (Several candidates for $B$.)
  • If we assume one of these hypotheses, then we get a nice and easy probability problem where it's easy to find the probability of $A$: some observations that we've made. (Outside undergraduate probability courses, "nice and easy" is a relative term.)
  • We want to figure out which hypothesis is likelier.

The mammogram example may be natural, but it is less obviously so: we have to track down where the numbers given to us come from, and ask why we couldn't have been given the other quantities in the problem instead. So here are some examples where fewer of the numbers come to us out of thin air.

  1. Suppose you are communicating over a binary channel which flips bits $10\%$ of the time. (This number is given to us out of nowhere, but it is the natural quantity to ask about first.) Your friend has several possible messages they might send you: these are the hypotheses $B_1, B_2, \dots, B_m$. You receive a message: that's the observation $A$. Then $\Pr[A \mid B_i]$ is just $(0.1)^k (0.9)^{n-k}$ if $B_i$ is an $n$-bit message that differs from the one you received in $k$ places. On the other hand, $\Pr[B_i \mid A]$ is the quantity we want: it tells us how likely it is that your friend sent each candidate message. (See the sketch after this list.)
  2. You have a coin, and you don't know anything about its fairness. One possible assumption is that it lands heads with probability $p$, where $p \sim \text{Uniform}(0,1)$, but we could vary this. Then you flip the coin $n$ times and see $k$ heads. There are infinitely many hypotheses $B_p$, one for each possible $p$; under each of them, $\Pr[A \mid B_p]$ is just a binomial probability. The conditional PDF of $p$ given the observations, which is what Bayes' theorem gives us, tells us how likely the coin is to land heads on a future flip. (The posterior is worked out after this list.)
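For the channel example in item 1, here is a minimal Python sketch of the posterior computation. The candidate messages, the received string, and the uniform prior are all made up for illustration; only the $10\%$ flip probability comes from the problem statement.

```python
# Sketch of the noisy-channel example: posterior over candidate messages.
# Candidate messages and the received string are invented for illustration.

FLIP = 0.1  # the channel flips each bit independently with probability 0.1

candidates = ["0000", "1111", "1010"]  # hypotheses B_1, ..., B_m, uniform prior
received = "1011"                      # observation A

def likelihood(sent: str, got: str, flip: float = FLIP) -> float:
    """P(A | B_i): each differing bit was flipped, each matching bit was not."""
    k = sum(s != g for s, g in zip(sent, got))  # number of differing places
    return flip ** k * (1 - flip) ** (len(sent) - k)

prior = 1 / len(candidates)
evidence = sum(likelihood(c, received) * prior for c in candidates)
posterior = {c: likelihood(c, received) * prior / evidence for c in candidates}

print(posterior)  # "1111" and "1010" come out far more plausible than "0000"
```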
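And for the coin example in item 2, the posterior can be written out explicitly under the uniform prior mentioned there (a standard computation, sketched here for concreteness):

$$f(p \mid A) = \frac{\Pr[A \mid B_p]\, f(p)}{\int_0^1 \Pr[A \mid B_q]\, f(q)\, dq} = \frac{\binom{n}{k} p^k (1-p)^{n-k}}{\int_0^1 \binom{n}{k} q^k (1-q)^{n-k}\, dq} \propto p^k (1-p)^{n-k},$$

which is the density of a $\mathrm{Beta}(k+1,\ n-k+1)$ distribution; its mean, $\frac{k+1}{n+2}$, is then the probability that the next flip lands heads.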