Infinite possibilities of you

This is a frequently encountered 'booby trap' in information theory, but it turns out that having a continuous degree of freedom does not entail free access to an infinite amount of information.

First of all, while the wavefunction is continuous, you can discretize it quite easily. We know every nucleus and electron in your brain is confined to a box, say, 20 cm on a side, so we can describe it in that basis. You still have an infinite number of eigenstates, but the set is now discrete. This shift in perspective from uncountable to countable infinity comes about because physically accessible wavefunctions must be continuous and smooth, and there are not actually that many of those.
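To see that countability concretely, here is a toy sketch for nothing more elaborate than a single electron in a 20 cm box, listing a few of the eigenenergies $E_n = n^2\pi^2\hbar^2/(2mL^2)$; the single-particle box model is purely illustrative.

```python
# Toy illustration (assumed model: a single electron in a 1D box of side 20 cm):
# the eigenstates are labelled by an integer n, so the basis is countable.
import math

HBAR = 1.054571817e-34   # reduced Planck constant, J*s
M_E  = 9.1093837015e-31  # electron mass, kg
EV   = 1.602176634e-19   # J per eV
L    = 0.20              # box width, m

def box_energy_eV(n: int) -> float:
    """Energy E_n = n^2 pi^2 hbar^2 / (2 m L^2) of the n-th box eigenstate, in eV."""
    return n**2 * math.pi**2 * HBAR**2 / (2 * M_E * L**2) / EV

for n in (1, 2, 3, 10, 1000):
    print(f"n = {n:>5}: E_n = {box_energy_eV(n):.3e} eV")
# Each eigenfunction is still a continuous function of position, but the
# spectrum is a countable ladder indexed by the integer n.
```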

So far, we still have an infinite number of eigenstates in our description. However, we know that the energy content of the human brain is bounded, which means that beyond a certain point all the energy eigenstates must have negligible contribution. Indeed, if a sizable fraction of the electrons in our brain had energies above, say, 1 GeV, we would instantly come apart in a blaze of gamma rays and positrons and whatnot. Informationally, this means that you can provide an approximation to the wavefunction that is good for all practical purposes using only a finite number of states. An experiment that could distinguish states at that level of approximation would need so much energy it would incinerate your brain.
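To put a very rough number on 'finite', one can count how many 3D box states lie below an energy cutoff. The single non-relativistic electron and the cutoffs below are my own illustrative choices (the non-relativistic counting formula would not even apply at 1 GeV); the only point is that the count, while astronomically large, is finite.

```python
# Rough state-counting sketch (assumptions: one non-relativistic electron in a
# 3D box of side 20 cm; the cutoffs below are purely illustrative).
import math

HBAR = 1.054571817e-34   # J*s
M_E  = 9.1093837015e-31  # kg
EV   = 1.602176634e-19   # J per eV
L    = 0.20              # box side, m

E1 = math.pi**2 * HBAR**2 / (2 * M_E * L**2)   # 1D energy unit (n = 1), in J

def states_below(E_max_eV: float) -> float:
    """Approximate number of 3D box states with E < E_max: one octant of a
    sphere of radius n_max in (n_x, n_y, n_z) space."""
    n_max = math.sqrt(E_max_eV * EV / E1)
    return (math.pi / 6) * n_max**3

for cutoff_eV in (1.0, 1e3, 1e6):
    n_states = states_below(cutoff_eV)
    print(f"E_max = {cutoff_eV:9.0e} eV: ~{n_states:.1e} states, "
          f"~{math.log2(n_states):.0f} bits to label one of them")
# Enormous, but finite: below any energy cutoff the relevant basis is finite,
# so a for-all-practical-purposes description of the state needs only a finite
# amount of information.
```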

This 'paradox' is also present in classical information theory, and it comes about when you ask what the information capabilities of analog computing are. Here the state of a system is encoded in, say, a voltage, and in principle you have infinite information there, because you can measure as many digits as you want. However, it turns out that the scalings tend to be unfavourable, and noise kills you quickly; the upshot is that in analog computing you need to be very careful about what precision and tolerance you demand of your noise and measuring apparatus when counting information capabilities. As it happens, high precision tends to be harder to achieve than simply having more coupled systems, with one bit per system.
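As a toy numerical version of that last point, take a single 'voltage' in $[0,1]$ read back through Gaussian noise (my own toy model, with an arbitrarily chosen noise level) and see what happens as you try to cram in more levels:

```python
# Toy model (assumed): store one of M evenly spaced levels in a value x in [0,1],
# read it back through additive Gaussian noise, decode to the nearest level.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.01          # read-out noise; naive bit count -log2(sigma) ~ 6.6 bits
trials = 20000

for M in (2, 8, 16, 32, 64, 256):
    levels = (np.arange(M) + 0.5) / M                        # level centres
    written = rng.integers(M, size=trials)                   # random symbols
    read = levels[written] + rng.normal(0.0, sigma, trials)  # noisy read-out
    decoded = np.clip(np.round(read * M - 0.5), 0, M - 1).astype(int)
    errors = np.mean(decoded != written)
    print(f"M = {M:>4} levels ({np.log2(M):4.1f} bits): raw error rate {errors:.3f}")
# With a handful of levels the read-back is essentially error-free; once the
# level spacing shrinks to a few sigma, the raw error rate blows up.  The usable
# precision, not the continuum, sets the budget, and getting close to the
# roughly 6.6 bits per variable that Shannon's theorem promises requires coding
# across many such variables, which is what the addendum below is about.
```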


A short addendum to Emilio Pisanty's last paragraph.

The information storable in one continuous variable (i.e. one real number, which in principle encodes $\aleph_0$ bits) is precisely quantified by Shannon's noisy channel coding theorem.

Let's suppose we have a normalized real variable $x \in [0,1]$: the interval represents the fact that we have a finite voltage range, or light intensity range, or whatever we use to record our information. We think of "writing down" our information from a source with Shannon entropy $H$ bits per symbol as a value $x\in[0,1]$. When we come to read this value, it will in general have been corrupted by noise, so its value will be some other $y\in[0,1]$, and we can think of the read/write cycle of the same variable as a transmission through a noisy channel.

Intuitively it makes sense to use only discrete values in the interval to stand for recorded information: the more tightly packed they are, the likelier they are to be corrupted in the read/write cycle, so we can see that there is going to be some limit here. So we have two discrete probability distributions: $p_X(x_j)$, the distribution of which symbol is written into the real variable, and $p_Y(y_k)$, the distribution of which symbol is read back instead.

The noisy channel coding theorem states that the maximum storage capacity $C$ in bits of this real variable is the supremum, over all possible symbol distributions $p_X(x_j)$, of the mutual information between the written and read symbols, i.e.

$$C = \sup\limits_{p_X} \left(\sum\limits_{x_j}\sum\limits_{y_k} p_{X,Y}(x_j,y_k) \log_2\frac{p_{X,Y}(x_j,y_k)}{p_X(x_j)\,p_Y(y_k)}\right)$$

where $p_{X,Y}(x_j,y_k)$ is the joint distribution of the written symbol $x_j$ and the read symbol $y_k$; it is this joint distribution that models the noise corruption of the written variable.
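To see the formula in action, here is a small numerical rendering of it for a toy version of the set-up above: $M$ evenly spaced write levels, Gaussian read-out noise of width $\sigma$, and the read value quantised into $K$ bins (all three are my own choices), with $p_X$ simply taken uniform, so the number computed is a lower bound on $C$ rather than the supremum itself.

```python
# Evaluate the double sum above for a toy channel: M evenly spaced write levels
# in [0,1], additive Gaussian noise of std sigma, read value quantised into K
# bins.  p_X is taken uniform, so this is a lower bound on the capacity C
# (which is the supremum over all p_X).
import math
import numpy as np

def mutual_information_bits(M=128, K=512, sigma=0.05):
    x = (np.arange(M) + 0.5) / M            # write levels x_j
    edges = np.linspace(0.0, 1.0, K + 1)    # read-out bin edges for y_k
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF
    # transition matrix P[j, k] = Pr(read value lands in bin k | wrote x_j)
    P = np.empty((M, K))
    for j, xj in enumerate(x):
        cdf = [Phi((e - xj) / sigma) for e in edges]
        P[j] = np.diff(cdf)
    P /= P.sum(axis=1, keepdims=True)       # renormalise the small mass outside [0,1]
    pX = np.full(M, 1.0 / M)                # assumed input distribution (uniform)
    pXY = pX[:, None] * P                   # joint p_{X,Y}(x_j, y_k)
    pY = pXY.sum(axis=0)                    # marginal p_Y(y_k)
    mask = pXY > 0
    ratio = pXY[mask] / (pX[:, None] * pY[None, :])[mask]
    return float(np.sum(pXY[mask] * np.log2(ratio)))

for sigma in (0.08, 0.04, 0.02, 0.01):
    print(f"sigma = {sigma:5}: I(X;Y) ~ {mutual_information_bits(sigma=sigma):.2f} bits, "
          f"-log2(sigma) = {-math.log2(sigma):.2f}")
# The mutual information climbs by roughly one bit every time sigma is halved,
# in line with the -log2(sigma) intuition; taking the supremum over p_X, as in
# the theorem, can only increase these figures.
```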

If the written variable is corrupted by Gaussian noise of variance $\sigma^2$, then we intuitively expect that the number of levels in $[0,1]$ that we can tell apart will be of the order of $\sigma^{-1}$, so we expect roughly $-\log_2 \sigma$ bits to be storable in the continuous interval. Indeed, if we apply the noisy channel coding theorem above to this situation, we find the Shannon-Hartley theorem, which is the noisy channel coding theorem for an additive Gaussian noise channel:

$$C = \frac{1}{2}\log_2(1 + \mathrm{SNR}) = \frac{1}{2}\log_2\left(1 + \frac{1}{\sigma^2}\right)$$

bits per symbol, where $\mathrm{SNR} = \sigma^{-2}$ is the signal-to-noise ratio; this approaches our intuitive expression $-\log_2 \sigma$ as $\mathrm{SNR}\to\infty$.
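Plugging a few numbers into this formula makes the limit explicit:

```python
# Numerical check that 0.5*log2(1 + 1/sigma^2) approaches -log2(sigma)
# as the noise shrinks (SNR -> infinity).
import math

for sigma in (0.5, 0.1, 0.01, 0.001):
    C = 0.5 * math.log2(1.0 + 1.0 / sigma**2)
    print(f"sigma = {sigma:>6}: C = {C:6.3f} bits, -log2(sigma) = {-math.log2(sigma):6.3f}")
```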

It is important to take heed of the remarkable fact that $C$ represents a situation arbitrarily near to perfect, noiseless information storage and is not a "rough measure of storable bits". That is, the noisy channel coding theorem takes exact account of the possibility of error-correcting coding spread over many such information storage variables. It assumes we have a large number of these unit intervals, that we spread our coded information over all of them, and that we deliberately introduce correlations between them through codeword structure so as to detect and correct errors. If we are allowed to do this over an arbitrarily large number of these unit intervals, then the theorem shows us that we can noiselessly encode $C$ bits per continuous variable, with the probability of any errors (after error correction) approaching nought as the number of coded variables gathered into each codeword increases without bound.
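To make the 'spreading over many variables' idea a little more tangible, here is a crude random-coding experiment; it is entirely a toy of my own (random codebooks, nearest-codeword decoding, a rate far below capacity, nowhere near an optimised code), but it shows the basic trade: the rate per variable stays fixed while longer codewords drive the error rate down.

```python
# Crude random-coding sketch (a toy, not a capacity-achieving construction):
# n unit-interval variables per codeword, a fixed rate in bits per variable,
# Gaussian read-out noise, nearest-codeword decoding.
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1      # per-variable read-out noise
rate = 0.5       # bits stored per unit-interval variable, far below capacity

for n in (2, 4, 8, 12, 16):
    M = 2 ** int(rate * n)           # codewords per block
    errs = 0
    trials = 0
    for _ in range(20):              # average over 20 random codebooks
        code = rng.random((M, n))    # codebook: M random points in [0,1]^n
        sent = rng.integers(M, size=500)
        received = code[sent] + rng.normal(0.0, sigma, size=(500, n))
        # nearest-codeword decoding via squared Euclidean distance
        d2 = ((received[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)
        errs += int(np.sum(d2.argmin(axis=1) != sent))
        trials += 500
    print(f"n = {n:>2} variables, {M:>3} codewords: block error rate {errs / trials:.4f}")
# The stored rate per variable never changes, yet longer codewords make block
# errors rarer: this is the trade the coding theorem formalises, and it shows
# that the theorem can push it all the way to rates just below C with
# vanishing error.
```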

This is why the theorem is so ingenious: without constructing the code, it can show that there exists one that will come arbitrarily near to achieving perfect storage, as long as we demand only up to and including $C$ bits per symbol. It also shows that if we try to store $C+\epsilon$ bits per symbol, for any $\epsilon>0$, then the probability of errors approaches unity as the number of read/write cycles approaches infinity, whichever coding scheme we may use. $C$ truly does represent the exact capacity of a noisy continuous variable.