How do you prove $S=-\sum p\ln p$?

The theorem is called the noiseless coding theorem, and it is often proven in clunky ways in information theory books. The point of the theorem is to calculate the minimum number of bits per variable you need to encode the values of N identical random variables chosen from $1...K$, each of which takes the value $i$ with probability $p_i$. The minimum number of bits you need on average per variable in the large N limit is defined to be the information in the random variable. It is the minimum number of bits of information per variable you need to record in a computer so as to remember the values of the N copies with perfect fidelity.

If the variables are uniformly distributed, the answer is obvious: there are $K^N$ possibilities for N throws, and $2^{CN}$ possibilities for $CN$ bits, so $C=\log_2(K)$ for large N. With any fewer than CN bits, you will not be able to encode the values of the random variables, because they are all equally likely. With any more than this, you will have extra room. This is the information in a uniform random variable.
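As a sanity check, here is a quick numerical illustration of the uniform case (the fair-die example and the little script are mine, just to make the counting concrete):

```python
import math

# Uniform case: N throws of a fair K-sided die (K = 6 is an arbitrary choice).
K, N = 6, 1000

# Exact number of bits needed to label every one of the K**N equally likely
# outcomes: ceil(log2(K**N)), computed with integers to avoid float overflow.
exact_bits = (K**N - 1).bit_length()

print(exact_bits / N)   # ~2.585 bits per throw
print(math.log2(K))     # log2(6) ~ 2.585, the C of the counting argument
```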

For a general distribution, you can get the answer with a little bit of the law of large numbers. If you have many copies of the random variable, the probability of getting any particular sequence of values $n_1, ..., n_N$ is

$$ P(n_1, n_2, ... , n_N) = \prod_{j=1}^N p_{n_j}$$

This probability is dominated for large N by those configurations where the number of values of type $i$ is equal to $Np_i$, since this is the mean number of type $i$'s. So the value of P on any typical configuration is:

$$ P(n_1,...,n_N) = \prod_{i=1}^K p_i^{Np_i} = e^{N\sum p_i \log(p_i)}$$

So for those possibilities whose probability is not extremely small, the probability is more or less constant and equal to the above value. The total number M(N) of these not-too-unlikely possibilities is then fixed by requiring that the sum of their probabilities equal 1:

$$M(N) \propto e^{ - N \sum p_i \log(p_i)}$$
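Both steps are easy to check numerically. The sketch below (the four-value distribution and the variable names are my own choices) samples long sequences to show that $-{1\over N}\log P$ concentrates at $\sum p_i \log(p_i)$ up to sign, and compares the multinomial count of typical configurations to $e^{-N\sum p_i \log(p_i)}$:

```python
import math, random

# An arbitrary example distribution over K = 4 values.
p = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
H = -sum(q * math.log(q) for q in p.values())     # -sum p_i log p_i, in nats

N = 100_000
values, weights = list(p), list(p.values())

# Concentration: -(1/N) log P of a random sequence is close to H.
sample = random.choices(values, weights=weights, k=N)
log_P = sum(math.log(p[x]) for x in sample)
print(-log_P / N, H)          # the two numbers agree to a few parts in a thousand

# Counting: the number of sequences with exactly N*p_i values of type i is the
# multinomial coefficient N! / prod_i (N p_i)!, and (1/N) log of it tends to H.
counts = [int(N * q) for q in weights]            # here the N*p_i are integers
log_M = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
print(log_M / N, H)           # Stirling makes these agree for large N
```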

To encode which of the M(N) possibilities is realized by the N picks, you therefore need a number of bits B(N) which is enough to encode all these possibilities:

$$2^{B(N)} \propto e^{ - N \sum p_i \log(p_i)}$$

which means that

$${B(N)\over N} = - \sum p_i \log_2(p_i)$$

And all subleading constants are washed out by the large N limit. This is the information, and the asymptotic equality above is the Shannon noiseless coding theorem. To make it rigorous, all you need are some careful bounds on the large number estimates.
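To see that this rate is actually attainable, here is a small sketch (the three-value distribution and the helper function are illustrative choices of mine): give value $i$ a codeword of length $\lceil -\log_2(p_i)\rceil$ (the Kraft inequality guarantees a prefix code with these lengths exists), and do the same for blocks of $n$ symbols; the expected cost per symbol is then squeezed between the entropy and the entropy plus $1/n$:

```python
import math

p = [0.6, 0.3, 0.1]                              # an arbitrary non-dyadic distribution
H = -sum(q * math.log2(q) for q in p)            # entropy in bits per symbol

def bits_per_symbol(probs, n):
    """Expected length of a Shannon code (codeword length ceil(-log2 q))
    applied to blocks of n symbols, divided by n."""
    block_probs = [1.0]
    for _ in range(n):                           # product distribution over blocks
        block_probs = [a * b for a in block_probs for b in probs]
    expected_len = sum(q * math.ceil(-math.log2(q)) for q in block_probs)
    return expected_len / n

for n in (1, 2, 4, 8):
    print(n, bits_per_symbol(p, n), H)           # rate lies in [H, H + 1/n]
```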

Replica coincidences

There is another interpretation of the Shannon entropy in terms of coincidences which is interesting. Consider the probability that you pick two values of the random variable, and you get the same value twice:

$$P_2 = \sum p_i^2$$

This is clearly a measure of how many different values there are to select from: the fewer effectively distinct values, the more likely a coincidence. If you ask for the probability that you get the same value k times in k throws, it is

$$ P_k = \sum p_i \, p_i^{k-1} = \sum p_i^k$$

If you ask what the probability of a coincidence is after $k=1+\epsilon$ throws, you get the Shannon entropy: expanding to first order in $\epsilon$,

$$ P_{1+\epsilon} = \sum p_i e^{\epsilon \ln(p_i)} \approx 1 + \epsilon \sum p_i \ln(p_i)$$

so the entropy shows up as the coefficient of $-\epsilon$. This is like the replica trick, so I think it is good to keep in mind.
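A quick numerical check of this limit (my own sketch, with an arbitrary three-value distribution):

```python
import math

p = [0.6, 0.3, 0.1]
S = -sum(q * math.log(q) for q in p)             # Shannon entropy in nats

def P(k):
    return sum(q ** k for q in p)                # probability of k coincident throws

for eps in (0.1, 0.01, 0.001):
    print(eps, (1 - P(1 + eps)) / eps, S)        # converges to S as eps -> 0
```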

Entropy from information

To recover statistical mechanics from the Shannon information, you are given:

  • the values of the macroscopic conserved quantities (or their thermodynamic conjugates): energy, momentum, angular momentum, charge, and particle number
  • the macroscopic constraints (or their thermodynamic conjugates): volume, positions of macroscopic objects, etc.

Then the statistical distribution of the microscopic configuration is the maximum entropy distribution (as little information known to you as possible) on phase space, subject to the constraint that the expectation values of these quantities match the given macroscopic values.
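For example (a standard step, spelled out here just for concreteness), if the only constraint is that the average energy is $E$, maximize $-\sum p_i \ln(p_i)$ with Lagrange multipliers $\alpha$ for normalization and $\beta$ for the energy:

$$ \frac{\partial}{\partial p_i}\left[-\sum_j p_j \ln(p_j) - \alpha\left(\sum_j p_j - 1\right) - \beta\left(\sum_j p_j E_j - E\right)\right] = -\ln(p_i) - 1 - \alpha - \beta E_i = 0 $$

which gives the Boltzmann distribution

$$ p_i = {e^{-\beta E_i}\over Z}, \qquad Z = \sum_i e^{-\beta E_i}$$

with $\beta$ fixed by the energy constraint, i.e. the canonical ensemble of statistical mechanics.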


The best (IMHO) derivation of the $\sum p \log p$ formula from basic postulates is the one given originally by Shannon:

Shannon, C. E., 1948, 'A Mathematical Theory of Communication', Bell System Technical Journal; http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6773024

However, Shannon was concerned not with physics but with telegraphy, so his proof appears in the context of information transmission rather than statistical mechanics. To see the relevance of Shannon's work to physics, the best references are papers by Edwin Jaynes. He wrote dozens of papers on the subject. My favorite is the admittedly rather long

Jaynes, E. T., 1979, 'Where do we Stand on Maximum Entropy?', in The Maximum Entropy Formalism, R. D. Levine and M. Tribus (eds.), M. I. T. Press, Cambridge, MA, p. 15; http://bayes.wustl.edu/etj/articles/stand.on.entropy.pdf