Information entropy and physics correlation

I hope that my answers below will all be helpful.

  1. There is more than one way to think about this, but the one I find most helpful is to think of thermodynamic entropy as a specific instance of Shannon entropy. Shannon entropy is defined by the formula $$ H = -\sum_i p_i \log p_i, $$ but this formula has many different applications, and the symbols $p_i$ have different meanings depending on what the formula is used for. Shannon thought of them as the probabilities of different messages or symbols being sent over a communication channel, but Shannon's formula has since found plenty of other applications as well. One specific thing you can apply it to is the set of microscopic states of a physical system. If the probabilities $p_i$ represent the equilibrium probabilities for a thermodynamic system to be in microscopic states $i$, then you have the thermodynamic entropy. (Very often it is multiplied by Boltzmann's constant in this case, to put it into units of $JK^{-1}$ --- see below.) If they represent something else (such as, for example, a non-equilibrium ensemble) then you just have a different instance of the Shannon entropy. So in short, the thermodynamic entropy is a Shannon entropy, but not necessarily vice versa.

    (One should note, though, that this isn't the way it developed historically --- the formula was in use in physics before Shannon realised that it could be generalised, and the entropy was a known quantity before that formula was invented. For a very good overview of the historical development of information theory and physics, see Jaynes' paper "Where do we stand on maximum entropy?" It is very long, and quite old, but well worth the effort.)

  2. The paper linked above will also help with this. Essentially, the Shannon entropy is the formula quoted above; the Gibbs entropy is that same formula applied to the microscopic states of a physical system (so that sometimes it's called the Gibbs-Shannon entropy); the Boltzmann entropy is $\log W$, which is a special case of the Gibbs-Shannon entropy that was historically discovered first; and the von Neumann entropy is the quantum version of the Gibbs-Shannon entropy.

  3. This is straightforward. The physical definition of the entropy is $$ S = -k_B \sum_i p_i \log p_i, $$ where the logarithms have base $e$, and $k_B \approx 1.38\times 10^{-23} JK^{-1}$ is Boltzmann's constant. Physicists generally consider $\log p_i$ to be unitless (rather than having units of nats), so the expression has units of $JK^{-1}$ overall. Comparing this to the definition of $H$ above (with units of nats) we have $$ 1\,\mathrm{nat} = k_B\,JK^{-1}, $$ i.e. the conversion factor is just Boltzmann's constant.

    If we want to express $H$ in bits then we have to change the base of the logarithm from $e$ to 2, which we do by dividing by $\ln 2$: $$ H_\text{bit} = -\sum_i p_i \log_2 p_i = -\sum_i p_i \frac{\ln p_i}{\ln 2} = \frac{H_\text{nat}}{\ln 2}. $$ So we have $$ 1\,\mathrm{bit} = \ln 2\,\,\mathrm{nat}, $$ and therefore $$ 1\,\mathrm{bit} = k_B\ln 2\,JK^{-1} \approx 9.57\times 10^{-24} JK^{-1}. $$

    You will see this conversion factor, for example, in Landauer's principle, according to which erasing one bit requires dissipating at least $k_B T \ln 2$ joules of energy. This is really just saying that deleting a bit (and therefore lowering the entropy by one bit) requires raising the entropy of the heat bath by one bit, or $k_B \ln 2\,\,JK^{-1}$. For a heat bath at temperature $T$ this can be done by raising its energy by $k_B T \ln 2\,\, J$. (A short numerical sketch of these conversions appears after this list.)

    As for the intuitive interpretation of nats, this is indeed a little tricky. The reason nats are used is that they're mathematically more convenient. (If you take the derivative you won't get factors of $\ln 2$ appearing all the time.) But it doesn't make nice intuitive sense to think of distinguishing between 2.718 things, so it's probably better just to think of a nat as $\frac{1}{\ln 2}$ bits, and remember that it's defined that way for mathematical convenience.
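If it helps to see the conversions above as numbers, here is a minimal sketch in Python (the temperature value is arbitrary, chosen only for illustration) of the nat/bit/$JK^{-1}$ conversion factors and the Landauer bound:

```python
import math

k_B = 1.380649e-23  # Boltzmann's constant in J/K (exact in the 2019 SI)

# One nat of Shannon entropy corresponds to k_B in thermodynamic units,
# and one bit is ln(2) nats.
nat_in_JK = k_B                      # J/K per nat
bit_in_JK = k_B * math.log(2)        # J/K per bit, roughly 9.57e-24

# Landauer bound: erasing one bit at temperature T costs at least
# k_B * T * ln(2) joules, dissipated into the heat bath.
T = 300.0                            # an arbitrary room-ish temperature in K
landauer_cost = k_B * T * math.log(2)

print(f"1 bit ~ {bit_in_JK:.3e} J/K")
print(f"Landauer bound at {T:.0f} K ~ {landauer_cost:.3e} J")
```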


Question 1:

The Boltzmann entropy $S_B=k_B\ln\Omega(E)$ is valid only for the microcanonical ensemble. In the microcanonical ensemble, all accessible microstates (accessible = they have energy $E$, at least to within some uncertainty $\delta E$) have equal probability. So if $r$ is an index that labels microstates, we have $$ p_r=C, $$ where $C$ is a constant. Normalization then means that $C=1/\Omega(E)$, where $\Omega(E)$ is the number of accessible microstates.

Let us define a more general entropy as $$ S_S=-k_B\sum_r p_r\ln(p_r), $$ valid for an arbitrary probability distribution $p_r$. What happens if $p_r=C$? Then we have $$ S_S=-k_B\Omega(E)C\ln(C)=-k_B\ln\left(\frac{1}{\Omega(E)}\right)=k_B\ln(\Omega(E))=S_B. $$ The only remaining questions are why $k_B$ appears and why the logarithm is $\ln$ instead of $\log_2$.
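As a quick numerical sanity check of this reduction (the value of $\Omega$ is arbitrary), one can compare the two forms directly:

```python
import math

k_B = 1.380649e-23          # J/K
Omega = 10**6               # number of accessible microstates (arbitrary)

# Uniform (microcanonical) distribution: p_r = 1/Omega for every microstate r.
p = 1.0 / Omega
S_S = -k_B * Omega * p * math.log(p)   # Gibbs/Shannon form, summed analytically
S_B = k_B * math.log(Omega)            # Boltzmann form

print(S_S, S_B)   # identical up to floating-point rounding
```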

The point is, the multiplicative constant doesn't really matter, and different logarithms are related by multiplicative constants. What we have is that in the microcanonical ensemble, temperature is defined as $$ \frac{1}{T}=\frac{\partial S_B}{\partial E}. $$ Early phenomenological thermodynamicists, on the other hand, had no clue about what temperature actually is, so they invented a unit, the kelvin ($K$), for it. From the perspective of equilibrium statistical mechanics, it is far more natural to measure temperature in units of energy rather than kelvin. So, whatever multiplicative constants happen to be in the formula for entropy, they essentially act as conversion factors between units of temperature and units of energy. And aside from multiplicative factors, $S_S$ is the Shannon entropy. So they are essentially the same, with the understanding that Boltzmann's entropy is a special case for microcanonical ensembles.

Interesting tidbit: Consider the Shannon entropy $S_S$ as a functional of the probability distribution: $$ S_S[p]=-\sum_r p_r\ln(p_r). $$ Here I set $k_B=1$. What are the critical points of this functional? We can do calculus of variations, but we vary only over probability distributions, so we need to enforce $\sum_r p_r=1$. The functional to be varied is then $$ F[p]=-\sum_r p_r\ln(p_r)-\gamma\left(\sum_r p_r-1\right), $$ where $\gamma$ is a Lagrange multiplier. After variation we get $$ \delta F[p]=-\sum_r\left(\delta p_r\ln(p_r)+p_r\frac{1}{p_r}\delta p_r+\gamma\delta p_r\right). $$ Setting this to 0 gives $$ \ln(p_r)=-(1+\gamma)\Rightarrow p_r=e^{-1-\gamma}=C, $$ where $C$ can be determined from normalization.

So basically, the microcanonical ensemble is precisely the one which maximizes entropy.
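Here is a small numerical cross-check of that variational result, sketched with scipy's SLSQP optimizer (the number of states and the starting distribution are arbitrary): maximizing $-\sum_r p_r\ln p_r$ subject only to normalization lands on the uniform distribution.

```python
import numpy as np
from scipy.optimize import minimize

n = 5  # number of microstates (arbitrary)

def neg_entropy(p):
    # Negative Shannon entropy with k_B = 1; the small offset guards log(0).
    return float(np.sum(p * np.log(p + 1e-12)))

result = minimize(
    neg_entropy,
    x0=np.random.dirichlet(np.ones(n)),   # a random starting distribution
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],
)

print(result.x)   # numerically close to the uniform distribution (all 0.2)
```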

Question 2: I am not sure which one is meant by the Gibbs entropy (probably the "modified" Shannon entropy?), but they are basically all the same, up to different formulations and different conventions for temperature. The von Neumann entropy is of course quantum mechanical, but it reduces to the usual entropy if you diagonalize the density matrix.
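As a sketch of what "diagonalize the density matrix" means in practice (the particular $\rho$ below is just a made-up mixed state), the von Neumann entropy is the Shannon formula applied to the eigenvalues of $\rho$:

```python
import numpy as np

# A made-up 2x2 density matrix (a mixed qubit state): Hermitian, trace 1,
# positive semi-definite.
rho = np.array([[0.7, 0.2],
                [0.2, 0.3]])

# Diagonalizing rho gives its eigenvalues, which form a probability
# distribution; the von Neumann entropy is the Shannon entropy of that
# distribution (here in nats, with k_B = 1).
eigvals = np.linalg.eigvalsh(rho)
eigvals = eigvals[eigvals > 1e-12]            # drop numerically-zero eigenvalues
S_vN = -np.sum(eigvals * np.log(eigvals))

print(S_vN)
```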

If you are curious about the meaning of entropy, I think you should drop strict information theory and just look at probability theory. It is probably simpler to start from the negative of the entropy, $I=\sum_r p_r\ln(p_r)$; the entropy $-I$ essentially measures how much knowledge you'd gain if you were to learn which state the system is in. Assume the probability distribution is such that only one state has nonzero probability, so $p_r=1$ for a specific $r$ and 0 for the rest. Then the entropy is zero. And indeed, since that state is the only realizable state, you gain absolutely no information if somebody tells you what the state is. On the other hand, if all states are equiprobable, then you have absolutely no basis to "guess" the state of the system without knowing anything about it. If someone tells you the state of the system, you gain quite a lot of information. If the probability distribution is "spiky", then the entropy is lower than if it were even, because if you "guess" that the state of the system lies in the "spiky" region, you'd be right more often than not.

So I somewhat retract my statement and say that it isn't so much about how much knowledge one would gain if they were told the state of the system (though clearly, it is related), but rather how likely it is that you can guess which state of the system is realized, just by knowing the distribution. For a "spiky" distribution, the system is likely near the spike, so it is pretty guessable. For a system that is evenly distributed, your guess is worthless. It is a measure of "spikiness", of how evenly the system is distributed over its accessible states.
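A tiny illustration of this "guessability" picture, with two made-up distributions over ten states: the peaked one has a much lower entropy and a much more reliable best guess than the even one.

```python
import numpy as np

def entropy(p):
    # Shannon entropy in nats, skipping zero-probability states.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

even = np.full(10, 0.1)                   # evenly spread over 10 states
spiky = np.array([0.91] + [0.01] * 9)     # almost all weight on one state

for name, p in [("even", even), ("spiky", spiky)]:
    print(name, "entropy:", round(float(entropy(p)), 3),
          "best guess succeeds with p =", float(p.max()))
```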

Question 3: I cannot really answer this directly, mainly because my knowledge of information theory isn't that deep, so I'll only repeat what I already said in 1: the multiplicative constant with units of $J/K$ is only needed to make contact with what the phenomenological thermodynamicists of old defined as temperature. In the microcanonical ensemble, the entropy is given by the logarithm of the number of accessible microstates, which depends on the energy. Inverse temperature is the response of the entropy to a change in energy, so if the entropy is taken to be dimensionless, temperature has dimensions of energy. With that said, if you defined the entropy in the microcanonical ensemble as $$ S=\log_2(\Omega(E)), $$ then temperature would have units of $J/\text{bit}$, if you'd like. And if you kept a prefactor $k_B$ but gave it units of energy, so that $S=k_B\log_2(\Omega(E))$, then the unit of temperature would be $1/\text{bit}$.
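To see these units in action, here is a sketch with a toy model of my own choosing ($N$ two-level systems with level spacing $\epsilon$; all numbers arbitrary): the microcanonical entropy measured in bits, and the resulting temperature in $J/\text{bit}$ estimated by a finite difference.

```python
import math

# Toy model: N two-level systems with level spacing eps; a configuration with
# n excited systems has energy n*eps and Omega(n) = C(N, n) microstates.
N = 1000
eps = 1.0e-21     # level spacing in joules (arbitrary)

def S_bits(n):
    # Microcanonical entropy in bits: log2 of the number of microstates.
    return math.log2(math.comb(N, n))

# 1/T = dS/dE, estimated by a finite difference around n excitations.
n = 200
dS = S_bits(n + 1) - S_bits(n)   # in bits
dE = eps                         # in joules
T = dE / dS                      # joules per bit

print(f"T ~ {T:.3e} J/bit")
```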

Edit - clarifications:

I cannot shake the nagging feeling that I did not answer this question satisfactorily, so I'd like to clarify certain points.

Related to Question 3, I think (but I might be wrong, as I am not an expert in this field) that relating temperature to information is somewhat futile, at least beyond superficialities. Temperature is only defined in a meaningful way for equilibrium systems. Specifically, temperature is only defined for microcanonical ensembles. Realize that it is not meaningful to talk about microcanonical ensembles that do not describe equilibrium systems. Non-equilibrium systems have time-dependent probability distributions, but a microcanonical ensemble is in a very specific distribution (the even distribution), so you cannot have time evolution if this evenness is to be kept. For all other ensembles, temperature is defined by being in equilibrium with another system such that, together, they form a microcanonical ensemble.

On the other hand, entropy/information is meaningful as soon as you have a probability distribution.

Related to the interpretation of entropy, I think it is probably best not to think about it either in the context of information theory or in that of thermodynamics. Even if those two fields were the main inspiration for the concept of entropy, it is a concept in probability theory. Both information theorists and thermodynamicists use entropy for their own nefarious purposes, so it is best to abstract it away.

Entropy is simply a number associated with a probability distribution. I have thought things through, and I think I can give a better account of what it means than I did in the main answer. Instead of considering $S=-\sum_r p_r\ln(p_r)$, let us consider $I_r=-\ln(p_r)$, where $p_r$ is the probability of a specific state. Let us call this the "information" of the state $r$.

Since $p_r$ may take on values between $0$ and $1$, and $I_r$ is a monotonically decreasing function of $p_r$, we need to consider only the limiting cases, 0 and 1.

If $p_r=1$, then $I_r=0$. In this case, the system is not probabilistic but deterministic. Thus, there is no information to be gained if a wizard suddenly tells us that the system is in state $r$. It is trivial. No information content.

On the other hand, if $p_r=0$, then $I_r=\infty$. This case is singular, so it is difficult to interpret. Basically, if a wizard told us that the system is in $r$, he'd be lying. But if we consider the case where $p_r=\epsilon$ is very small but nonzero, $I_r$ is very large (and approaches infinity as $\epsilon\to 0$). If a wizard told us that the system is in $r$, a very unlikely state, we'd be surprised. It would, in some sense, net us a great deal of information, since it is very unlikely that the system is in $r$.

Entropy is then $$ S=\sum_r p_rI_r=\left< I\right>, $$ the expectation value of the information. So it is a kind of "average information content" of the distribution. If the distribution is even, we know very little about the state of the system, since it could be in any of them. If the distribution is spiky, we pretty much know that the system is near the spike. Entropy parametrizes our ignorance about the system, if we know only the distribution and nothing else.
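A final minimal sketch of this picture, using a made-up four-state distribution: compute the per-state information $I_r=-\ln p_r$ and take its expectation value.

```python
import numpy as np

# A made-up distribution over four states.
p = np.array([0.5, 0.25, 0.2, 0.05])

# Per-state "information" (surprisal): I_r = -ln(p_r); rare states carry more.
I = -np.log(p)

# Entropy as the expectation value of the information, S = <I> = sum_r p_r I_r.
S = np.sum(p * I)

print("I_r:", I)
print("S  :", S)   # same as -sum_r p_r ln(p_r)
```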