Why is the self-information $-\log(p(m))$?

There are several possible answers to this. One is to look at Shannon's definition of the entropy, $$ H = -\sum_i p_i \log p_i, $$ and note that it has the form of an expectation: $H$ is the expected value of $-\log p_i$, so it makes sense to give that latter quantity a name of its own. This is satisfying if you already understand why the entropy itself is a meaningful quantity. In Shannon's paper ('A Mathematical Theory of Communication', readily available online) he gives a very nice derivation of the definition of $H$ from first principles, and it's well worth going through it if you really want to understand where these quantities come from.
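
To make the "expectation of the surprisal" reading concrete, here is a minimal Python sketch (the distribution `p` is a made-up example, and logs are taken to base 2 so the numbers come out in bits, as discussed below):

```python
import math

# A made-up probability distribution over four outcomes.
p = [0.5, 0.25, 0.125, 0.125]

# Self-information (surprisal) of each outcome, in bits.
surprisal = [-math.log2(pi) for pi in p]   # [1.0, 2.0, 3.0, 3.0]

# Entropy = the expected value of the surprisal under the same distribution.
H = sum(pi * si for pi, si in zip(p, surprisal))
print(H)  # 1.75 bits
```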

But there's also a more intuitive, heuristic way to understand the self-information. Let's take the logarithm to base 2, so that the surprisal is measured in bits.

Now, let's imagine that we have an $n$-bit communication channel, and that all possible messages are equally likely. So for $n=8$, a message might be something like $00101110$. How much information do we gain if we receive that particular message? Intuitively, it should be fairly obvious that it's 8 bits.

Now let's ask: what was the probability of receiving that particular message? Well, it's $\left(\frac{1}{2}\right)^8$, or 1 in 256. Given this, we can calculate the self-information as $-\log_2 \frac{1}{256} = 8$ bits.

It's easy to see that this will work with any $n$. So the self-information is the (unique) way to convert probabilities into bits so that it agrees with our intuition in such simple example cases, while also not doing anything weird in less intuitive cases.
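
Here is a quick numerical check of that claim, as a Python sketch (the particular values of $n$ in the loop are just illustrative):

```python
import math

def surprisal_bits(p):
    """Self-information of an outcome with probability p, in bits."""
    return -math.log2(p)

# A uniformly random n-bit message has probability (1/2)**n,
# and its surprisal comes out to exactly n bits.
for n in (1, 8, 16, 32):
    print(n, surprisal_bits(0.5 ** n))  # prints n alongside n.0
```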

As others pointed out in the comments, when $p=1$ there is no information in the message. This actually fits nicely with our intuition as well. For example: how much information do you gain if I tell you we are located on the planet Earth? Well, none at all, since you already knew that. In your mind there was no probability that we could be located on any other planet, and therefore my message didn't tell you anything. In general if you know something is going to happen you won't be surprised when it does, so it makes sense that the surprisal should be zero in such cases.


For me, the thing that really gives Shannon's definition legs is Shannon's Noiseless Coding Theorem. It proves the following remarkable and important fact:

Let an information source send a message made up of statistically independent symbols, suppose that these symbols belong to an $N$-letter alphabet, and let the probability of transmission of the $j^{th}$ symbol be $p_j$.

Suppose now that the source sends through a communication channel that transmits 0s and 1s, and that it can noiselessly send $h$ bits per second, but no faster (you might imagine a telecommunications link that can reliably switch between its "off" and "on" states $h$ times per second).

Suppose now we send $M$ symbols through this link at a rate of $s$ symbols per second (note that there will need to be a coding scheme, such as ASCII or Unicode, to transform our symbols into sequences of 0s and 1s).

Then the noiseless coding theorem shows that there exists a coding scheme such that the probability of error $\to 0$ as $M\to\infty$, provided that $\frac{h}{s} > H = -\sum\limits_{j=1}^N p_j\,\log_2 p_j$ (logarithms taken to base 2, so that $H$ is measured in bits per symbol).

Conversely, if $\frac{h}{s} < H = -\sum\limits_{j=1}^N p_j\,\log_2 p_j$, then the probability of transmission error $\to 1$ as $M\to\infty$, no matter what coding scheme may be used.
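
To make the threshold concrete, here is a small Python sketch with a made-up source distribution and made-up rates (the values of `probs`, `h` and `s` are purely illustrative, not part of the theorem):

```python
import math

# Made-up source: an N = 4 letter alphabet with these symbol probabilities.
probs = [0.5, 0.25, 0.125, 0.125]

# Source entropy in bits per symbol.
H = -sum(p * math.log2(p) for p in probs)  # 1.75 bits/symbol

h = 2000.0  # channel capacity, bits per second (illustrative)
s = 1000.0  # source rate, symbols per second (illustrative)

bits_per_symbol = h / s  # 2.0 bits of channel capacity per source symbol

# The noiseless coding theorem says reliable transmission is possible
# (for long enough blocks) precisely when h/s exceeds H.
print(bits_per_symbol > H)  # True: 2.0 > 1.75, so a suitable code exists
```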

In other words, if you allocate $H$ bits per symbol in your link capacity, and if you are willing to buffer up messages so that they are sent in big chunks $M$ symbols long, then you can find a coding scheme that will make the probability of transmission error arbitrarily small. "You are bound to succeed" by making $M$ big enough.

If you allocate $H-\epsilon$ bits per symbol, where $\epsilon>0$, then there will certainly be transmission errors no matter how small $\epsilon$ is. "If you scrimp on the channel's maximum transmission rate, even a teeny tiny bit, you are bound to fail".

$H$ really is a measure of exactly how fast you have to signal, and not a jot more, in bits per symbol, if you want to transmit messages reliably from this information source.
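
As one concrete illustration that such coding schemes exist, here is a sketch of Huffman coding in Python. Huffman codes are not mentioned in the theorem itself, and coding one symbol at a time only guarantees an average length within one bit of $H$; it's the block coding in the theorem (letting $M$ grow) that closes the remaining gap. The example distribution is made up:

```python
import heapq
import math

def huffman_code(probs):
    """Build a binary prefix code (symbol -> bit string) for a dict of probabilities."""
    # Heap entries are (probability, tie_breaker, tree); a tree is either a symbol
    # or a pair of subtrees. The tie_breaker keeps comparisons well defined.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, tie, (t1, t2)))
        tie += 1

    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node: branch on 0/1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: assign the accumulated bits
            code[tree] = prefix or "0"       # a single-symbol alphabet still gets 1 bit
    walk(heap[0][2], "")
    return code

# Made-up source distribution over a 4-letter alphabet.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

H = -sum(p * math.log2(p) for p in probs.values())            # 1.75 bits/symbol
code = huffman_code(probs)
avg_len = sum(p * len(code[sym]) for sym, p in probs.items()) # 1.75 bits/symbol here

print(code)        # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(H, avg_len)  # for this (dyadic) source the code meets H exactly
```

For this dyadic distribution the Huffman code hits $H$ exactly; for general probabilities its average length lies between $H$ and $H+1$ bits per symbol, and encoding long blocks of symbols at once pushes the per-symbol cost down towards $H$.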

See also the appendix to E. T. Jaynes, "Information Theory and Statistical Mechanics", which shows that $H$ can also be construed as the unique function that fulfills three reasonable properties of continuity, monotonicity and the so-called "composition law".