Concentration inequality for entropy from a sample

Actually, Bernstein's inequality does not really require boundedness of the i.i.d. random summands; a finite exponential moment of the absolute value of a random summand will suffice. However, here we can just use Markov's inequality.

Let $X,X_1,\dots,X_n$ be independent identically distributed random variables (i.i.d. r.v.'s) such that $P(X=z)=\mu(z)\in(0,1)$ for all $z$ in a finite set $Z$, with $\sum_z\mu(z)=1$. Let $Y:=-\ln\mu(X)$, $Y_i:=-\ln\mu(X_i)$, and $S_n:=\frac1n\,\sum_1^n Y_i$. Then $$ES_n=EY=-\sum_z\mu(z)\ln\mu(z)=:H(\mu)>0 $$ and $$Ee^{hY}=\sum_z\mu(z)^{1-h},\quad Ee^{hS_n}=(Ee^{hY})^n \tag{1} $$ for all real $h$. To avoid trivialities, assume that $\max Y:=-\min_z\ln\mu(z)>-\max_z\ln\mu(z)=:\min Y$. Then $$\min Y<EY=H(\mu)<\max Y.$$
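
As a quick numerical illustration (a minimal sketch, not part of the argument; the distribution $\mu$ below is an arbitrary example of mine), one can simulate $S_n$ and watch it concentrate around $H(\mu)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary example distribution mu on the finite set Z = {0, 1, 2, 3}.
mu = np.array([0.5, 0.25, 0.15, 0.10])

# Shannon entropy H(mu) = -sum_z mu(z) ln mu(z) (natural logarithm, as above).
H = -np.sum(mu * np.log(mu))

# Simulate S_n = (1/n) sum_i Y_i with Y_i = -ln mu(X_i), several times.
n, reps = 1000, 5
X = rng.choice(len(mu), size=(reps, n), p=mu)
S_n = (-np.log(mu[X])).mean(axis=1)

print("H(mu) =", round(H, 4))
print("S_n   =", np.round(S_n, 4))
```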

Now take any real $t$ such that $H(\mu)=EY\le t<\max Y$. For all real $h\ge0$, by Markov's inequality, $$P(S_n\ge t)\le\exp\{-nht+\ln Ee^{nhS_n}\}=\exp\{-nht+n\ln Ee^{hY}\}. \tag{2} $$ The derivative of the exponent $-nht+n\ln Ee^{hY}$ in $h$ is $-nt+n\frac{EYe^{hY}}{Ee^{hY}}$, which strictly and continuously increases from $-nt+nEY\le 0$ at $h=0$ to the limit $-nt+n\max Y>0$ as $h\to\infty$, and so, the upper bound $\exp\{-nht+n\ln Ee^{hY}\}$ on the right-tail probability $P(S_n\ge t)$ is minimized when $h=h_{t,+}=h_{t,\mu,+}$ is the only nonnegative root of the equation $$m(h):=m_\mu(h):=\frac{EYe^{hY}}{Ee^{hY}}=t. \tag{3} $$
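
To see this monotonicity concretely, here is a tiny numerical sketch (again with an arbitrary example $\mu$; not part of the argument) evaluating $m(h)$ at a few values of $h$ and checking that it increases from $H(\mu)$ toward $\max Y$:

```python
import numpy as np

mu = np.array([0.5, 0.25, 0.15, 0.10])   # example distribution on Z = {0, 1, 2, 3}

def m(h):
    # m(h) = E[Y e^{hY}] / E[e^{hY}] with Y = -ln mu(X),
    # i.e. -sum_z ln(mu(z)) mu(z)^{1-h} / sum_z mu(z)^{1-h}.
    w = mu ** (1.0 - h)
    return -np.sum(np.log(mu) * w) / np.sum(w)

print("H(mu) =", -np.sum(mu * np.log(mu)))   # equals m(0)
print("m(0)  =", m(0.0))
print("m(5)  =", m(5.0))
print("m(50) =", m(50.0))
print("max Y =", -np.log(mu.min()))          # limit of m(h) as h -> infinity
```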

Similarly, for any real $t$ such that $H(\mu)=EY\ge t>\min Y$, the best upper exponential bound on the left-tail probability $P(S_n\le t)$ is $\exp\{-nht+n\ln Ee^{hY}\}$, where now $h=h_{t,-}=h_{t,\mu,-}$ is the only nonpositive root of the equation $(3)$.

Thus, $$P(S_n\ge t)\le e^{-na_+(t)},\quad\text{where $a_+(t):=h_{t,+}t-\ln Ee^{h_{t,+}Y}>0$} \tag{4} $$ if $H(\mu)=EY<t<\max Y$, and $$P(S_n\le t)\le e^{-na_-(t)},\quad\text{where $a_-(t):=h_{t,-}t-\ln Ee^{h_{t,-}Y}>0$} \tag{5} $$ if $H(\mu)=EY>t>\min Y$. So, these bounds decrease exponentially in $n$ if $t$ is fixed. By formulas (3.7) and (3.8) in [Chernoff], bounds $(4)$ and $(5)$ cannot be improved by replacing $a_\pm(t)$ with greater values.

In view of $(1)$, equation $(3)$ can be rewritten as $$\sum_z(t+\ln\mu(z))\mu(z)^{1-h}=0, \tag{6} $$ and this equation can be easily solved for $h$ numerically if the set $Z$ is not too large.
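
For instance, here is a minimal sketch (assuming SciPy is available; the example $\mu$, the choice of $t$, and the root bracket $[0,50]$ are arbitrary assumptions of mine) that solves $(6)$ for the nonnegative root $h_{t,+}$ with `scipy.optimize.brentq` and evaluates the right-tail bound $(4)$; the nonpositive root for $(5)$ can be found the same way with a bracket of negative $h$:

```python
import numpy as np
from scipy.optimize import brentq

mu = np.array([0.5, 0.25, 0.15, 0.10])   # example distribution, sums to 1
H = -np.sum(mu * np.log(mu))             # H(mu) = EY

def eq6(h, t):
    # Left-hand side of (6): sum_z (t + ln mu(z)) mu(z)^{1-h}.
    return np.sum((t + np.log(mu)) * mu ** (1.0 - h))

def a_plus(t, h_max=50.0):
    # Nonnegative root h_{t,+} of (6), then a_+(t) as in (4).
    h = brentq(eq6, 0.0, h_max, args=(t,))
    return h * t - np.log(np.sum(mu ** (1.0 - h)))

t = H + 0.2                              # some t with H(mu) < t < max Y
n = 1000
print("a_+(t)               =", a_plus(t))
print("bound on P(S_n >= t) =", np.exp(-n * a_plus(t)))
```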

All this is of course well known, even in the general case of i.i.d. random summands with a finite exponential moment of the absolute value. Basically, in this particular situation I have just added the first equality in $(1)$ and rewritten $(3)$ as $(6)$.


Here's a step that seems nice enough to point out. It still leaves a parameter to pick, and I'm not sure it's ever better than applying Bernstein, but it does something different.

We can get a probability bound in terms of how much $S_n$ exceeds the Rényi entropy $H_{\alpha}$ of $\mu$ (equivalently, worded in terms of the $\ell_{\alpha}$ norm of $\mu$), for any $0 < \alpha < 1$. The unresolved question is whether we can pick $\alpha$ to get a nice closed form of some kind. Maybe someone more clever than I am can speak to that.

Claim. Let $X_1,\dots,X_n$ be i.i.d. according to $\mu$ and $Y_i = \log(1/\mu(X_i))$; let $S_n = \frac{1}{n} \sum_{i=1}^n Y_i$. Then for any $0 < \alpha < 1$, \begin{align} \Pr[ S_n \geq t ] &\leq 2^{-n (1-\alpha) \left( t - H_{\alpha}(\mu) \right) } \\ &= 2^{-n \left( (1-\alpha)t - \alpha \log \| \mu \|_{\alpha} \right) } . \end{align} Here I'm writing $\mu = (\mu_1,\dots,\mu_m)$ as a vector of probabilities. Note that $H_{\alpha}$ is decreasing in $\alpha$ and $H_1 = H$, Shannon entropy. So as $n \to \infty$, we can pick $\alpha \to 1$ and get tail bounds for $t \to H(\mu)$.

Proof. Using the general Chernoff method, \begin{align} \Pr[S_n \geq t] &= \Pr\left[ 2^{\lambda S_n} \geq 2^{\lambda t}\right] & (\forall \lambda > 0) \\ &\leq \frac{\mathbb{E} 2^{\lambda S_n} }{2^{\lambda t}} & (\text{Markov's}). \end{align} We have \begin{align} \mathbb{E} 2^{\lambda S_n} &= \left( \mathbb{E} 2^{\frac{\lambda}{n} Y_1} \right)^n \\ &= \left( \mathbb{E} \mu(X_1)^{-\lambda/n} \right)^n \\ &= \left( \sum_{j=1}^m \mu_j^{1-\lambda/n} \right)^n . \end{align} Hence \begin{align} \Pr[S_n \geq t] \leq 2^{-n \left(\frac{\lambda}{n} t - \log \sum_j \mu_j^{1-\lambda/n} \right)} . \end{align} Pick $\lambda$ such that $1-\lambda/n = \alpha$ for the chosen $\alpha \in (0,1)$. In other words, $\frac{\lambda}{n} = 1-\alpha$, and factoring this out and substituting, \begin{align} \Pr[S_n \geq t] \leq 2^{-n (1-\alpha) \left(t - \frac{1}{1-\alpha} \log \sum_j \mu_j^{\alpha} \right)} . \end{align} Since $H_{\alpha}(\mu) = \frac{1}{1-\alpha} \log \sum_j \mu_j^{\alpha}$, this is exactly the claimed bound. $\square$
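
As a small sanity check of the claim (a sketch with an arbitrary example $\mu$; the grid of $\alpha$ values is just for illustration), one can evaluate the bound $2^{-n\left((1-\alpha)t - \log\sum_j \mu_j^{\alpha}\right)}$ over a grid of $\alpha$ and keep the best one:

```python
import numpy as np

mu = np.array([0.5, 0.25, 0.15, 0.10])   # example distribution
H = -np.sum(mu * np.log2(mu))            # Shannon entropy, in bits (log base 2)

def renyi_bound(t, n, alpha):
    # Claimed bound: Pr[S_n >= t] <= 2^{-n((1-alpha) t - log2 sum_j mu_j^alpha)}.
    exponent = (1.0 - alpha) * t - np.log2(np.sum(mu ** alpha))
    return 2.0 ** (-n * exponent)

t, n = H + 0.2, 1000
alphas = np.linspace(0.01, 0.99, 99)
bounds = np.array([renyi_bound(t, n, a) for a in alphas])
print("best alpha:", alphas[np.argmin(bounds)])
print("best bound:", bounds.min())
```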