Where Does the Hessian Matrix Come From (and Why Does It Work)?

The Fundamental Strategy of Calculus is to take a nonlinear function (difficult) and approximate it locally by a linear function (easy). If $f:\mathbb R^n \to \mathbb R$ is differentiable at $x_0$, then our local linear approximation for $f$ is $$ f(x) \approx f(x_0) + \nabla f(x_0)^T(x - x_0). $$ But why not approximate $f$ instead by a quadratic function? The best quadratic approximation to a smooth function $f:\mathbb R^n \to \mathbb R$ near $x_0$ is $$ f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac12 (x - x_0)^T Hf(x_0)(x - x_0) $$ where $Hf(x_0)$ is the Hessian of $f$ at $x_0$.
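To see this concretely, here is a small symbolic sketch (using SymPy; the particular function $f(x,y)=x^3+xy^2$ and the base point $(1,2)$ are arbitrary choices of mine, not anything from the discussion above) that builds the gradient and the Hessian and assembles the quadratic approximation:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 + x*y**2                                   # example function (my choice)

grad = sp.Matrix([sp.diff(f, v) for v in (x, y)])   # gradient of f
H = sp.hessian(f, [x, y])                           # Hessian matrix of f

x0 = {x: 1, y: 2}                                   # base point (my choice)
d = sp.Matrix([x - 1, y - 2])                       # displacement x - x0

quad = (f.subs(x0)
        + (grad.subs(x0).T * d)[0]
        + sp.Rational(1, 2) * (d.T * H.subs(x0) * d)[0])

print(H)                 # Matrix([[6*x, 2*y], [2*y, 2*x]])
print(sp.expand(quad))   # the quadratic approximation of f near (1, 2)
```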


The Hessian is an essential part of the multidimensional Taylor expansion of a sufficiently smooth function. Total differentiability of a function $f:U\to\mathbb R$ at a point $x_0\in U$, where $U\subseteq \mathbb R^n$ is open, means that there is a linear map $L:\mathbb R^n\to \mathbb R$ such that

$$\lim_{x\to x_0}\frac{f(x)-[f(x_0)+L(x-x_0)]}{\Vert x-x_0\Vert}=0.$$

That's the definition of total differentiability. The term in brackets is the first order Taylor approximation of $f$ around $x_0$, and we call $L$ the total differential of $f$ at $x_0$. The equation tells us that as $x$ approaches $x_0$, the difference between $f$ and its Taylor approximation shrinks faster than $\Vert x-x_0\Vert$ does. We could also derive that $L$ is represented by the gradient, i.e. $L(v)=\nabla f(x_0)^T v$, but I'll skip this.
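Here is a quick numerical illustration of that limit (the function $f(x,y)=\sin(x)e^y$, the base point, and the direction below are my own choices): the remainder of the first order approximation, divided by $\Vert x-x_0\Vert$, keeps shrinking as $x\to x_0$.

```python
import numpy as np

def f(v):
    x, y = v
    return np.sin(x) * np.exp(y)

def grad_f(v):
    x, y = v
    return np.array([np.cos(x) * np.exp(y), np.sin(x) * np.exp(y)])

x0 = np.array([0.3, -0.2])                  # base point (my choice)
u = np.array([1.0, 2.0]) / np.sqrt(5.0)     # fixed unit direction, shrinking step

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    d = t * u
    remainder = f(x0 + d) - (f(x0) + grad_f(x0) @ d)
    print(f"||d|| = {t:.0e}   remainder / ||d|| = {remainder / np.linalg.norm(d):.3e}")
```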

Now if $f$ is twice totally differentiable at $x_0$, then additionally there is a symmetric bilinear form $B:\mathbb R^n\times\mathbb R^n\to\mathbb R$ such that

$$\lim_{x\to x_0}\frac{f(x)-[f(x_0)+L(x-x_0)+\frac{1}{2}B(x-x_0,x-x_0)]}{\Vert x-x_0\Vert^2}=0.$$

This is not a definition but the statement of one of several versions of Taylor's theorem. The term in brackets is now the second order Taylor approximation; we call the matrix representation of $B$ the Hessian of $f$ at $x_0$, so that $B(v,w)=v^T \mathrm Hf(x_0)\, w$. The Hessian also happens to be the matrix of the total differential of the map $x\mapsto \nabla f(x)$ at $x_0$, which would allow us to derive its components, but again, I'll skip that.
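And the analogous check for the second order statement (same made-up function as in the previous sketch, with its Hessian worked out by hand): the remainder now keeps shrinking even after dividing by $\Vert x-x_0\Vert^2$.

```python
import numpy as np

def f(v):
    x, y = v
    return np.sin(x) * np.exp(y)

def grad_f(v):
    x, y = v
    return np.array([np.cos(x) * np.exp(y), np.sin(x) * np.exp(y)])

def hess_f(v):
    # Hessian of sin(x) * exp(y), computed by hand
    x, y = v
    return np.array([[-np.sin(x) * np.exp(y), np.cos(x) * np.exp(y)],
                     [ np.cos(x) * np.exp(y), np.sin(x) * np.exp(y)]])

x0 = np.array([0.3, -0.2])
u = np.array([1.0, 2.0]) / np.sqrt(5.0)

for t in [1e-1, 1e-2, 1e-3]:
    d = t * u
    taylor2 = f(x0) + grad_f(x0) @ d + 0.5 * d @ hess_f(x0) @ d
    print(f"||d|| = {t:.0e}   remainder / ||d||^2 = "
          f"{(f(x0 + d) - taylor2) / np.linalg.norm(d)**2:.3e}")
```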

With this, the Taylor approximation of a twice totally differentiable function becomes

$$f(x)\approx f(x_0)+\nabla f(x_0)^T(x-x_0)+\frac{1}{2}(x-x_0)^T\,\mathrm Hf(x_0)\,(x-x_0).$$

From here it should be intuitively clear why the Hessian tells us the type of a critical point. If $\nabla f(x_0)=0$, then the Taylor approximation is just a constant plus the Hessian term. If the Hessian is positive definite, this term only increases as $x-x_0$ moves away from $0$ (and thus $x$ moves away from $x_0$); if it is negative definite, it only decreases. So $x_0$ must be a local minimum or a local maximum, respectively. If the Hessian is indefinite, however, the term increases as $x$ moves away from $x_0$ in some directions and decreases in others, so $x_0$ must be a saddle point.
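As a sketch of how this definiteness test looks in practice (the helper name is mine, and the example Hessians are hand-computed for $x^2+y^2$, $-(x^2+y^2)$ and $x^2-y^2$ at the origin), one can read off the type of a critical point from the eigenvalue signs of its Hessian:

```python
import numpy as np

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from the eigenvalues of its symmetric Hessian.
    A sketch of the argument above; it assumes the gradient already vanishes."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum (positive definite)"
    if np.all(eig < -tol):
        return "local maximum (negative definite)"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point (indefinite)"
    return "degenerate: test is inconclusive"

# Hessians at the origin of x^2 + y^2, -(x^2 + y^2), and x^2 - y^2:
print(classify_critical_point(np.array([[ 2.0, 0.0], [0.0,  2.0]])))
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -2.0]])))
print(classify_critical_point(np.array([[ 2.0, 0.0], [0.0, -2.0]])))
```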


Let's suppose for simplicity that the critical point we are trying to analyze is $p=(0,0)$.

Take some direction $u$ and let $\varphi_u(t)=f(tu)$. Computing $\varphi_u''(0)$, we are analyzing, by single-variable calculus, the concavity of the restriction of $f$ to the line through $p$ in the direction $u$ (i.e., the slice of the graph by the $(u,z)$ plane). For example, if this value is positive for every direction $u$, then $f$ has a local minimum at $p$.

Computing $\varphi_u''(0)$ with the chain rule, you arrive at $\langle \mathrm{Hess}f(p)\, u, u\rangle$. This alone tells us how the Hessian appears when analyzing whether a critical point is a local minimum, saddle, or local maximum. But let's understand why the determinant is relevant in the two-dimensional case.
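A quick numerical check of this identity (the function $f(x,y)=x^2-y^2+x^2y$, its critical point $p=(0,0)$, and its hand-computed Hessian are my own example): a finite-difference second derivative of $\varphi_u(t)=f(p+tu)$ should match $\langle \mathrm{Hess}f(p)\,u,u\rangle$.

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 - y**2 + x**2 * y   # p = (0, 0) is a critical point of this f

p = np.array([0.0, 0.0])
H = np.array([[2.0,  0.0],          # Hessian of f at p, computed by hand
              [0.0, -2.0]])

for angle in [0.0, np.pi / 4, np.pi / 2]:
    u = np.array([np.cos(angle), np.sin(angle)])
    h = 1e-4
    # central difference approximation of phi_u''(0), phi_u(t) = f(p + t*u)
    second_deriv = (f(p + h * u) - 2 * f(p) + f(p - h * u)) / h**2
    print(f"u = {u},  phi''(0) ~ {second_deriv:+.4f},  u^T H u = {u @ H @ u:+.4f}")
```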

It is known that if $A$ is a symmetric matrix, the function \begin{align} g:\mathbb{R}^n &\to \mathbb{R} \\ x &\mapsto \langle Ax,x \rangle, \end{align} when restricted to the sphere $S^{n-1}$, achieves its maximum and minimum values at eigenvectors of $A$. (You can prove this using Lagrange multipliers, for example.) Note that if $v$ is a unit eigenvector with eigenvalue $\lambda$, then $g(v)=\langle Av ,v \rangle=\langle \lambda v,v \rangle=\lambda$. So if all eigenvalues of the Hessian are positive, then $g$ is positive on the sphere and $p$ is a local minimum; if there is one positive eigenvalue and one negative one, then $p$ is a saddle; and if all eigenvalues are negative, then $p$ is a local maximum.
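One can also see this numerically (the symmetric matrix $A$ below is an arbitrary choice of mine): sampling $g(x)=\langle Ax,x\rangle$ over many random unit vectors, the observed minimum and maximum hug the smallest and largest eigenvalues of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0,  1.0],
              [1.0, -2.0]])                                 # symmetric example matrix

samples = rng.normal(size=(100_000, 2))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)   # project onto S^1
g = np.einsum('ij,jk,ik->i', samples, A, samples)           # g(x) = <Ax, x>

print("sampled min/max of g:", g.min(), g.max())
print("eigenvalues of A:    ", np.linalg.eigvalsh(A))
```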

Since the determinant is the product of the eigenvalues, in two dimensions its sign is enough to determine the signs of the eigenvalues, provided the Hessian is non-degenerate (nonzero determinant). If the determinant is positive, the eigenvalues are either both positive or both negative, so $p$ is a local minimum or a local maximum; we then look at, for example, the sign of $\partial_1^2f=\langle \mathrm{Hess}f(p)e_1,e_1 \rangle$ to determine which case. If the determinant is negative, the eigenvalues have opposite signs, so $p$ is a saddle.
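Putting the two-dimensional test together as a small sketch (the helper name is mine; the example Hessians are hand-computed for $x^2+3y^2$, $-x^2-y^2$ and $xy$ at the origin):

```python
import numpy as np

def second_derivative_test_2d(H, tol=1e-12):
    """2D determinant test for a critical point with Hessian H.
    A sketch of the argument above, not a general-purpose routine."""
    det = np.linalg.det(H)
    if det < -tol:
        return "saddle point"          # eigenvalues of opposite sign
    if det > tol:
        # both eigenvalues share a sign; f_xx = <H e1, e1> decides which
        return "local minimum" if H[0, 0] > 0 else "local maximum"
    return "degenerate: test is inconclusive"

# Hessians at the origin of x^2 + 3y^2, -x^2 - y^2, and x*y:
print(second_derivative_test_2d(np.array([[ 2.0, 0.0], [0.0,  6.0]])))
print(second_derivative_test_2d(np.array([[-2.0, 0.0], [0.0, -2.0]])))
print(second_derivative_test_2d(np.array([[ 0.0, 1.0], [1.0,  0.0]])))
```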