Newton's method in higher dimensions explained

I'll assume we're trying to minimize a twice continuously differentiable function $f$ defined on $\mathbb R^p$.

We wish to find $x$ such that $\nabla f(x) = 0$.

Given $x_n$, we would ideally like to find $\Delta x$ such that $\nabla f(x_n + \Delta x) = 0$. Rather than satisfying this requirement exactly (which would probably be too difficult), we instead use the approximation \begin{equation*} \nabla f(x_n + \Delta x) \approx \nabla f(x_n) + Hf(x_n) \Delta x. \end{equation*} Setting the right hand side equal to $0$ gives us \begin{equation*} \Delta x = -Hf(x_n)^{-1} \nabla f(x_n). \end{equation*} We can hope that $x_{n+1} = x_n + \Delta x$ will be an improvement on $x_n$.
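If it helps to see the update in code, here is a minimal NumPy sketch of the iteration (the helper name `newton_minimize` and the quadratic test function are illustrative choices of mine, not part of the derivation above):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Iterate x_{n+1} = x_n - Hf(x_n)^{-1} grad f(x_n) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve Hf(x) dx = -grad f(x) rather than forming the inverse explicitly.
        dx = np.linalg.solve(hess(x), -g)
        x = x + dx
    return x

# Illustrative test function: f(x, y) = (x - 1)^2 + 10 (y + 2)^2, minimized at (1, -2).
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_minimize(grad, hess, x0=[5.0, 5.0]))  # [ 1. -2.]
```

Because this test function is itself quadratic, the very first step lands exactly on the minimizer; for a general $f$, you instead get fast local convergence near a minimizer with positive definite Hessian.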


Here is another interpretation of Newton's method that I rather like.

Newton's method takes the known information of the function at a given point (value, gradient and Hessian), makes a quadratic approximation of that function, and minimizes that approximation.

More specifically, suppose $x_n$ is given, $g = \nabla f(x_n)$, and $H = \nabla^2 f(x_n)$. The quadratic approximation of $f$ at $x_n$ is a quadratic function $h(x)$ such that $h(x_n) = f(x_n)$, $\nabla h(x_n) = g$ and $\nabla^2 h(x_n) = H$. It turns out that $$ h(x) = \frac 12(x - x_n)^T H (x - x_n) + g^T (x - x_n) + f(x_n). $$ This function has a unique global minimum if and only if $H$ is positive definite, which is the standard assumption under which this minimization view of Newton's method makes sense. Assuming $H$ is positive definite, the minimum of $h$ is achieved at the point $x^*$ such that $\nabla h(x^*) = 0$. Since $$ \nabla h(x) = H(x - x_n) + g, $$ we get $x^* = x_n - H^{-1}g$.
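Here is a quick numerical sanity check of that formula, assuming an arbitrary positive definite $H$ generated with NumPy (the variable names are mine):

```python
import numpy as np

# Numerical check that the minimizer of the quadratic model
#   h(x) = 1/2 (x - x_n)^T H (x - x_n) + g^T (x - x_n) + f(x_n)
# is x* = x_n - H^{-1} g, for an arbitrary positive definite H.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A @ A.T + 3 * np.eye(3)            # positive definite
g = rng.standard_normal(3)             # plays the role of grad f(x_n)
x_n = rng.standard_normal(3)

x_star = x_n - np.linalg.solve(H, g)   # claimed minimizer

grad_h = H @ (x_star - x_n) + g        # gradient of h at x*
print(np.linalg.norm(grad_h))          # ~ 1e-16: x* is indeed the critical point
```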


The Hessian is simply the higher-dimensional generalization of the second derivative, and multiplying by its inverse is the non-commutative generalization of dividing by $f''$ in the one-dimensional case. I don't know whether it's reasonable to attempt a geometric argument along the lines of your link for the double generalization to optimization and to $n$ dimensions; it is probably better to just look up an actual proof.
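To make that correspondence concrete, the one-dimensional version of the update is \begin{equation*} x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}, \end{equation*} which is just Newton's root-finding iteration applied to $f'(x) = 0$; replacing division by $f''(x_n)$ with multiplication of the gradient by the inverse Hessian gives the multivariate step.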

It might be less confusing to look at the higher-dimensional Newton's method for root-finding before the version for optimization; that's the "nonlinear systems of equations" section of the Wikipedia article on Newton's method. Hubbard and Hubbard's book on linear algebra, multivariable calculus, and differential forms has the best treatment of the multivariate Newton's method that I know of.
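For reference, here is a minimal NumPy sketch of that root-finding version (the names `newton_root`, `F`, `J` and the example system are my own illustration, not taken from Wikipedia or from Hubbard and Hubbard):

```python
import numpy as np

def newton_root(F, J, x0, tol=1e-12, max_iter=50):
    """Newton's method for F(x) = 0: x_{n+1} = x_n - J(x_n)^{-1} F(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        # Solve J(x) dx = -F(x) for the Newton step.
        x = x + np.linalg.solve(J(x), -Fx)
    return x

# Illustrative system: the intersection of the unit circle with y = x^3.
F = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[1] - v[0]**3])
J = lambda v: np.array([[2 * v[0],      2 * v[1]],
                        [-3 * v[0]**2,  1.0]])
print(newton_root(F, J, x0=[1.0, 1.0]))  # approx [0.826, 0.564]
```

Applying `newton_root` with `F` set to the gradient of $f$ and `J` set to its Hessian recovers the optimization version discussed above, which is exactly the connection the root-finding viewpoint makes clear.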