Why does gradient descent work?

In two dimensions, like in the example of $f(x)=x^2$,at the point $(5,25)$ on the curve, there are only two ways to "move" and still be on the curve: to the left and to the right, and in our case, the value of the function decreases as we move to the left, and increases as we go to the right. No problems.

But, when we are on, for example the following function $z=f(x,y)$ which is a two variable function, which looks like the inverted dome below.

$f(x,y)$

Imagine that you are on top of the dome. Then, There are an infinite number of curves(see figure) you can move on and still stay on the surface. And, since our location is so symmetric, it does not matter which direction we move in- it will always be the same rate of change in all directions.

But, suppose we were not on such a symmetric location, say a little lower in any direction. Then, with a little more thought we can see that there is exactly one direction in which you must move in order to achieve maximum descent. That is downward on the curve connecting the point you are on to the top of the dome.


As for rigorous proof of this phenomenon,J.M's link in the comments is an excellent resource.


The concept of a direction of fastest descent only makes sense in more than one dimension. In your one-dimensional example, you can only go left or right, and the function increases in one direction and decreases in the other; there's nothing else to compare.

In several dimensions, however, you have a continuum of directions to choose from, and the directional derivative is maximal in the direction of the gradient.


In your example, the gradient at $x= 5$ is $10$, so the negative gradient is $-10$. Since the negative gradient is negative (not neccesarily the case, thinkk about $x= -5$), it decreases fastest in the negative direction.

The gradient can be interpreted as the slope of a tangent. In the one-variable case, the tangent is a line and the slope can be represented as a single number. In the multivariable case, let's say with a function like $f(x, y, z) = x^2 + zy^3 - \frac{x}{y^2}$, the gradient will be a vector with corresponding to the direction of fastest ascent. Which means the fastest descent will be the opposite direction: the direction of the negative gradient.