Why is any arbitrary directional derivative always recoverable from the gradient?

Why can't a simultaneous increase in x and y give a dramatically different result than either alone? (e.g. a function that rises in the x+ direction and the y+ direction, but falls dramatically along the diagonal?)

Well, it can, but then the function won't be differentiable. One concrete example of a function that has different behavior along the $x$ and $y$ axes than it has in between is the function $z = r\sin(2\theta)$, in cylindrical coordinates. This function is not differentiable at the origin. It is continuous at the origin and has slopes of $0$ in the $x$ and $y$ directions there - the $x$ and $y$ axes are both contained in the graph of the function. But in other directions the slope at the origin can be anything between $-1$ and $1$.
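For a quick numerical check, here is a short Python sketch (in Cartesian coordinates, $r\sin(2\theta)$ rewrites as $2xy/\sqrt{x^2+y^2}$) estimating the slope at the origin in the direction $(\cos\varphi, \sin\varphi)$:

```python
import numpy as np

# z = r*sin(2*theta), written in Cartesian coordinates: z = 2xy / sqrt(x^2 + y^2)
def f(x, y):
    r = np.hypot(x, y)
    return 0.0 if r == 0 else 2.0 * x * y / r

# Slope at the origin in the unit direction (cos(phi), sin(phi)),
# estimated as (f(t*u) - f(0,0)) / t for small t; here f(0,0) = 0.
t = 1e-8
for phi in np.linspace(0.0, np.pi / 2.0, 5):
    slope = f(t * np.cos(phi), t * np.sin(phi)) / t
    print(f"phi = {phi:.3f}, slope = {slope:.3f}")  # equals sin(2*phi)
```

The axis slopes ($\varphi = 0$ and $\varphi = \pi/2$) are $0$, while the diagonal slope ($\varphi = \pi/4$) is $1$, so no single plane can match all of these slopes at once.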

Remember that a point and two slopes in non-parallel directions are all that we need to completely determine a plane. So, if the tangent plane to the graph of $f(x,y)$ is well defined at a point, the slopes of the tangent plane in the $x^+$ and $y^+$ directions completely characterize the plane. A plane, being flat, can't increase along the $x^+$ and $y^+$ axes and decrease in between.
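Concretely, the tangent plane at a point $(a,b)$ is

$$z = f(a,b) + f_x(a,b)(x-a) + f_y(a,b)(y-b),$$

and its slope in the unit direction $(\cos\varphi, \sin\varphi)$ is

$$f_x(a,b)\cos\varphi + f_y(a,b)\sin\varphi.$$

If $f_x(a,b) \ge 0$ and $f_y(a,b) \ge 0$, this is non-negative for every $\varphi \in [0, \pi/2]$: a plane that rises along both axes rises everywhere in between.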

If a function tried to do that, it would not be differentiable at the point in question - it would not be well approximated by the plane that the gradient determines. This is the source of the definition of differentiability: a differentiable function has its slope in each direction determined by that direction and the slopes in the $x^+$ and $y^+$ directions.

The same thing happens in one dimension; we just get too used to it to see it. You might ask, "Why does the behavior of a function in the $x^+$ direction determine the behavior in the $x^-$ direction? Why can't a function rise in both the $x^+$ and $x^-$ directions?" Of course, a function can do that, as $y = |x|$ does. But then the function will not be differentiable at the point in question, because it will not be well approximated by the line that is determined by the rate of change in the $x^+$ direction.
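To see the failure explicitly: at $0$, the slope of $y = |x|$ in the $x^+$ direction is $+1$, so the candidate tangent line is $y = x$. But the one-sided rate of change in the $x^-$ direction is

$$\lim_{t\to0^+}\frac{|0-t| - |0|}{t} = 1,$$

whereas the line $y = x$ predicts $-1$. That mismatch is exactly the non-differentiability.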

The situation in two or more variables is no different. In one dimension, the slope in the $x^+$ direction determines a line. In two dimensions, the slopes in the $x^+$ and $y^+$ directions determine a plane. In either case, we define the function to be differentiable if, around the point we started with, the function is well approximated by that line or plane in every direction that we can go, given the number of dimensions we are working with.


As I said in the comments, this is just the chain rule. Here's the formal argument.

By definition, the directional derivative of $f:\mathbb{R}^n\supset U\to\mathbb{R}$ at $p$ in the direction of $u$ is

$$D_uf(p):=\lim_{t\to0}\frac{f(p+tu)-f(p)}{t}$$

But this is also the definition of derivative of $g(t):=f(p+tu)$ at $t=0$. In other words,

$$D_uf(p)=g'(0)$$

Now, $g$ is a composite function: $g=f\circ\ell$, where

$$\ell(t)=p+tu$$

By the multivariate chain rule (I'm assuming the hypotheses of that theorem are satisfied),

$$g'(0)=(f\circ\ell)'(0)=f'\big(\ell(0)\big)\ell'(0)$$

But $\ell(0)=p$, $\ell'(0)=u$, and $f'(p)=\nabla f(p)$.

We conclude

$$D_uf(p)=g'(0)=\nabla f(p)u$$

(You don't need the dot product in this expression, if you interpret $\nabla f$ correctly as a row vector.)
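As a numerical sanity check, here is a small Python sketch (the function $f(x,y) = x^2 + 3xy$, the point $p$, and the direction $u$ are arbitrary choices for illustration) comparing a finite-difference estimate of $D_uf(p)$ against $\nabla f(p)\cdot u$:

```python
import numpy as np

# An arbitrary smooth function and its hand-computed gradient
f = lambda x, y: x**2 + 3.0 * x * y
grad_f = lambda x, y: np.array([2.0 * x + 3.0 * y, 3.0 * x])

p = np.array([1.0, 2.0])            # base point
u = np.array([3.0, 4.0]) / 5.0      # unit direction

# Finite-difference estimate of the directional derivative at p along u
t = 1e-6
fd = (f(*(p + t * u)) - f(*p)) / t

# Gradient formula: D_u f(p) = grad f(p) . u
exact = grad_f(*p) @ u

print(fd, exact)  # both approximately 7.2
```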


As for the intuitive idea, the point is that in the infinitesimal limit (which is what derivatives measure), a differentiable function is well-approximated by a linear one, so its rate of change in any direction must be a linear function of the changes in the coordinate directions.
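In coordinates, with $u = (u_1, \dots, u_n)$, this linearity is the statement

$$D_uf(p) = u_1\frac{\partial f}{\partial x_1}(p) + \cdots + u_n\frac{\partial f}{\partial x_n}(p),$$

which is exactly $\nabla f(p)\cdot u$: the derivatives along the coordinate directions determine the derivative along every other direction.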


This is because of the definition of the gradient and, in general, of what it means for a function $f$ to be differentiable (at a point $p_0$, say).

Intuitively, because $f$ is differentiable at $p_0$, it is approximated really well by a linear function (which in the case of $\mathbb{R}^n$ is given exactly by the gradient). So it is reasonable to expect that in a small neighbourhood around $p_0$, $f$ behaves almost like a linear function. These approximations are made more explicit by Taylor's theorem.
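For example, the first-order Taylor expansion at $p_0$ says

$$f(p_0 + h) = f(p_0) + \nabla f(p_0)\cdot h + o(\|h\|) \quad \text{as } h \to 0,$$

which is precisely the sense in which $f$ is well approximated by a linear function near $p_0$.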

Let me say that this way of thinking, while not rigorous, provides a great deal of intuition behind very important theorems, e.g. the inverse function theorem.