Intuitive reasoning behind the Chain Rule in multiple variables?

The basic reason is that one is simply composing the derivatives just as one composes the functions. Derivatives are linear approximations to functions. When you compose the functions, you compose the linear approximations---not a surprise.

I'm going to try to expand on Harry Gindi's answer, because that was the only way I could grok it, but in somewhat simpler terms. The way to think of a derivative in multiple variables is as a linear approximation. In particular, let $f: \mathbb{R}^m \to \mathbb{R}^n$ and $q=f(p)$. Then near $p$, we can write $f$ as $q$ plus something linear plus some "noise" which "doesn't matter" (i.e. is little-oh of the distance to $p$): $f(p+h) = q + L(h) + o(\|h\|)$, where the linear map $L: \mathbb{R}^m \to \mathbb{R}^n$ is the derivative of $f$ at $p$.

Now, suppose $g: \mathbb{R}^n \to \mathbb{R}^s$ is some map and $r = g(q)$. We can approximate $g$ near $q$ by $r$ plus some linear map $N$ plus some "garbage" which is, again, small: $g(q+k) = r + N(k) + o(\|k\|)$.

For simplicity, I'm going to assume that $p,q,r$ are all zero. This is ok, because one can just move one's origin around a bit.

So, as before, applying $f$ to a point near zero corresponds loosely to applying the linear transformation $L$. Applying $g$ to a point near zero corresponds loosely to applying $N$. Hence applying $g \circ f$ corresponds up to some ignorable "garbage" to the map $N \circ L$.
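Spelled out, the "garbage" really is ignorable. With $p = q = r = 0$, substituting one approximation into the other gives

$$g(f(h)) = N\big(L(h) + o(\|h\|)\big) + o(\|f(h)\|) = (N \circ L)(h) + o(\|h\|),$$

since $N$ is linear (so $N(o(\|h\|)) = o(\|h\|)$) and $\|f(h)\| = O(\|h\|)$ (so $o(\|f(h)\|) = o(\|h\|)$).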

This means that $N \circ L$ is the linear approximation to $g \circ f$ at zero; in particular, this composition is the derivative of $g \circ f$ at zero.
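If you want to see this numerically, here is a minimal sketch (the particular maps $f$ and $g$ are made up for illustration): estimate the Jacobians of $f$, $g$, and $g \circ f$ by central differences and check that the last one is the matrix product of the other two.

```python
import numpy as np

def jacobian(func, point, eps=1e-6):
    """Estimate the Jacobian of func at point by central differences."""
    point = np.asarray(point, dtype=float)
    out_dim = np.asarray(func(point)).size
    J = np.zeros((out_dim, point.size))
    for j in range(point.size):
        step = np.zeros_like(point)
        step[j] = eps
        J[:, j] = (np.asarray(func(point + step)) -
                   np.asarray(func(point - step))) / (2 * eps)
    return J

# Made-up example maps: f: R^2 -> R^3 and g: R^3 -> R^2.
f = lambda v: np.array([v[0] * v[1], np.sin(v[0]), v[1] ** 2])
g = lambda w: np.array([w[0] + w[2], np.exp(w[1])])

p = np.array([0.5, -1.2])
L = jacobian(f, p)                   # linear approximation of f at p
N = jacobian(g, f(p))                # linear approximation of g at q = f(p)
C = jacobian(lambda v: g(f(v)), p)   # linear approximation of g ∘ f at p

# Chain rule: composing the maps multiplies the Jacobians.
print(np.allclose(C, N @ L, atol=1e-4))  # -> True
```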


The correct context of the chain rule is that taking the tangent bundle is functorial. A more down-to-earth answer is provided by working coordinate-free using linear algebra.

Suppose $f:X\to Y$ and $g:Y\to Z$ are functions between Banach spaces (a generalization of $\mathbb{R}^n$) such that $f$ is differentiable at $\vec{v}$ and $g$ is differentiable at $f(\vec{v})$. (Note that in the general case we must require that their derivatives are toplinear, i.e. continuous and linear, since not all linear maps are continuous in the context of infinite-dimensional spaces.)

Then, writing $T_{\vec{v}}(f)$ for the total differential of $f$ at $\vec{v}$, we see that the chain rule is equivalent to saying that:

$$T_{\vec{v}}(g\circ f)=T_{f(\vec{v})}(g) \circ T_{\vec{v}}(f).$$

The description you get with coordinates comes from this much simpler presentation as follows:

To derive the formula with coordinates (say, for example, in three dimensions), we present the total differentials (which are linear transformations) as their Jacobian matrices and test along the column vector ${}^t[x,y,z]$, where the left-hand exponent $t$ denotes the matrix transpose.

Note: When we present linear operators by their matrices, composition of linear transformations becomes matrix multiplication, and evaluation at a vector $\vec{w}$ becomes right-hand multiplication by a column matrix.
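To make the matrix picture concrete, here is one illustrative case: take $f:\mathbb{R}^3\to\mathbb{R}^2$ and $g:\mathbb{R}^2\to\mathbb{R}$, with coordinates $(x,y,z)$ on the source and $(u,v)$ on the middle space. Presenting the total differentials as Jacobians, the identity above reads

$$\begin{pmatrix} \frac{\partial (g\circ f)}{\partial x} & \frac{\partial (g\circ f)}{\partial y} & \frac{\partial (g\circ f)}{\partial z} \end{pmatrix} = \begin{pmatrix} \frac{\partial g}{\partial u} & \frac{\partial g}{\partial v} \end{pmatrix} \begin{pmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} & \frac{\partial f_1}{\partial z} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} & \frac{\partial f_2}{\partial z} \end{pmatrix}$$

(with the derivatives of $g$ evaluated at $f(\vec{v})$), and reading off the first entry gives the familiar coordinate formula $\frac{\partial (g\circ f)}{\partial x} = \frac{\partial g}{\partial u}\frac{\partial f_1}{\partial x} + \frac{\partial g}{\partial v}\frac{\partial f_2}{\partial x}$.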

The main point is that coordinates obscure what's actually going on here. The beauty of the coordinate-free definition is destroyed by the complicated description of matrix multiplication.


Think of it in terms of causality & superposition.

$$z = f(x,y)$$

where $x$ and $y$ are themselves functions of $t$.

If you keep $y$ fixed then $\frac{dz}{dt} = \frac{\partial f}{\partial x} \cdot \frac{dx}{dt}$.

If you keep $x$ fixed then $\frac{dz}{dt} = \frac{\partial f}{\partial y} \cdot \frac{dy}{dt}$.

Superposition says you can just add the two contributions together:

$$\frac{dz}{dt} = \frac{\partial f}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial f}{\partial y} \cdot \frac{dy}{dt}.$$
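A quick example to see the superposition in action: take $z = xy$ with $x = \cos t$ and $y = \sin t$. Then $\frac{\partial f}{\partial x} = y$ and $\frac{\partial f}{\partial y} = x$, so

$$\frac{dz}{dt} = (\sin t)(-\sin t) + (\cos t)(\cos t) = \cos^2 t - \sin^2 t = \cos 2t,$$

which agrees with differentiating $z = \cos t \sin t = \tfrac{1}{2}\sin 2t$ directly.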