Intuitive Proof of the Multivariable Chain Rule

The intuitive proof is very easy. The tough part is in verifying all the relevant inequalities. To understand the proof, you first need a firm understanding of what derivatives mean.

Given a function $f:V\to W$ between normed vector spaces (think $\Bbb{R}^n, \Bbb{R}^m$ if you wish), and a point $a\in V$, we say $f$ is differentiable at the point $a$ if there exists a (continuous) linear transformation $T:V \to W$ such that \begin{align} (\Delta f_a)(h):= f(a+h) - f(a) = T(h) + \phi(h), \end{align} where $\phi$ is a function such that $\lim\limits_{h\to 0}\frac{\phi(h)}{\lVert h\rVert}=0$. In this case, we can show $T$ is unique, and people denote it by the symbol $Df_a$ or $Df(a)$, or $df(a)$ or $df_a$, or $J_f(a)$, or even $f'(a)$. My choice is the notation $Df_a$ (or $Df_a(\cdot)$, to remind me that it's a linear transformation, so the $(\cdot)$ tells me where to "plug in" a vector $h$ if I want to make an evaluation).
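
To make the definition concrete, here is a small example (a choice of mine, nothing special about it): take $f:\Bbb{R}^2\to\Bbb{R}$, $f(x,y) = xy$, and a point $a = (a_1, a_2)$. Then for $h = (h_1, h_2)$, \begin{align} (\Delta f_a)(h) = (a_1+h_1)(a_2+h_2) - a_1a_2 = \underbrace{a_2h_1 + a_1h_2}_{Df_a(h)} + \underbrace{h_1h_2}_{\phi(h)}, \end{align} and indeed $\frac{|\phi(h)|}{\lVert h\rVert} \le \frac{\lVert h\rVert^2}{\lVert h\rVert} = \lVert h\rVert \to 0$, so the linear part $Df_a(h) = a_2h_1 + a_1h_2$ is the derivative.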

Notice the very nice memorable formula: $\Delta f_a(h) = Df_a(h) + \phi(h)$, which in words says that a function is differentiable at the point $a$ if the actual change in the function, $\Delta f_a(h)$, is equal to a linear approximation $Df_a(h)$ plus a correction term $\phi(h)$ (and the correction term must be small in the sense that it goes to zero faster than $\lVert h \rVert$). Differential calculus is, roughly speaking, a theory about local linear approximations to non-linear functions, and the definition above is just a more formal and precise way of stating this.

Now suppose you have two differentiable functions $f:V\to W$ and $g:U\to V$ ($U,V,W$ are all real normed vector spaces), and a point $a\in U$. The chain rule says that $f\circ g$ is differentiable at $a$ and that its derivative is given by such and such formula. Well, let's see very intuitively how we might come up with that: \begin{align} (\Delta(f\circ g)_a)(h) &:= (f\circ g)(a+h) - (f\circ g)(a) \\ &= f(g(a+h)) - f(g(a)) \\ &= f(g(a) + \Delta g_a(h)) - f(g(a)) \\ &= \Delta f_{g(a)}(\Delta g_a(h)) \end{align} So far I haven't done anything; all I did was rewrite things in a more recognizable form. Now, we apply the definition of differentiability: \begin{align} \begin{cases} \Delta f_{g(a)}(k) &= Df_{g(a)}(k) + \phi(k) \\ \Delta g_a(h) &= Dg_a(h) + \gamma(h) \end{cases} \end{align} where $\phi$ and $\gamma$ are the remainder terms for $f$ and $g$, at the points $g(a)$ and $a$ respectively. This yields: \begin{align} \Delta f_{g(a)}(\Delta g_a(h)) &= \Delta f_{g(a)}(Dg_a(h) + \gamma(h)) \\ &= Df_{g(a)}[Dg_a(h) + \gamma(h)] + \phi(Dg_a(h) + \gamma(h)) \\ &= Df_{g(a)}(Dg_a(h)) + Df_{g(a)}(\gamma(h)) + \phi(Dg_a(h) + \gamma(h)) \\ &= (Df_{g(a)} \circ Dg_a)(h) + \text{stuff} \end{align} If all you're interested in is intuition, then we can stop here. What we did is show that \begin{align} (\Delta (f\circ g)_a)(h) = (\Delta f_{g(a)} \circ \Delta g_a)(h) = (Df_{g(a)} \circ Dg_a)(h) + \text{stuff} \end{align} In other words, to linearly approximate the change in a composite function, all you have to do is compose the linear changes of each function $f$ and $g$ (at the appropriate points). The tough part in proving the chain rule is showing that the terms which I called "stuff" are so small that they go to zero faster than $\lVert h \rVert$ as $h\to 0$. This is essentially the proof of the chain rule formula \begin{align} D(f\circ g)_a &= Df_{g(a)} \circ Dg_a, \end{align} or, just to remind yourself that both sides are derivatives, and hence linear transformations, you might like to use the $(\cdot)$ notation to indicate where something should be plugged in to make an evaluation: \begin{align} [D(f\circ g)_a](\cdot) &= [Df_{g(a)} \circ Dg_a](\cdot) = Df_{g(a)}(Dg_a(\cdot)) \end{align}
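
To see the formula in action on a small example (again my own choice, purely for illustration), take $g:\Bbb{R}\to\Bbb{R}^2$, $g(t) = (t, t^2)$, and $f:\Bbb{R}^2\to\Bbb{R}$, $f(x,y) = xy$, so that $(f\circ g)(t) = t^3$. Here $Dg_t(h) = (h, 2th)$ and $Df_{(x,y)}(k_1,k_2) = yk_1 + xk_2$, so at the point $g(t) = (t, t^2)$, \begin{align} (Df_{g(t)} \circ Dg_t)(h) = t^2\cdot h + t\cdot(2th) = 3t^2 h, \end{align} which is exactly the derivative of $t\mapsto t^3$ acting on $h$.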

That's all there is to the chain rule. Now, this is extremely general, and you can use it to extract various special cases.


It is standard linear algebra that every linear transformation between vector spaces can be assigned a matrix once we choose a basis for the domain and target space. In the case of Cartesian spaces $\Bbb{R}^n$, we have the standard basis. Now, it shouldn't be too hard to prove that for a differentiable function $f:\Bbb{R}^n\to \Bbb{R}^m$, the matrix representation of $Df_a$ with respect to the standard bases is simply the Jacobian matrix of partial derivatives: \begin{align} [Df_a] &= \begin{pmatrix} (\partial_1f_1)(a) & \cdots & (\partial_nf_1)(a) \\ \vdots & \ddots & \vdots \\ (\partial_1f_m)(a) & \cdots & (\partial_nf_m)(a) \end{pmatrix} \end{align} In other words, what we're saying is that if $\xi_j\in \Bbb{R}^n$ denotes the vector with $0$ everywhere except a $1$ in the $j^{th}$ slot, and $\eta_i \in\Bbb{R}^m$ denotes the analogous standard basis vector, then \begin{align} Df_a(\xi_j) &= \sum_{i=1}^m (\partial_jf_i)(a)\cdot \eta_i \end{align}
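
As an example of such a Jacobian matrix (a standard one, my choice here), for the polar-coordinate map $f:\Bbb{R}^2\to\Bbb{R}^2$, $f(r,\theta) = (r\cos\theta, r\sin\theta)$, we have \begin{align} [Df_{(r,\theta)}] &= \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}, \end{align} i.e. the first column is the partial derivative with respect to $r$ and the second column is the partial derivative with respect to $\theta$.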

Now, since composition of linear transformations corresponds to multiplication of the respective matrices, we have $[D(f\circ g)_a] = [Df_{g(a)}]\cdot [Dg_a]$ (here $g:\Bbb{R}^n\to\Bbb{R}^m$ and $f:\Bbb{R}^m\to\Bbb{R}^p$, say), or if we decide to look at the $i,j$ entry of the matrix, we have that \begin{align} (\partial_j(f\circ g)_i)(a) &= \sum_{k=1}^m (\partial_kf_i)(g(a)) \cdot (\partial_jg_k)(a) = \sum_{k=1}^m (\partial_jg_k)(a)\cdot (\partial_kf_i)(g(a)) \end{align}
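
In the matrix picture, the small example from before (with $g(t) = (t,t^2)$ and $f(x,y) = xy$) reads \begin{align} [D(f\circ g)_t] = [Df_{g(t)}]\cdot [Dg_t] = \begin{pmatrix} t^2 & t \end{pmatrix} \begin{pmatrix} 1 \\ 2t \end{pmatrix} = \begin{pmatrix} 3t^2 \end{pmatrix}, \end{align} which is the $1\times 1$ Jacobian of $t\mapsto t^3$.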

Usually, people abuse notation slightly by using the same letters for different purposes, etc.; and if you really insist on using Leibniz notation for this, you can say something like "let $z = (z_1, \dots, z_p)$ be a function of $y = (y_1, \dots, y_m)$, and let $y$ be a function of $x=(x_1, \dots, x_n)$", then \begin{align} \dfrac{\partial z_i}{\partial x_j} &= \sum_{k=1}^m\dfrac{\partial z_i}{\partial y_k}\cdot \dfrac{\partial y_k}{\partial x_j}. \end{align}
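
In this notation, a quick worked example (my own choice of functions) is $z = x^2 + y^2$ with $x = r\cos\theta$ and $y = r\sin\theta$: \begin{align} \dfrac{\partial z}{\partial r} &= \dfrac{\partial z}{\partial x}\cdot\dfrac{\partial x}{\partial r} + \dfrac{\partial z}{\partial y}\cdot\dfrac{\partial y}{\partial r} = 2x\cos\theta + 2y\sin\theta = 2r(\cos^2\theta + \sin^2\theta) = 2r, \end{align} which agrees with differentiating $z = r^2$ directly.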


I would highly recommend you read Loomis and Sternberg's Advanced Calculus; the whole of chapter $3$ (well, at least up to section $3.9$). But if that's too much, you should definitely take a look at the introductory remarks on page 116, read section $3.6$ (which proves the chain rule at the end, but makes use of the results of $3.5$) and $3.9$ (which explains the relationship between the linear transformation $Df_a$ (they use the notation $df_a$ rather than $Df_a$) and the matrix elements).


To understand the chain rule, I think it is best to first consider maps $f,g$ without worrying about what the domains and codomains are.

You have $$\begin{cases} f(a+h) &\simeq f(a) +f^\prime(a).h\\ g(b+k) &\simeq g(b)+g^\prime(b).k \end{cases}$$

If $b=f(a)$, you get $(g \circ f)(a+h) \simeq (g\circ f)(a) + \big(g^\prime(f(a)) \cdot f^\prime(a)\big)(h)$.
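
Spelled out, the substitution behind that approximation is: with $b = f(a)$ and $k = f^\prime(a).h$, \begin{align} (g \circ f)(a+h) = g\big(f(a+h)\big) \simeq g\big(f(a) + f^\prime(a).h\big) \simeq g(f(a)) + g^\prime(f(a)).\big(f^\prime(a).h\big). \end{align}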

Then you just have to apply that to the special case where $u \mapsto (x_1(u), \dots, x_n(u))$ and $f : (x_1, \dots, x_n) \mapsto f(x_1, \dots , x_n)$, using matrix multiplication.
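
For instance, when $f$ is scalar-valued, the matrix product is just a row times a column (my rendering of that special case): \begin{align} \dfrac{d}{du} f\big(x_1(u), \dots, x_n(u)\big) &= \begin{pmatrix} \dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_n} \end{pmatrix} \begin{pmatrix} x_1^\prime(u) \\ \vdots \\ x_n^\prime(u) \end{pmatrix} = \sum_{i=1}^n \dfrac{\partial f}{\partial x_i}\big(x_1(u),\dots,x_n(u)\big)\, x_i^\prime(u). \end{align}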