Application of the chain rule to a $3$-layer neural network

Your final expression and solution are (essentially) correct.

First, a remark on notation: to be fully precise, one should also indicate the point at which a Jacobi derivative is taken; for example, some people write $(J_\theta f)(x,\theta)$ or something similar to refer to the partial Jacobi derivative of $f$ with respect to $\theta$ at the point $(x,\theta)$.

You should also be careful with notation such as $JL^2(L^1(x,\theta^1),\theta^2)$. Here you use it to denote the Jacobi matrix of the function $h(x,\theta^1,\theta^2):=L^2(L^1(x,\theta^1),\theta^2)$, so it would be good to say so explicitly. Other readers could be confused, because to them it might instead mean the Jacobi matrix of $L^2$ evaluated at the point $(L^1(x,\theta^1),\theta^2)$, which is a different mathematical object.

But in my opinion, it's fine to use alternative notation as long as you explain what you mean (or use it in a context where everyone follows the same convention).

An alternative solution would be to start with $$ J_\theta f= \begin{pmatrix} J_{\theta^1}f & J_{\theta^2}f & J_{\theta^3}f \end{pmatrix} $$ and then compute the partial Jacobi derivatives one by one. This approach is not necessarily faster, but it may be easier to convince yourself that the result is correct, because you have to "fight" fewer variables at once.
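For concreteness, here is how the three blocks expand under the standard Jacobian convention, assuming (as your notation suggests) that the network is $f(x,\theta) = L^3\big(L^2(L^1(x,\theta^1),\theta^2),\theta^3\big)$, and writing $z^1 := L^1(x,\theta^1)$ and $z^2 := L^2(z^1,\theta^2)$ for the intermediate outputs (these names are only for illustration):
\begin{align}
J_{\theta^3} f &= (J_{\theta^3} L^3)(z^2,\theta^3),\\
J_{\theta^2} f &= (J_{z^2} L^3)(z^2,\theta^3)\,(J_{\theta^2} L^2)(z^1,\theta^2),\\
J_{\theta^1} f &= (J_{z^2} L^3)(z^2,\theta^3)\,(J_{z^1} L^2)(z^1,\theta^2)\,(J_{\theta^1} L^1)(x,\theta^1).
\end{align}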


Let me derive backpropagation for the above 3-layer feedforward neural network. For more general networks (e.g. residual ones), the idea is similar and I'll give some hints for dealing with those at the end of this answer.

Suppose that we feed an input $x\in\mathbb{R}^d$ to the network to produce an output loss value $L\in\mathbb{R}$ (given by a differentiable loss function $\ell$): \begin{align} x_1 &= f_1(x, \theta_1),\\ x_2 &= f_2(x_1, \theta_2),\\ x_3 &= f_3(x_2, \theta_3),\\ L &= \ell(x_3). \end{align} The goal is to compute the gradient $\left(\frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \frac{\partial L}{\partial \theta_3}\right)$.
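To make the setup concrete, here is a minimal sketch of such a forward pass in NumPy. The choice of layers (affine map followed by $\tanh$), the half-squared-norm loss, and the names `layer` and `forward` are purely illustrative assumptions, not something specified in the question:

```python
import numpy as np

def layer(x, theta):
    """One layer f_k: an affine map followed by tanh; theta = (W, b)."""
    W, b = theta
    return np.tanh(W @ x + b)

def forward(x, theta1, theta2, theta3):
    """Forward pass of the 3-layer network; the intermediates are returned
    as well, since the backward pass will need them."""
    x1 = layer(x,  theta1)        # x1 = f1(x,  theta1)
    x2 = layer(x1, theta2)        # x2 = f2(x1, theta2)
    x3 = layer(x2, theta3)        # x3 = f3(x2, theta3)
    L  = 0.5 * np.sum(x3 ** 2)    # a simple differentiable loss l(x3)
    return x1, x2, x3, L
```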

Some notes on the notation:

  1. I always prefer subscripts to superscripts whenever possible, to avoid confusion with exponents. That is possible in our case, so I use subscripts.
  2. We could stop at $x_3$ and derive backpropagation from it, but I added a scalar loss function to reflect the real use case, and also to help you better understand my answer to your other question.
  3. I use explicit partial derivatives instead of your Jacobian notation because, on the one hand, I find yours error-prone, and on the other hand, many deep learning references use the notation I am using here, so hopefully this will make you more comfortable reading them later.

Now let's backprop. The computation graph in this case is simple: $\require{AMScd}$ \begin{CD} x @>>> x_1 @>>> x_2 @>>> x_3 @>>> L\\ @. @AAA @AAA @AAA\\ @. \theta_1 @. \theta_2 @. \theta_3 \end{CD} We go backward from the last node and use the chain rule: \begin{align} \frac{\partial L}{\partial x_3} &= \ell'(x_3),\\ \frac{\partial L}{\partial \theta_3} &= \frac{\partial x_3}{\partial \theta_3}\frac{\partial L}{\partial x_3},\\ \frac{\partial L}{\partial x_2} &= \frac{\partial x_3}{\partial x_2}\frac{\partial L}{\partial x_3},\\ \frac{\partial L}{\partial \theta_2} &= \frac{\partial x_2}{\partial \theta_2}\frac{\partial L}{\partial x_2},\\ \frac{\partial L}{\partial x_1} &= \frac{\partial x_2}{\partial x_1}\frac{\partial L}{\partial x_2},\\ \frac{\partial L}{\partial \theta_1} &= \frac{\partial x_1}{\partial \theta_1}\frac{\partial L}{\partial x_1}. \end{align} That is the exact order in which the derivative computation should be implemented. One can plug the intermediate terms into the main terms (i.e. the ones w.r.t. $\theta$) to get a direct formula, for example \begin{equation}\label{Lt1} \frac{\partial L}{\partial \theta_1} = \frac{\partial x_1}{\partial \theta_1} \frac{\partial x_2}{\partial x_1} \frac{\partial x_3}{\partial x_2}\frac{\partial L}{\partial x_3}. \tag{1} \end{equation} However, this should not be used in an implementation, as it would recompute the same quantities multiple times.
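Continuing the NumPy sketch from above (with the same assumed $\tanh$ layers and half-squared-norm loss; the `delta` variables are just illustrative names for the pre-activation gradients), that computation order could look like this, where each $\frac{\partial L}{\partial x_k}$ is propagated as a vector via vector-Jacobian products:

```python
def backward(x, x1, x2, x3, theta1, theta2, theta3):
    """Backward pass, in the exact order of the equations above.
    Multiplying by the (transposed) layer Jacobian is done via W.T @ (...)."""
    W1, b1 = theta1
    W2, b2 = theta2
    W3, b3 = theta3

    dL_dx3 = x3                                     # dL/dx3 = l'(x3) for l(y) = 0.5*||y||^2
    delta3 = (1.0 - x3 ** 2) * dL_dx3               # backprop through tanh of layer 3
    dL_dW3, dL_db3 = np.outer(delta3, x2), delta3   # dL/dtheta3
    dL_dx2 = W3.T @ delta3                          # dL/dx2

    delta2 = (1.0 - x2 ** 2) * dL_dx2
    dL_dW2, dL_db2 = np.outer(delta2, x1), delta2   # dL/dtheta2
    dL_dx1 = W2.T @ delta2                          # dL/dx1

    delta1 = (1.0 - x1 ** 2) * dL_dx1
    dL_dW1, dL_db1 = np.outer(delta1, x), delta1    # dL/dtheta1

    return (dL_dW1, dL_db1), (dL_dW2, dL_db2), (dL_dW3, dL_db3)
```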

Some other important notes, again on the notation:

  1. Because $x_3 = f_3(x_2, \theta_3)$, one can view $x_3$ as a function of $x_2$ and $\theta_3$, i.e. $x_3 := x_3(x_2, \theta_3)$, and thus $\frac{\partial x_3}{\partial \theta_3}$ denotes the partial derivative of that function with respect to $\theta_3$ (evaluated at $(x_2, \theta_3)$). Similarly for the other quantities. Using the Jacobian notation, \eqref{Lt1} for example can be written as: \begin{align} \frac{\partial L}{\partial \theta_1} &= (J_{\theta_1} f_1) (x, \theta_1) \times (J_{x_1} f_2) (x_1, \theta_2) \times (J_{x_2} f_3) (x_2, \theta_3) \times (J \ell)(x_3). \end{align} If, with a slight abuse of notation, we omit the points at which the functions are evaluated, then the above becomes: \begin{align} \frac{\partial L}{\partial \theta_1} &= J_{\theta_1} f_1 \times (J_{x_1} f_2) \times (J_{x_2} f_3) \times (J \ell). \end{align} I hope this makes things easier for you, since you seem to be familiar with the Jacobian notation. You can now easily compare this result with yours. (A small numerical check of this product form is sketched right after these notes.)

  2. A more rigorous presentation should use the total derivative notation instead of using $\partial$ everywhere, as I have done here (and as many deep learning references do). For example, one should write: $\newcommand{\dv}[1]{\operatorname{d}\!{#1}}$ \begin{equation} \frac{\dv{L}}{\dv{\theta_1}} = \frac{\partial x_1}{\partial \theta_1}\frac{\dv{L}}{\dv{x_1}}. \end{equation}
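As a sanity check of the Jacobian-product form in note 1, here is a small numerical experiment with the same illustrative $\tanh$ layers. To keep the Jacobians small it checks the input gradient $\mathrm{d}L/\mathrm{d}x$ rather than $\mathrm{d}L/\mathrm{d}\theta_1$, which uses the same chain of layer Jacobians with $J_x f_1$ in place of $J_{\theta_1} f_1$; the product is written in the standard row-vector-times-Jacobians order (the transpose of the convention used in the equations above):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
thetas = [(rng.normal(size=(d, d)), rng.normal(size=d)) for _ in range(3)]

def run(x):
    """Forward pass of a toy 3-layer tanh network; returns loss and intermediates."""
    x1 = np.tanh(thetas[0][0] @ x  + thetas[0][1])
    x2 = np.tanh(thetas[1][0] @ x1 + thetas[1][1])
    x3 = np.tanh(thetas[2][0] @ x2 + thetas[2][1])
    return 0.5 * np.sum(x3 ** 2), (x1, x2, x3)

x = rng.normal(size=d)
_, (x1, x2, x3) = run(x)

def J(x_out, W):
    """Jacobian of x_out = tanh(W x_in + b) with respect to x_in."""
    return np.diag(1.0 - x_out ** 2) @ W

# Standard convention: a row vector times the chain of layer Jacobians.
row  = x3.reshape(1, -1)                       # (J l)(x3) for l(y) = 0.5*||y||^2
grad = row @ J(x3, thetas[2][0]) @ J(x2, thetas[1][0]) @ J(x1, thetas[0][0])

# Central finite differences for comparison.
eps = 1e-6
fd  = np.array([(run(x + eps * e)[0] - run(x - eps * e)[0]) / (2 * eps)
                for e in np.eye(d)])
print(np.allclose(grad.ravel(), fd, atol=1e-5))   # expected: True
```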

For a more general computation graph, the principle is the same. One has to compute $\frac{\dv{L}}{\dv{x_i}}$ recursively for every node $x_i$ of the graph. The recursion lies in the fact that the derivative with respect to a node can be computed, using the chain rule, once the derivatives with respect to all of its children have been computed: \begin{equation} \frac{\dv{L}}{\dv{x_i}} = \sum_{j\in\mathrm{Children}(i)} \frac{\partial x_j}{\partial x_i} \frac{\dv{L}}{\dv{x_j}}. \end{equation}
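Purely as an illustration (none of these names come from an actual library), the recursion could be implemented as follows, assuming each hypothetical node object stores its list of `parents` and a method `vjp(parent, g)` that returns $\frac{\partial x_j}{\partial x_i}\,g$, i.e. the product of the local (transposed) Jacobian with an incoming gradient:

```python
def backprop(nodes, loss_node):
    """Accumulate dL/dx_i for every node that influences the loss.
    `nodes` must be in topological order (inputs first, loss last);
    iterating in reverse guarantees that every child of a node has been
    processed before the node itself."""
    grads = {loss_node: 1.0}                 # dL/dL = 1 (scalar loss)
    for node in reversed(nodes):
        g = grads.get(node)
        if g is None:
            continue                         # this node does not affect the loss
        for parent in node.parents:
            # chain rule: add this child's contribution to dL/d(parent)
            grads[parent] = grads.get(parent, 0.0) + node.vjp(parent, g)
    return grads
```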

Let me know if you have further questions.


Khue gave a good answer, and this is a comment to it.

Khue seems to be using the alternative convention for Jacobians, in which everything is transposed (cf. gradient vs derivative): $\newcommand{\dv}[1]{\operatorname{d}\!{#1}}$ \begin{equation} \frac{\dv{L}}{\dv{x_i}} = \sum_{j\in\mathrm{Children}(i)} \frac{\partial x_j}{\partial x_i} \frac{\dv{L}}{\dv{x_j}}. \end{equation}

The author of the question, warm_fish, uses the standard definition, in which case the chain rule is \begin{equation} \frac{\dv{L}}{\dv{x_i}} = \sum_{j\in\mathrm{Children}(i)} \frac{\dv{L}}{\dv{x_j}} \frac{\partial x_j}{\partial x_i}. \end{equation}
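For readers trying to keep the two conventions apart, a quick dimension check may help (writing $x_i\in\mathbb{R}^{n_i}$ and $x_j\in\mathbb{R}^{n_j}$). In the standard convention, $\frac{\dv{L}}{\dv{x_j}}\in\mathbb{R}^{1\times n_j}$ is a row vector and $\frac{\partial x_j}{\partial x_i}\in\mathbb{R}^{n_j\times n_i}$ is the usual Jacobian, so the product $\frac{\dv{L}}{\dv{x_j}}\frac{\partial x_j}{\partial x_i}$ lies in $\mathbb{R}^{1\times n_i}$. In the transposed (gradient) convention, $\frac{\dv{L}}{\dv{x_j}}\in\mathbb{R}^{n_j}$ is a column vector and $\frac{\partial x_j}{\partial x_i}\in\mathbb{R}^{n_i\times n_j}$ denotes the transposed Jacobian, so $\frac{\partial x_j}{\partial x_i}\frac{\dv{L}}{\dv{x_j}}$ lies in $\mathbb{R}^{n_i}$. The two results are transposes of each other.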
