How is backpropagation the same (or not) as reverse automatic differentiation?

In neural network training, we want to find a set of weights w that minimizes the error E(N(w,x) - y), where x is the training input, y is the training output, N is the network, and E is some error function.

The standard way to do an optimization like this is gradient descent, which uses the derivative of the network, say N'. We could represent the network as a matrix product and work this out by hand with matrix calculus, but we can also use automatic algorithms.
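As a minimal sketch of that setup, assuming a tiny linear "network" N(w, x) = x @ w and a squared-error E (the names loss, step, and true_w are illustrative, not from the text above), JAX's reverse-mode AD can supply the derivative for a plain gradient-descent loop:

```python
import jax
import jax.numpy as jnp

# Hypothetical toy setup: N(w, x) = x @ w and E(r) = mean(r**2),
# so the objective is E(N(w, x) - y).
def loss(w, x, y):
    residual = x @ w - y            # N(w, x) - y
    return jnp.mean(residual ** 2)  # E(...)

# jax.grad performs reverse-mode AD and returns dE/dw.
grad_loss = jax.grad(loss)

def step(w, x, y, lr=0.1):
    # One plain gradient-descent update.
    return w - lr * grad_loss(w, x, y)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 3))   # 32 training inputs in R^3
true_w = jnp.array([1.0, -2.0, 0.5])  # weights we would like to recover
y = x @ true_w                        # training outputs
w = jnp.zeros(3)
for _ in range(200):
    w = step(w, x, y)
```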

Backpropagation is a particular such algorithm, which has certain advantages. For example, it makes it easy to take the derivative with respect to only a selected subset of weights, as is needed for stochastic gradient descent. It also specifies how the feed-forward values (the actual network activations) are saved so that they are easily accessible when calculating the needed derivatives.
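To make the "saving the feed-forward values" point concrete, here is a minimal sketch of a hand-written forward and backward pass for a hypothetical one-hidden-layer network with a tanh hidden layer and mean-squared error; none of these names come from the answer above:

```python
import jax.numpy as jnp

def forward(params, x):
    # Feed-forward pass that also returns the intermediate values
    # (the cache) the backward pass will need.
    W1, b1, W2, b2 = params
    z1 = x @ W1 + b1      # hidden pre-activations
    a1 = jnp.tanh(z1)     # hidden activations
    out = a1 @ W2 + b2    # network output
    return out, (x, z1, a1)

def backward(params, cache, out, y):
    # Backward pass: push adjoints from the loss back to each weight,
    # reusing the values saved during the forward pass.
    W1, b1, W2, b2 = params
    x, z1, a1 = cache
    d_out = 2.0 * (out - y) / out.size       # dE/d_out for E = mean((out - y)**2)
    dW2 = a1.T @ d_out                       # adjoint of W2
    db2 = d_out.sum(axis=0)                  # adjoint of b2
    d_a1 = d_out @ W2.T                      # adjoint of the hidden activations
    d_z1 = d_a1 * (1.0 - jnp.tanh(z1) ** 2)  # tanh'(z1), from the cached z1
    dW1 = x.T @ d_z1                         # adjoint of W1
    db1 = d_z1.sum(axis=0)                   # adjoint of b1
    return dW1, db1, dW2, db2
```

Running jax.grad on the same loss would produce identical dW1, db1, dW2, db2, which is one way to check a hand-written backward pass against an AD system.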

You should be able to find the exact code for the specific algorithm in textbooks as well as online.


"What is the function being differentiated? What is the "special case?""

The most important distinction between backpropagation and reverse-mode AD is that reverse-mode AD computes the vector-Jacobian product of a vector-valued function from R^n -> R^m, while backpropagation computes the gradient of a scalar-valued function from R^n -> R. Backpropagation is therefore a special case of reverse-mode AD for scalar functions.
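As a sketch of that distinction in JAX (the functions f and loss here are made up for illustration): jax.vjp gives the general vector-Jacobian product for an R^3 -> R^2 function, while jax.grad handles the scalar case, and seeding the scalar VJP with 1.0 reproduces the gradient, which is exactly the backprop situation.

```python
import jax
import jax.numpy as jnp

# A vector-valued function f: R^3 -> R^2. Reverse-mode AD gives us
# vector-Jacobian products v^T J_f(x), not a single gradient.
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
y, f_vjp = jax.vjp(f, x)                    # y = f(x); f_vjp(v) computes v^T J_f(x)
(pullback,) = f_vjp(jnp.array([1.0, 0.0]))  # pull back a chosen output direction

# A scalar-valued function: R^3 -> R. Here the "vector" in the
# vector-Jacobian product is just the scalar 1.0, and the result is the
# full gradient -- this scalar special case is backpropagation.
def loss(x):
    return jnp.sum(x ** 2)

grad_value = jax.grad(loss)(x)
_, loss_vjp = jax.vjp(loss, x)
(same_grad,) = loss_vjp(jnp.array(1.0))     # seeding with 1.0 gives the same gradient
```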

When we train neural networks, we always have a scalar-valued loss function; that loss is the function being differentiated, so we are always using backpropagation. And since backprop is a special case of reverse-mode AD, we are also using reverse-mode AD whenever we train a neural network.

"Is it the adjoint values themselves that are used or the final gradient?"

The adjoint of a variable is the gradient of the loss function with respect to that variable. When we train a neural network, we use the gradients of the loss with respect to the parameters (weights, biases, etc.) to update those parameters. So we do use the adjoints, but only the adjoints of the parameters, and those adjoints are exactly the final gradient.
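As a sketch with a made-up toy linear model: jax.grad returns exactly the parameter adjoints (the gradient of the scalar loss with respect to each parameter), and those are what the update rule consumes.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy model; params holds the weights and biases.
def loss(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

params = {"w": jnp.zeros((3, 1)), "b": jnp.zeros(1)}
x = jnp.ones((8, 3))
y = jnp.ones((8, 1))

# The parameter adjoints: dLoss/dw and dLoss/db, in the same structure as params.
adjoints = jax.grad(loss)(params, x, y)

# Only the parameter adjoints feed the update; adjoints of intermediate
# values (like pred) are computed during the backward pass but then discarded.
lr = 0.1
params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, adjoints)
```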