Differentiation with respect to a matrix (residual sum of squares)?

So, what you have here is basically a functional: you feed in a matrix ($\mathbf{X}$) and a couple of vectors ($\mathbf{y}$ and $\beta$), and they combine in such a way that the output is just a number. What we need, then, is called a functional derivative.

Let $\epsilon > 0$ and let $\gamma$ be an arbitrary $p \times 1$ vector; then $$\frac{\partial \text{RSS}}{\partial \beta} \equiv \lim_{\epsilon \to 0} \Big((\epsilon \gamma^T)^{-1}\big(\text{RSS}(\beta + \epsilon \gamma) - \text{RSS}(\beta)\big) \Big), $$ where the "inverse" of $\epsilon \gamma^T$ is formal shorthand for dividing out the perturbation, explained below.

We're adding a small, arbitrary vector to $\beta$ and then seeing how that changes $\text{RSS}$. We 'divide out' this arbitrary vector at the end; I've used the transpose because $\beta$ and $\gamma$ enter the original functional through multiplication on the right, so dividing out from the left requires the transpose. All that is left is to evaluate these expressions.

$$\text{RSS}(\beta+\epsilon\gamma) = \left(\mathbf{y}-\mathbf{X}(\beta+\epsilon\gamma)\right)^{T}\left(\mathbf{y}-\mathbf{X}(\beta+\epsilon\gamma)\right) = \left((\mathbf{y}-\mathbf{X}\beta)^{T}-(\mathbf{X}\epsilon\gamma)^T\right)\left((\mathbf{y}-\mathbf{X}\beta)-\mathbf{X}\epsilon\gamma\right) $$ $$= (\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)-(\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\epsilon\gamma-(\mathbf{X}\epsilon\gamma)^T(\mathbf{y}-\mathbf{X}\beta)+(\mathbf{X}\epsilon\gamma)^T\mathbf{X}\epsilon\gamma $$ $$=\text{RSS}(\beta)- \epsilon \big((\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\gamma+(\mathbf{X}\gamma)^T(\mathbf{y}-\mathbf{X}\beta)\big) + \epsilon^2 (\mathbf{X}\gamma)^T\mathbf{X}\gamma\text{.} $$ So, $$\frac{\text{RSS}(\beta + \epsilon \gamma) - \text{RSS}(\beta)}{\epsilon} =-\big((\mathbf{y}-\mathbf{X}\beta)^{T}\mathbf{X}\gamma+(\mathbf{X}\gamma)^T(\mathbf{y}-\mathbf{X}\beta)\big) + \epsilon (\mathbf{X}\gamma)^T\mathbf{X}\gamma\text{.} $$
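This expansion is exact (no limit has been taken yet), so it can be sanity-checked numerically. A minimal NumPy sketch, with all data made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 3
X = rng.normal(size=(N, p))      # arbitrary design matrix
y = rng.normal(size=N)           # arbitrary response
beta = rng.normal(size=p)
gamma = rng.normal(size=p)       # arbitrary direction
eps = 0.1                        # the identity holds for any eps

def rss(b):
    r = y - X @ b
    return r @ r

r = y - X @ beta
lhs = rss(beta + eps * gamma)
rhs = (rss(beta)
       - eps * (r @ (X @ gamma) + (X @ gamma) @ r)
       + eps**2 * (X @ gamma) @ (X @ gamma))
print(np.isclose(lhs, rhs))      # True, up to floating-point error
```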

The third term, then, does not survive the limit; rewriting the remaining two terms so that $\gamma^T$ sits on the left, we are left with $$-\big((\gamma^T \mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta))+(\gamma^T \mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta))^T\big). $$

However, since both of these terms are $1 \times 1$ matrices, A.K.A. scalars, each term equals its transpose, and the limit collapses to $-2\gamma^T \mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$. 'Dividing out' $\gamma^T$ from the left, as the definition prescribes, we are left with $$\frac{\partial \text{RSS}}{\partial \beta} = -2 \mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)\text{.} $$
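As a quick numerical check of this result (not part of the derivation), the difference quotient along a random direction $\gamma$ should converge to $\gamma^T \cdot \big(-2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)\big)$ as $\epsilon \to 0$. A minimal NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 8, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)
gamma = rng.normal(size=p)

def rss(b):
    r = y - X @ b
    return r @ r

grad = -2 * X.T @ (y - X @ beta)           # the closed-form gradient
for eps in (1e-2, 1e-4, 1e-6):
    quotient = (rss(beta + eps * gamma) - rss(beta)) / eps
    print(eps, quotient, gamma @ grad)     # quotient -> gamma^T grad
```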


Wow, I asked this two years ago!

Since then, I've learned how to read this notation for quick computations.

Let $$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}, \qquad \beta = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{bmatrix}\text{.}$$

Then $\mathbf{X}\beta \in \mathbb{R}^N$ and $$\mathbf{X}\beta = \begin{bmatrix} \sum_{j=1}^{p}b_jx_{1j} \\ \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \implies \mathbf{y}-\mathbf{X}\beta=\begin{bmatrix} y_1 - \sum_{j=1}^{p}b_jx_{1j} \\ y_2 - \sum_{j=1}^{p}b_jx_{2j} \\ \vdots \\ y_N - \sum_{j=1}^{p}b_jx_{Nj} \end{bmatrix} \text{.}$$

Therefore, $$(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta) = \|\mathbf{y}-\mathbf{X}\beta \|^2 = \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)^2\text{.} $$

We have, for each $k = 1, \dots, p$, $$\dfrac{\partial \text{RSS}}{\partial b_k} = 2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)(-x_{ik}) = -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ik}\text{.}$$

Then $$\begin{align}\dfrac{\partial \text{RSS}}{\partial \beta} &= \begin{bmatrix} \dfrac{\partial \text{RSS}}{\partial b_1} \\ \dfrac{\partial \text{RSS}}{\partial b_2} \\ \vdots \\ \dfrac{\partial \text{RSS}}{\partial b_p} \end{bmatrix} \\ &= \begin{bmatrix} -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i2} \\ \vdots \\ \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ &= -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\text{.} \end{align}$$

For the second partial, as one might guess: $$\begin{align} \dfrac{\partial \text{RSS}}{\partial \beta^{T}} &= \begin{bmatrix} \dfrac{\partial \text{RSS}}{\partial b_1} & \dfrac{\partial \text{RSS}}{\partial b_2} & \cdots & \dfrac{\partial \text{RSS}}{\partial b_p} \end{bmatrix} \\ &= -2\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} & \cdots & \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix}\text{.} \end{align}$$

Now we "stack" to take the partial with respect to $\beta$: $$\begin{align} \dfrac{\partial^2\text{RSS}}{\partial \beta\,\partial\beta^{T}} &= \dfrac{\partial}{\partial\beta}\left(\dfrac{\partial \text{RSS}}{\partial \beta^{T}} \right) \\ &= \begin{bmatrix} -2\cdot \dfrac{\partial}{\partial b_1}\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} & \cdots & \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \\ \vdots \\ -2\cdot \dfrac{\partial}{\partial b_p}\begin{bmatrix} \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{i1} & \cdots & \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p}b_jx_{ij}\right)x_{ip} \end{bmatrix} \end{bmatrix} \\ &= \begin{bmatrix} -2\begin{bmatrix} -\sum_{i=1}^{N}x_{i1}^2 & \cdots & -\sum_{i=1}^{N}x_{i1}x_{ip} \end{bmatrix} \\ \vdots \\ -2\begin{bmatrix} -\sum_{i=1}^{N}x_{i1}x_{ip} & \cdots & -\sum_{i=1}^{N}x_{ip}^2 \end{bmatrix} \end{bmatrix} \\ &= 2\mathbf{X}^{T}\mathbf{X}\text{.} \end{align}$$
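Both results are easy to confirm numerically. The sketch below (NumPy, arbitrary made-up data) computes the componentwise sums and compares them with $-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)$, then finite-differences the gradient to recover $2\mathbf{X}^{T}\mathbf{X}$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 10, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
b = rng.normal(size=p)

resid = y - X @ b
# componentwise: dRSS/db_k = -2 * sum_i (y_i - sum_j b_j x_ij) * x_ik
grad_components = np.array([-2 * np.sum(resid * X[:, k]) for k in range(p)])
grad_matrix = -2 * X.T @ resid
print(np.allclose(grad_components, grad_matrix))   # True

# finite-difference check of the Hessian: the gradient is affine in b,
# so the difference quotient recovers 2 X^T X almost exactly
eps = 1e-6
H_fd = np.column_stack([
    (-2 * X.T @ (y - X @ (b + eps * e)) - grad_matrix) / eps
    for e in np.eye(p)
])
print(np.allclose(H_fd, 2 * X.T @ X))              # True
```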