Using Chain Rule in Matrix Differentiation

Rather than the chain rule, let's tackle the problem using differentials.

Let's use the convention that an uppercase letter is a matrix, a lowercase letter is a column vector, and a Greek letter is a scalar. Now let's define some variables and their differentials $$\eqalign{ z &= W_1x+b_1 &\implies dz=dW_1\,x \cr h &= {\rm relu}(z) &\implies dh={\rm step}(z)\odot dz = s\odot dz \cr }$$ where ${\rm step}(z)$ is the Heaviside step function, and $dz=dW_1\,x$ because we are differentiating with respect to $W_1$ while holding $x$ and $b_1$ fixed. Both the relu and step functions are applied elementwise to their arguments.
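The relation $dh = {\rm step}(z)\odot dz$ can be checked numerically: since relu is piecewise linear, perturbing $W_1$ by a small $dW_1$ changes $h$ by exactly $s\odot(dW_1\,x)$ as long as no component of $z$ crosses zero. A minimal sketch with NumPy (the shapes and random values here are arbitrary choices for illustration):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
step = lambda z: (z > 0).astype(float)  # Heaviside step, elementwise

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal(4)
x  = rng.standard_normal(3)

z = W1 @ x + b1
h = relu(z)

# Perturb W1 by a tiny dW1 and compare the actual change in h
# with the linearization dh = step(z) * dz, where dz = dW1 @ x
eps = 1e-6
dW1 = eps * rng.standard_normal((4, 3))
dz = dW1 @ x
dh_true = relu((W1 + dW1) @ x + b1) - h
dh_lin  = step(z) * dz
print(np.allclose(dh_true, dh_lin, atol=1e-9))
```

Away from the kink at $z_i=0$ the match is exact up to floating-point rounding, which is why the step function (rather than some smoothed derivative) appears in the differential.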

Now take the differential and gradient of the function. $$\eqalign{ \phi &= w_2^Th + \beta_2 \cr &= w_2:h + \beta_2 \cr \cr d\phi &= w_2:dh \cr &= w_2:(s\odot dz) \cr &= (s\odot w_2):dz \cr &= (s\odot w_2):dW_1\,x \cr &= (s\odot w_2)x^T:dW_1 \cr \cr \frac{\partial\phi}{\partial W_1} &= (w_2\odot s)x^T \cr &= \Big(w_2\odot{\rm step}(W_1x+b_1)\Big)x^T \cr\cr }$$ In the above, I used the notations $$\eqalign{ &A:B = {\rm tr}(A^TB) \cr &A\odot B \cr }$$ for the trace/Frobenius and elementwise/Hadamard products, respectively. For column vectors the Frobenius product reduces to the ordinary dot product, i.e. $w_2:h = w_2^Th$, and the key rearrangement steps use the identities $a:(b\odot c)=(a\odot b):c$ and $a:(Bc)=ac^T:B$.
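The final gradient can be verified against central finite differences of $\phi$; a short sketch, reusing the same arbitrary shapes and random data as above (the scalar `beta2` and seed are illustrative assumptions):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
step = lambda z: (z > 0).astype(float)

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal(4)
w2 = rng.standard_normal(4)
beta2 = 0.5
x = rng.standard_normal(3)

def phi(W):
    return w2 @ relu(W @ x + b1) + beta2

# Closed-form gradient from the derivation: (w2 ⊙ step(W1 x + b1)) x^T
grad = np.outer(w2 * step(W1 @ x + b1), x)

# Central finite differences, one entry of W1 at a time
eps = 1e-6
fd = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        E = np.zeros_like(W1)
        E[i, j] = eps
        fd[i, j] = (phi(W1 + E) - phi(W1 - E)) / (2 * eps)

print(np.max(np.abs(fd - grad)))
```

The maximum discrepancy should be on the order of the finite-difference truncation error, confirming that $\partial\phi/\partial W_1$ is a matrix of the same shape as $W_1$, formed as an outer product of the masked vector $w_2\odot s$ with $x$.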