Why does the gradient commute with taking expectation?

That is not the most general version. It is just convenient for the proof. It basically says, you may differentiate under expectation.

Consider $f:\Omega \times U \to \mathbb R$, where $(\Omega, \mathcal A, P)$ is some probability space (Or a $\sigma$ finite measure space if you like) and $U\subseteq \mathbb R^n$ is open and contains $0$.

  1. Assume for (almost) every $\omega\in\Omega$ the function $z \mapsto f(\omega, z)$ is differentiable and define $g(\omega, z) = D_{z} f(\omega, z)$.

  2. Assume there is a $P$-integrable function $S:\Omega \to \mathbb R$ with $\|g(\;.\;, z)\| \le S$ (almost) everywhere for everywhere $z\in U$ and define $G(z) = \int g(\;.\;, z) dP$.

  3. Assume $G$ is continuous at $0$.

Then, $F$ is differentiable at $0$ with $$ D F(0) = G(0). $$

Proof:

For (almost) every $\omega\in\Omega$ we have $\|g(\omega,\;.\;)\| \le S(\omega) < \infty$. That is $f(\omega, \;.\;)$ is lipschitz continuous and the fundamental theorem of calculus applies. Thus, for $z\in U$ we have \begin{align*} F(z) - F(0) - G(0)z &= \int f(\omega, z) - f(\omega,0) - g(\omega,0)z \; dP(\omega) \\ &= \int \int_{0}^1 g(\omega, tz)z - g(\omega, 0)z \; dt\, dP(\omega) \\ &=\biggl(\underbrace{\int \int_{0}^1 g(\omega, tz) - g(\omega, 0) \; dt \, dP(\omega)}_{R(z)} \biggr) z. \end{align*} Since $$ \int\int_0^1 \| g(\omega, tz) - g(\omega, 0) \| \; dt \, dP(\omega) \le \int\int_0^1 2 S(\omega) \; dt \, dP(\omega) = 2 \int S \, dP < \infty$$ Fubini's theorem applies and for $z\to 0$ we have $$ R(z) = \int_0^1 \int g(\omega, tz) - g(\omega, 0)\; dP(\omega) \, dt = \int_0^1 G(tz) - G(0) \, dt \to 0. $$ That is, $F$ is differentiable at $0$ with $$ DF(0) = G(0). $$

Notes:

  1. The assumptions are obviously satisfied in your case.

  2. We can relax assumption 1 and 2 to: Assume for almost every $\omega\in\Omega$ the function $f(\omega,\;.\;)$ is locally absolutely continuous with $$ \sup_{z\in U} \int \| D_z f(\;.\;, z) \| \, dP < \infty. $$ However, one need to be careful about the domain of $D_z f$.

  3. We can replace the Fubini type argument with a Dominated Convergence type argument with a slightly stronger assumption 3. This would allow every measure instead of only $\sigma$ finite one.


$\beta$ is not a random variable so you can expand the expression and take $\beta$ out as a factor. Then differentiate that expression with respect to $\beta$ and check that the claimed equality holds. It is not more complicated than that. It would be a problem if you couldn't factor $\beta$ out (e.g. $e^{X^{\top}\beta}$). Then you would indeed have to justify interchanging integration and differentiation.