Why is the denominator $N-p-1$ in the estimation of the variance?

The currently accepted answer is flawed, as it implicitly assumes that the model error $\varepsilon$ is Gaussian (without that assumption, $\sum(y_i-\hat{y}_i)^2\sim\sigma^2\chi^2_{N-p-1}$ need not hold).

Here's a proof under the weaker assumption that $\varepsilon$ has mean $0$ and covariance matrix $\sigma^2 I_N$.

First note that $\sum(y_i-\hat{y}_i)^2=\|y-X\hat\beta\|^2$.

We have $$\begin{align} y-X\hat\beta &= X\beta +\varepsilon -X(X^TX)^{-1}X^T(X\beta +\varepsilon)\\ &=X\beta +\varepsilon - X\beta -X(X^TX)^{-1}X^T\varepsilon\\ &= (I_N-H)\varepsilon\end{align}$$ where $H=X(X^TX)^{-1}X^T$ is the hat matrix. It's easy to check that $H^T=H$ and $H^2=H$ (indeed the hat matrix is simply the orthogonal projection onto $\operatorname{Im}X$), and consequently $I_N-H$ is also symmetric and idempotent.
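These properties are easy to verify numerically; here is a minimal sketch in NumPy, where the design matrix `X` (one intercept column plus $p$ simulated regressors) and the dimensions `N`, `p` are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3

# Illustrative design matrix: an intercept column plus p regressors,
# so X has p + 1 columns in total.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])

# Hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H.T, H))    # True: H is symmetric
print(np.allclose(H @ H, H))  # True: H is idempotent
print(np.allclose(H @ X, X))  # True: H acts as the identity on Im(X), as a projection should
```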

Hence $E( \|y-X\hat\beta\|^2) = E(\varepsilon^T(I_N-H)^T (I_N-H)\varepsilon)=E(\varepsilon^T(I_N-H)\varepsilon)$.

Note that $\varepsilon^T(I_N-H)\varepsilon=\sum_{i,j} \varepsilon_i\varepsilon_j (\delta_{ij}-H_{ij})$ and that $E(\varepsilon_i\varepsilon_j)=\sigma^2\delta_{ij}$, thus $$E(\varepsilon^T(I_N-H)\varepsilon)=\sum_{i,j} \sigma^2\delta_{ij} (\delta_{ij}-H_{ij})=\sigma^2(N-\operatorname{tr}H)$$

Note that, by the cyclic property of the trace, $\operatorname{tr}H =\operatorname{tr}(X(X^TX)^{-1}X^T)=\operatorname{tr}(X^TX(X^TX)^{-1})=\operatorname{tr}(I_{p+1})=p+1$
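The trace identity can be checked on the same kind of toy design matrix (a sketch only; the setup is repeated so the snippet runs on its own):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.trace(H))                                  # ~ 4.0, i.e. p + 1
print(np.trace(X.T @ X @ np.linalg.inv(X.T @ X)))   # same value, by cyclicity of the trace
```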

Putting everything together, $E( \|y-X\hat\beta\|^2)=\sigma^2(N-p-1)$, so dividing the residual sum of squares by $N-p-1$ (rather than $N$) gives an unbiased estimator of $\sigma^2$.
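To see that the result really does not rely on Gaussianity, here is a small Monte Carlo sketch; the centered-exponential errors, the simulated `X`, `beta`, and the replication count are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 30, 2, 1.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T

n_rep = 100_000
# Centered exponential errors: mean 0, variance sigma^2, clearly non-Gaussian.
E = sigma * (rng.exponential(size=(n_rep, N)) - 1.0)
Y = E + X @ beta              # each row is one simulated response vector y
R = Y - Y @ H.T               # residual vectors (I - H) y, one per row (H is symmetric)
rss = (R ** 2).sum(axis=1)    # residual sum of squares per replication

print(rss.mean())              # ~ 60.75
print(sigma**2 * (N - p - 1))  # = 2.25 * 27 = 60.75
```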


Assuming Gaussian errors, you can show that $\sum(y_i-\hat{y}_i)^2\sim\sigma^2\chi^2_{N-p-1}$. Since the expectation of a $\chi^2_{N-p-1}$ random variable is $N-p-1$, it follows that $\mathbb{E}\left(\frac{1}{N-p-1}\sum(y_i-\hat{y}_i)^2\right)=\sigma^2$.

$N-p-1$ is in the denominator to make the estimator unbiased.
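A companion simulation sketch under Gaussian errors (the setting of the $\chi^2$ argument above) illustrates both the unbiasedness of the $N-p-1$ denominator and the downward bias of dividing by $N$; again, the particular dimensions and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 20, 4, 2.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T

n_rep = 100_000
E = sigma * rng.normal(size=(n_rep, N))   # Gaussian errors, as in the chi-square argument
Y = E + X @ beta
R = Y - Y @ H.T                            # residual vectors (I - H) y
rss = (R ** 2).sum(axis=1)

print((rss / (N - p - 1)).mean())  # ~ sigma^2 = 4.0  (unbiased)
print((rss / N).mean())            # ~ sigma^2 * (N - p - 1) / N = 3.0  (biased downward)
print((rss / sigma**2).mean())     # ~ N - p - 1 = 15, the mean of a chi^2_{N-p-1}
```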