Why is the least squares cost function for linear regression convex?

Let $x_i \in \mathbb R^n$ be the $i$th of $m$ training examples and let $X$ be the $m \times n$ matrix whose $i$th row is $x_i^T$. Let $y$ be the column vector whose $i$th entry is $y_i$. Define $J:\mathbb R^n \to \mathbb R$ by $$ J(\theta) = \frac{1}{2m} \sum_{i=1}^m (x_i^T \theta - y_i)^2. $$ Notice that $$ J(\theta) = \frac{1}{2m} \| X \theta - y \|_2^2. $$ You can easily check that the function $$ f(z) = \frac{1}{2m} \| z \|_2^2 $$ is convex by checking that its Hessian is positive definite. (In fact, $$ \nabla^2 f(z) = \frac{1}{m} I, $$ where $I$ is the identity matrix.)
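As a quick numerical sanity check (not part of the argument itself), here is a minimal NumPy sketch on hypothetical random data confirming that the sum form and the matrix-norm form of $J$ agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                       # hypothetical problem size
X = rng.normal(size=(m, n))        # rows are the training examples x_i^T
y = rng.normal(size=m)
theta = rng.normal(size=n)

# Sum form: (1/2m) * sum_i (x_i^T theta - y_i)^2
J_sum = sum((X[i] @ theta - y[i]) ** 2 for i in range(m)) / (2 * m)

# Matrix form: (1/2m) * ||X theta - y||_2^2
J_norm = np.linalg.norm(X @ theta - y) ** 2 / (2 * m)

print(np.isclose(J_sum, J_norm))   # True
```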

A very useful fact is that the composition of a convex function with an affine function is convex. Since $J(\theta) = f(X \theta - y)$ is exactly the composition of the convex function $f$ with the affine map $\theta \mapsto X \theta - y$, this fact implies that $J$ is convex.
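A minimal sketch (reusing the same hypothetical random $X$ and $y$) that spot-checks this conclusion empirically: for random pairs of points and random $\lambda \in [0,1]$, $J$ should satisfy the convexity inequality $J(\lambda \theta_1 + (1-\lambda)\theta_2) \le \lambda J(\theta_1) + (1-\lambda) J(\theta_2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

def J(theta):
    """Least-squares cost J(theta) = (1/2m) ||X theta - y||^2."""
    return np.linalg.norm(X @ theta - y) ** 2 / (2 * m)

# Convexity check: J(lam*t1 + (1-lam)*t2) <= lam*J(t1) + (1-lam)*J(t2)
for _ in range(1000):
    t1, t2 = rng.normal(size=n), rng.normal(size=n)
    lam = rng.uniform()
    assert J(lam * t1 + (1 - lam) * t2) <= lam * J(t1) + (1 - lam) * J(t2) + 1e-12

print("convexity inequality held on all sampled pairs")
```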


An alternative approach is to compute the Hessian of $J$ directly: $$ \nabla J(\theta) = \frac{1}{m} X^T(X\theta - y) $$ and $$\nabla^2 J(\theta) = \frac{1}{m} X^T X. $$ The matrix $X^T X$ is positive semidefinite, since $v^T X^T X v = \| X v \|_2^2 \geq 0$ for every $v \in \mathbb R^n$, which shows that $J$ is convex.
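To illustrate, a short sketch (again with hypothetical random data) that forms the Hessian $\frac{1}{m} X^T X$ and confirms it is positive semidefinite by inspecting its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3
X = rng.normal(size=(m, n))

# Hessian of J; note it does not depend on theta
H = X.T @ X / m

# A symmetric matrix is PSD iff all its eigenvalues are >= 0
eigvals = np.linalg.eigvalsh(H)
print(eigvals)                      # all nonnegative (up to floating-point error)
print(np.all(eigvals >= -1e-12))    # True
```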