Does $(X'X)^{-1}$ always exist?

No: $X'X$ is NOT always invertible, so $(X'X)^{-1}$ does not always exist. Consider $X$ a single row vector with $n>1$ entries: then $X'X$ is an $n \times n$ matrix of rank 1, which is singular.

In fact, when $X'X$ is invertible, $(X'X)^{-1}X'$ is the Moore–Penrose (MP) pseudoinverse of $X$, a generalization of the inverse $X^{-1}$.
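
To make this concrete, here is a small numpy sketch (the data and variable names are purely illustrative): a single-row $X$ gives a rank-1, singular $X'X$, and with linearly independent columns $(X'X)^{-1}X'$ matches numpy's Moore–Penrose pseudoinverse.

```python
import numpy as np

# Illustrative example: a single-row X makes X'X rank 1 and singular (for n > 1).
x_row = np.array([[1.0, 2.0, 3.0]])          # shape (1, 3)
gram = x_row.T @ x_row                        # 3 x 3 matrix of rank 1
print(np.linalg.matrix_rank(gram))            # 1 -> not invertible

# With linearly independent columns, (X'X)^{-1} X' equals the Moore-Penrose pseudoinverse.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                  # m = 10 observations, n = 3 features
left_inverse = np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(left_inverse, np.linalg.pinv(X)))   # True
```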


There are some points in your question that may warrant discussion at a conceptual level of what we are trying to achieve, rather than how to do it.

We are in the context of an over-determined system: more equations than unknowns. The unknowns are the parameters or coefficients in the linear system: $\Theta=\begin{bmatrix}\theta_1,\theta_2,\dots,\theta_n\end{bmatrix}^\top,$ with which we want to relate the explanatory variables (features or regressors) in the columns of the model matrix $X$ to the dependent variable or outcome $Y$ as: $Y=X\Theta.$

The problem stems from the fact that these explanatory variables are typically measured many times, once for every subject or example. For instance, in a medical study, the age, weight, height, blood pressure and cholesterol (explanatory variables) may be measured in hundreds of patients (the rows of the matrix $X$), and we attempt to relate them to a dependent variable $Y$ (for example, the concentration of some biochemical marker for cancer in blood). Note that this is the opposite problem to an under-determined system, in which there are only a few rows of measurements.
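
As a sketch of this setup (the sizes and coefficients below are invented, not from any real study), we can simulate an overdetermined design with far more rows than columns:

```python
import numpy as np

# Toy stand-in for the medical-study setup: m subjects (rows), n explanatory
# variables (columns), so m >> n.
rng = np.random.default_rng(1)
m, n = 200, 5                                        # e.g. 200 patients, 5 measurements each
X = rng.normal(size=(m, n))                          # model matrix of measurements
theta_true = np.array([0.5, -1.0, 2.0, 0.0, 1.5])    # hypothetical true coefficients
Y = X @ theta_true + rng.normal(scale=0.1, size=m)   # outcome with measurement noise
print(X.shape, Y.shape)                              # (200, 5) (200,)
```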

The equation $(2)$ is therefore not an option: the matrix $X$ is rectangular and cannot be inverted. If it were invertible, we would actually be in the situation where we have as many observations as unknowns, $Y$ would lie exactly in the column space of $X$, and there would be no need to project.

Instead, this is what the linear algebra of the subspaces of $X$ looks like in an overdetermined problem with linearly independent columns of $X$:

[Figure: the four fundamental subspaces of $X$]

Notice how the rank of $X$ coincides with the number of columns $n,$ and the left nullspace, where all our woes reside, expands in dimensionality as the number of observations ($m$ rows in the dataset $X$) increases: $\dim \text{Null}(X^\top) = m - n$ since the rank is $n$:

[Figure: subspaces of $X$ with $\dim \text{Null}(X^\top) = m - n$]
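
A quick numerical check of these dimensions, again on a simulated, illustrative $X$:

```python
import numpy as np

# Check the subspace dimensions for an overdetermined X with independent columns.
rng = np.random.default_rng(2)
m, n = 200, 5
X = rng.normal(size=(m, n))

rank = np.linalg.matrix_rank(X)
print(rank == n)     # True: the rank coincides with the number of columns
print(m - rank)      # 195 = dimension of the left null space, Null(X^T)
```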

Here is the problem: what we have are the observations $Y$ of the dependent variable, living in $\mathbb R^m,$ but what we want is the vector $\hat \Theta,$ which lives in the row space of $X.$ Although the map from the row space of $X$ to its column space is invertible, a vector that does not lie strictly in the hyperplane spanned by $\text{Col}(X)$ cannot be inverted back: its component in the left null space $\text{Null}(X^\top),$ the error $\epsilon,$ is mapped to zero and hence cannot be recovered by any inverse matrix.

Projecting is what we need to settle for in a real-life noisy example: we project the vector $Y$ onto the column space of $X,$ an $m \times n$ matrix where $m \gg n.$ We look for the orthogonal projection of the outcome vector $Y$ onto the subspace spanned by the $n$ columns of $X,$ which form a hyperplane within $\mathbb R^m.$ The projected vector of $Y$ is typically denoted with a hat, $\hat Y.$

[Figure: orthogonal projection of $Y$ onto $\text{Col}(X)$]

This acknowledges that no linear combination of the columns of $X$ can produce exactly $Y.$ If the matrix were square and of full rank $m,$ there would be no need to project.
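
Here is a minimal sketch of that projection, assuming a simulated $X$ and $Y$: the hat matrix $H = X(X^\top X)^{-1}X^\top$ sends $Y$ to $\hat Y,$ and the residual $Y - \hat Y$ is orthogonal to every column of $X.$

```python
import numpy as np

# Y_hat is the orthogonal projection of Y onto Col(X); the residual Y - Y_hat
# is perpendicular to every column of X.
rng = np.random.default_rng(3)
m, n = 200, 5
X = rng.normal(size=(m, n))
Y = X @ rng.normal(size=n) + rng.normal(scale=0.1, size=m)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat (projection) matrix onto Col(X)
Y_hat = H @ Y
residual = Y - Y_hat
print(np.allclose(X.T @ residual, 0))    # True: residual is orthogonal to Col(X)
```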

As pointed out multiple times by now, $X^\top X$ can only be inverted when the columns of $X$ are linearly independent. This is almost always the case in noisy, real-life matrices of data. And when this is the case, $(X^\top X)^{-1}X^\top$ is a good second-best attempt at an inverse: for instance, it produces the identity when multiplied on the right by $X,$ as in $(X^\top X)^{-1}X^\top X=I.$ It can easily be proven that it produces the coefficients of the orthogonal projection, i.e. the error term is perpendicular to $\text{Col}(X).$ The coefficients are thus calculated as

$$\hat \Theta = \left(X^\top X \right)^{-1} X^\top Y$$
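
As a sanity check (on simulated data, so just a sketch), this closed-form estimate agrees with the estimate from numpy's general least-squares solver:

```python
import numpy as np

# Compare the normal-equations estimate with np.linalg.lstsq.
rng = np.random.default_rng(4)
m, n = 200, 5
X = rng.normal(size=(m, n))
theta_true = rng.normal(size=n)
Y = X @ theta_true + rng.normal(scale=0.1, size=m)

theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y          # normal-equations estimate
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # QR/SVD-based solver
print(np.allclose(theta_hat, theta_lstsq))            # True
```

In practice one would use a solver like `np.linalg.lstsq` (or a QR/SVD factorization) rather than forming $(X^\top X)^{-1}$ explicitly, for numerical stability.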

The singular value decomposition can be used beyond the case where $X$ has linearly independent columns to obtain the Moore–Penrose pseudoinverse $X^+$ discussed above. When there is collinearity (less than full column rank) we can use the pseudoinverse to estimate the parameters as $\hat \Theta = X^+ Y.$ This is flexible in that for any model matrix $X$ decomposed via SVD as $X=U\Sigma V^\top,$ we can form the pseudoinverse

$$X^+=V\Sigma^{+}U^\top,$$

where $\Sigma^+$ is obtained by taking the reciprocal of each nonzero singular value of $\Sigma$ and leaving the zeros in place.
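
A sketch of this with an intentionally collinear column (illustrative data only): build $\Sigma^+$ by inverting only the nonzero singular values, and check against `np.linalg.pinv`.

```python
import numpy as np

# A duplicated (collinear) column makes X'X singular, but the SVD-based
# pseudoinverse X^+ = V Sigma^+ U^T still gives a parameter estimate.
rng = np.random.default_rng(5)
m = 200
x1 = rng.normal(size=m)
X = np.column_stack([x1, 2 * x1, rng.normal(size=m)])   # second column = 2 * first
Y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.1, size=m)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
cutoff = 1e-10 * s.max()                       # treat singular values below this as zero
s_plus = np.zeros_like(s)
s_plus[s > cutoff] = 1.0 / s[s > cutoff]       # Sigma^+: invert only nonzero singular values
X_plus = Vt.T @ np.diag(s_plus) @ U.T          # X^+ = V Sigma^+ U^T
print(np.allclose(X_plus, np.linalg.pinv(X, rcond=1e-10)))   # True
theta_hat = X_plus @ Y                         # minimum-norm least-squares estimate
print(theta_hat)
```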