Relationship between the singular value decomposition (SVD) and principal component analysis (PCA). A radical result(?)

Suppose we have a bunch of large vectors $x_1,\ldots,x_N$ stored as the columns of a matrix $X$. It would be nice if we could somehow find a small number of vectors $u_1,\ldots,u_s$ such that each vector $x_i$ is (to a good approximation) equal to a linear combination of the vectors $u_1,\ldots, u_s$. This would allow us to describe each of the (very large) vectors $x_i$ using just a small number of coefficients.

So we want to find vectors $u_1,\ldots, u_s$ such that for each $x_i$ we have \begin{equation} x_i \approx c_{i,1} u_1 + c_{i,2} u_2 + \cdots + c_{i,s} u_s \end{equation} for some coefficients $c_{i,1},\ldots, c_{i,s}$.

These $N$ equations (as $i$ goes from $1$ to $N$) can be combined into a single matrix equation: \begin{equation} X \approx U C \end{equation} where $U$ is the matrix whose columns are $u_1,\ldots, u_s$ and $C$ is the $s \times N$ matrix whose $i$th column contains the coefficients $c_{i,1},\ldots, c_{i,s}$.

Note that the rank of $UC$ is at most $s$, because $U$ has only $s$ columns. So $UC$ is a low rank approximation of $X$.
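To make the $X \approx UC$ picture concrete, here is a small NumPy sketch (the sizes $m$, $N$, $s$ and the random data are purely illustrative). It uses the fact that if $U$ has orthonormal columns, the least-squares choice of coefficients is simply $C = U^T X$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, s = 100, 50, 3          # illustrative sizes: N vectors of dimension m, s basis vectors

# Build data that really is close to a rank-s matrix, plus a little noise.
U_true = np.linalg.qr(rng.standard_normal((m, s)))[0]   # orthonormal columns
C_true = rng.standard_normal((s, N))
X = U_true @ C_true + 0.01 * rng.standard_normal((m, N))

# If U has orthonormal columns, the least-squares coefficients for X ≈ U C are C = U^T X.
U = U_true                    # pretend this U was handed to us
C = U.T @ X
print(np.linalg.norm(X - U @ C) / np.linalg.norm(X))    # small relative error
```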

Here is the key fact: the SVD gives us an optimal low rank approximation of $X$! This is one of the basic facts about the SVD, and it is why the SVD can be used for image compression.

If the SVD of $X$ is expressed as \begin{equation} X = \sum_{i=1}^N \sigma_i u_i v_i^T, \end{equation} where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_N \geq 0$, then an optimal approximation of $X$ of rank less than or equal to $s$ is \begin{align} X &\approx \sum_{i=1}^s \sigma_i u_i v_i^T \\ &= U \Sigma V^T \\ &= U C \end{align} where $U$ is the matrix with columns $u_1,\ldots, u_s$, $\Sigma$ is the $s \times s$ diagonal matrix with diagonal entries $\sigma_1,\ldots,\sigma_s$, $V$ is the matrix with columns $v_1,\ldots,v_s$, and $C = \Sigma V^T$.

Thus, the SVD finds an optimal $U$ for us.
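Here is a short NumPy sketch of that claim (the sizes and random data are illustrative): compute the SVD, keep the $s$ largest singular values, and form $U$ and $C = \Sigma V^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, s = 100, 50, 5           # illustrative sizes
X = rng.standard_normal((m, N))

# Thin SVD: X = U_full @ diag(sigma) @ Vt_full, with sigma sorted in decreasing order.
U_full, sigma, Vt_full = np.linalg.svd(X, full_matrices=False)

# Keep the s largest singular values (truncated SVD).
U = U_full[:, :s]                          # columns u_1, ..., u_s
C = np.diag(sigma[:s]) @ Vt_full[:s, :]    # C = Sigma V^T
X_s = U @ C                                # optimal approximation of rank <= s

print(np.linalg.matrix_rank(X_s))          # <= s
print(np.linalg.norm(X - X_s))             # approximation error (Frobenius norm)
```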

PCA takes as input the vectors $x_1,\ldots,x_N$ together with a small positive integer $s$. It demeans the vectors (subtracting their mean from each of them), stores the demeaned vectors in the columns of a matrix $X$, computes the SVD $X = U \Sigma V^T$, and returns the first $s$ columns of $U$ as output.
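Here is a minimal sketch of that recipe in NumPy. The sizes and the random data are purely illustrative, and the cross-check against scikit-learn's `PCA` is my own addition (it assumes scikit-learn is available); the two sets of directions should agree up to the sign of each column.

```python
import numpy as np
from sklearn.decomposition import PCA   # only used as a cross-check

rng = np.random.default_rng(2)
m, N, s = 20, 200, 3                    # illustrative sizes: N samples of dimension m

data = rng.standard_normal((m, N))      # samples stored as columns, as in the text

# PCA "by hand": demean the columns, take the SVD, keep the first s columns of U.
mean = data.mean(axis=1, keepdims=True)
X = data - mean
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
components_by_hand = U[:, :s]           # principal directions, shape (m, s)

# Cross-check against scikit-learn, which expects samples as rows.
pca = PCA(n_components=s).fit(data.T)
components_sklearn = pca.components_.T  # shape (m, s)

# The two agree up to the sign of each column.
for k in range(s):
    a, b = components_by_hand[:, k], components_sklearn[:, k]
    print(np.allclose(a, b, atol=1e-6) or np.allclose(a, -b, atol=1e-6))
```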


Referring to the answer to the linked question:

1) The principal components are the column vectors of $U = W$. Their significance, and their order of significance, is given by the corresponding singular values; most implementations return the singular values sorted in decreasing order.

2) $U=W$.

3) See above: $W D W^T = X X^T = U \Sigma^2 U^T$, so that again $W = U$ and $D = \Sigma^2$. (A numerical check of points 1)–3) is sketched after point 4) below.)

4) The idea is that the sample points lie in a cloud that is essentially flat, or even thin, i.e., flat in several directions at once. The assumption of SVD and PCA is that the center of the cloud is the origin (hence the demeaning), and you want to find the directions in which the cloud is extended. The SVD $X = U \Sigma V^T$ can then be used to find a low rank approximation $$ X \approx \sum_{i=1}^r \sigma_i u_i v_i^T $$ with error $\sqrt{\sum_{i>r} \sigma_i^2}$ in the Frobenius norm (checked numerically in the last sketch below). If that error is small, the vectors $u_1,\ldots,u_r$ give the dominant pattern(s) in the data and the most visible deviations from that pattern. The last of the $u_i$, the ones with the smallest singular values, are vectors that are essentially orthogonal to the data set; that is, the data set is aligned with the orthogonal complement of the least significant vectors.
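As promised above, here is a quick numerical check of points 1)–3), assuming demeaned data stored in the columns of $X$ (the sizes and random data are illustrative): the eigendecomposition $XX^T = WDW^T$ and the SVD $X = U\Sigma V^T$ give $W = U$ (up to the sign of each column) and $D = \Sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, N = 6, 40                           # illustrative sizes; m < N so X X^T is small
X = rng.standard_normal((m, N))
X -= X.mean(axis=1, keepdims=True)     # demean, as PCA assumes

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X X^T: X X^T = W D W^T.
D_eigvals, W = np.linalg.eigh(X @ X.T)
order = np.argsort(D_eigvals)[::-1]    # eigh returns ascending order; sort to match sigma
D_eigvals, W = D_eigvals[order], W[:, order]

print(np.allclose(D_eigvals, sigma**2))     # D = Sigma^2
# W and U agree column by column, up to sign.
print(all(np.allclose(W[:, k], U[:, k]) or np.allclose(W[:, k], -U[:, k])
          for k in range(m)))
```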
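And a small check of the error formula in 4) (again with illustrative sizes): the Frobenius-norm error of the rank-$r$ truncation is exactly $\sqrt{\sum_{i>r} \sigma_i^2}$.

```python
import numpy as np

rng = np.random.default_rng(4)
m, N, r = 8, 30, 3                     # illustrative sizes and truncation rank
X = rng.standard_normal((m, N))

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-r truncation and its error.
X_r = U[:, :r] @ np.diag(sigma[:r]) @ Vt[:r, :]
err_frobenius = np.linalg.norm(X - X_r)            # Frobenius norm of the residual
err_from_sigmas = np.sqrt(np.sum(sigma[r:]**2))    # sqrt of the discarded sigma_i^2

print(np.isclose(err_frobenius, err_from_sigmas))  # True
```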