Connection between PCA and linear regression

Mathematically, the difference between PCA and regression can be seen as just one additional multiplication by an inverse matrix...

Here is a correlation matrix with three groups of variables:

group g1: $x_{1,1},x_{1,2}$ correlated         
group g2: $x_{2,1},x_{2,2},x_{2,3}$ correlated         
group g3: $y$, composed of the variables of groups g1 and g2         

The correlation matrix looks like

cor         x1_1    x1_2    x2_1    x2_2    x2_3    ___y
-----------------------------------------------------------
x1_1        1.000   0.976   0.472   0.444   0.430   0.642
x1_2        0.976   1.000   0.594   0.567   0.538   0.767
x2_1        0.472   0.594   1.000   0.998   0.979   0.919
x2_2        0.444   0.567   0.998   1.000   0.986   0.917
x2_3        0.430   0.538   0.979   0.986   1.000   0.904
___y        0.642   0.767   0.919   0.917   0.904   1.000
-----------------------------------------------------------
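For readers who want to retrace the following steps numerically, here is a minimal setup of this correlation matrix in Python/NumPy (my own translation; the session snippets below use a different matrix language, and the variable names R and names are mine). Because the printed matrix is rounded to three decimals, the sketches can reproduce the tables only up to small deviations in the last digits.

    import numpy as np

    # correlation matrix from the table above: x1-group, x2-group, then y (last row/column)
    names = ["x1_1", "x1_2", "x2_1", "x2_2", "x2_3", "y"]
    R = np.array([
        [1.000, 0.976, 0.472, 0.444, 0.430, 0.642],
        [0.976, 1.000, 0.594, 0.567, 0.538, 0.767],
        [0.472, 0.594, 1.000, 0.998, 0.979, 0.919],
        [0.444, 0.567, 0.998, 1.000, 0.986, 0.917],
        [0.430, 0.538, 0.979, 0.986, 1.000, 0.904],
        [0.642, 0.767, 0.919, 0.917, 0.904, 1.000],
    ])
    # the strong in-group correlations make the smallest eigenvalues tiny
    print(np.round(np.linalg.eigvalsh(R), 4))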

This is the loadings matrix in its initial triangular Cholesky form, not yet rotated to PCA position:

tri         f1       f2      f3      f4      f5      f6
-----------------------------------------------------------
x1_1        1.000   0.000   0.000   0.000   0.000   0.000
x1_2        0.976   0.218   0.000   0.000   0.000   0.000
x2_1        0.472   0.612   0.634   0.000   0.000   0.000
x2_2        0.444   0.615   0.649   0.054   0.000   0.000
x2_3        0.430   0.543   0.700   0.119   0.128   0.000
___y        0.642   0.644   0.350   0.156   0.112   0.117
-----------------------------------------------------------
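In NumPy terms (continuing the sketch above, and assuming the rounded R is still positive definite), this triangular loadings matrix is simply the lower Cholesky factor of R:

    import numpy as np

    # R as defined in the earlier snippet
    try:
        L = np.linalg.cholesky(R)              # lower triangular, L @ L.T == R
    except np.linalg.LinAlgError:
        # the 3-decimal rounding can make R marginally indefinite; a tiny ridge
        # repairs that at the cost of small changes in the right-most columns
        L = np.linalg.cholesky(R + 1e-3 * np.eye(6))
    print(np.round(L, 3))                      # reproduces the 'tri' table above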

In PCA we do not distinguish between independent and dependent variables, so we can rotate to PCA position, where the first axis/column then denotes the first principal component, and so on.

[22] pca = rot(dr,"pca")

pca         f1        f2      f3      f4      f5      f6
-----------------------------------------------------------
x1_1        0.714   -0.692   0.100   0.041  -0.020   0.005
x1_2        0.810   -0.584  -0.031  -0.047   0.025  -0.007
x2_1        0.948    0.300   0.063  -0.081   0.003   0.018
x2_2        0.940    0.333   0.050  -0.049  -0.015  -0.019
x2_3        0.926    0.351   0.072   0.114   0.016  -0.001
___y        0.973    0.045  -0.226   0.027  -0.010   0.004
-----------------------------------------------------------
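One way to mimic this rotation step outside the matrix tool (continuing the NumPy sketch, with R and L from above) is the eigendecomposition of R: the PCA-position loadings are the eigenvectors scaled by the square roots of the eigenvalues, and they are an orthogonal rotation of the triangular factor L.

    import numpy as np

    # R (and its Cholesky factor L) as in the earlier snippets
    evals, evecs = np.linalg.eigh(R)
    order = np.argsort(evals)[::-1]                      # largest eigenvalue first
    A = evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))
    print(np.round(A, 3))    # reproduces the 'pca' table up to column signs

    # A is indeed only a rotation of L: T = inv(L) @ A is orthogonal (in exact
    # arithmetic), so A = L @ T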

We see that all variables share one common factor, but we can also see that only two or three factors are "relevant". A quartimax rotation may align the three main factors better with the variable groups:

[26] qua = rot(pca,"quartimax",1..6,1..3)

qua.max       f1      f2      f3      f4      f5      f6
-----------------------------------+------------------------
x1_1        0.373   0.925   -0.060   0.041  -0.020   0.005
x1_2        0.505   0.859    0.068  -0.047   0.025  -0.007
--------
x2_1        0.988   0.112   -0.059  -0.081   0.003   0.018
x2_2        0.994   0.078   -0.048  -0.049  -0.015  -0.019
x2_3        0.989   0.056   -0.071   0.114   0.016  -0.001
--------
___y        0.908   0.342    0.240   0.027  -0.010   0.004
-----------------------------------+------------------------

We see two main groups of x-variables with high in-group correlations, meaning they measure nearly the same thing, and a less sharply separated "group" containing only the y-variable, which is correlated with both groups but still has some individual variance (in factor f3).
This is classical PCA with quartimax/varimax rotation, the "little jiffy" procedure.
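For completeness, here is a rough NumPy sketch of that quartimax step, using the generic orthomax iteration (gamma = 0 gives quartimax, gamma = 1 varimax). As I read the rot(...,"quartimax",1..6,1..3) call, only the first three columns are mixed, so the sketch rotates A[:, :3] and leaves f4..f6 untouched; expect the result to resemble the 'qua' table only up to column order and signs.

    import numpy as np

    def orthomax(A, gamma=0.0, max_iter=200, tol=1e-8):
        """Orthogonal orthomax rotation of loadings A (gamma=0: quartimax, gamma=1: varimax)."""
        p, k = A.shape
        T = np.eye(k)
        obj = 0.0
        for _ in range(max_iter):
            Lam = A @ T
            grad = A.T @ (Lam**3 - (gamma / p) * Lam * (Lam**2).sum(axis=0))
            u, s, vt = np.linalg.svd(grad)
            T = u @ vt
            if s.sum() < obj * (1 + tol):      # criterion no longer increasing
                break
            obj = s.sum()
        return A @ T

    # A: the PCA-position loadings from the previous snippet; rotate only f1..f3
    A_q = A.copy()
    A_q[:, :3] = orthomax(A[:, :3], gamma=0.0)
    print(np.round(A_q, 3))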

Now we move on to regression. In regression we define one variable as dependent, in our case the variable y. We are interested in how y is composed of the independent variables. A still PCA-like point of view would be to compute the PCA of the independent variables only and leave factor f6, which carries the part of the y-variance that is uncorrelated with the x-variables, untouched as it comes from the initial triangular Cholesky factor.

[29] pca = rot(dr,"pca",1..5)

pca.reg      f1       f2      f3      f4      f5      f6
---------------------------------------------------+--------
x1_1        0.729   -0.680   0.067  -0.048  -0.002  0.000
x1_2        0.816   -0.571  -0.067   0.055   0.001  0.000
-------------------
x2_1        0.946    0.315  -0.066  -0.033   0.019  0.000
x2_2        0.936    0.348  -0.041  -0.013  -0.023  0.000
x2_3        0.923    0.366   0.117   0.036   0.004  0.000
---------------------------------------------------+--------
___y        0.957    0.057  -0.076   0.245  -0.039  0.117
---------------------------------------------------+--------

The axes still show the "factors" and how each variable is composed of the common (f1 to f5) or individual (f6) factors.

Regression now asks for the composition of y not in terms of the factors (the coordinates on the factor axes) but in terms of coordinates on the x-variables themselves, taken as axes.
Happily, we only need to multiply the current loadings matrix by the inverse of the x-submatrix to obtain axes defined by the x-variables and thus the "loadings" of y on the x's:

[32] B = (pca[*,1..5]*inv(pca[1..5,1..5]) ) || pca[*,6]

reg.B       x1_1    x1_2    x2_1    x2_2    x2_3    ___y
---------------------------------------------------+--------
x1_1        1.000   0.000   0.000   0.000   0.000   0.000
x1_2        0.000   1.000   0.000   0.000   0.000   0.000
-------------------
x2_1        0.000   0.000   1.000   0.000   0.000   0.000
x2_2        0.000   0.000   0.000   1.000   0.000   0.000
x2_3        0.000   0.000   0.000   0.000   1.000   0.000
---------------------------------------------------+--------
___y        -1.442  1.989   -1.394  0.955   0.876   0.117
---------------------------------------------------+--------

We see that each axis is now identified with one of the variables, and we see the "loadings" of y on those axes. Because the axes now point in the directions of the x-variables, the loadings of y in the last row are at the same time the regression weights/coefficients:

B = [-1.442  1.989   -1.394  0.955   0.876 ]

(Because of the strong correlations within the groups, some regression weights exceed 1 in magnitude and the signs alternate. But this is of little concern for this methodological explanation.)
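The whole step can be checked against the textbook formula for standardized regression coefficients, $b = R_{xx}^{-1} R_{xy}$. In fact the x-only PCA rotation is not even needed for the trick: any loadings matrix whose x-rows have zeros in the last column (the triangular Cholesky factor, or any rotation of its first five columns) gives the same B-row, because the rotation cancels in the product. A NumPy sketch, continuing from the snippets above (note that, with the x2-group being almost collinear, the coefficients react sensitively to the 3-decimal rounding of the printed correlation matrix):

    import numpy as np

    # R and its Cholesky factor L as in the earlier snippets (x = rows/cols 0..4, y = 5)
    b_from_loadings = L[5, :5] @ np.linalg.inv(L[:5, :5])   # y-row times inverse x-submatrix
    b_standard      = np.linalg.solve(R[:5, :5], R[:5, 5])  # b = R_xx^{-1} R_xy

    print(np.round(b_from_loadings, 3))   # roughly [-1.442  1.989 -1.394  0.955  0.876]
    print(np.allclose(b_from_loadings, b_standard))   # True: the two are algebraically identical

    # the leftover column f6 of y: square root of the unexplained variance, roughly 0.117
    print(np.round(np.sqrt(1 - R[5, :5] @ b_standard), 3))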


[Update]
Relating PCA and regression in this way leads quite naturally to another instructive example which might improve intuition: the problem of multicollinearity. If it occurs in regression, it is a problem for the researcher; if it occurs in PCA, it only improves the validity of estimating separate components and the loadings of the items on such (latent) constructs.
The device I want to introduce here is the "main direction" of the multicollinear items (which is of course their first principal component), taken separately for the two sets of independent items $x1$ and $x2$. We can introduce latent variables which mark the PCs of the two sets of x-items. In practice this is done by applying a PC rotation whose maximizing criterion is computed from one set only:

[32] pc1 = rot(tri,"pca",1..2,1..5)  // rotate to the PCs of items 1..2 only
[33] pc1 = {pc1,marksp(pc1,{1,2})}   // add marker variables for those PCs


PC1       pc1_1     pc1_2     c3        c4        c5        c6
--------------------------------------------------------------
x1_1      0.994    -0.109     0         0         0         0
x1_2      0.994     0.109     0         0         0         0
--------------------------------------------------------------
x2_1      0.562     0.571     0.599     0         0         0
x2_2      0.536     0.578     0.613     0.054     0         0
x2_3      0.521     0.516     0.657     0.123     0.125     0
--------------------------------------------------------------
y         0.722     0.575     0.316     0.151     0.11      0.116
===============================================================
pc1_1     1         0         0         0         0         0
pc1_2     0         1         0         0         0         0
===============================================================

If we looked at the system of $x1_1,x1_2$ and $y$ only, we would already have the beta values for the PCs of that $x1$ item set, namely $b_{pc1_1}=0.722$ and $b_{pc1_2}=0.575$ - no hassle from (multi)collinearity!
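In NumPy this amounts to an eigendecomposition of the 2x2 block of the x1 items (continuing with the correlation matrix R defined at the top; the tables of this update apparently come from a fresh simulation, so the numbers will not match exactly, but the mechanics are the same):

    import numpy as np

    # R as in the first snippet; rows/cols 0,1 = x1_1, x1_2 and col 5 = y
    evals, evecs = np.linalg.eigh(R[:2, :2])
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # correlations of y with the standardized PC scores of the x1 set;
    # since the two PCs are uncorrelated, these are at the same time the betas
    # of the regression of y on pc1_1 and pc1_2 -- no multicollinearity trouble
    b_pc1 = (evecs.T @ R[:2, 5]) / np.sqrt(evals)
    print(np.round(np.abs(b_pc1), 3))   # close to the 0.722 / 0.575 quoted above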

The same can be done with the second set of independent items $x2_1,x2_2,x2_3$:

[37] pc2 = rot(pc1,"pca",3..5,1..5)                                   
[38] pc2 = {pc2,marksp(pc2,{1,2,3})}                               

PC2       pc2_1     pc2_2     pc2_3     c4        c5        c6
--------------------------------------------------------------
x1_1      0.477    -0.124    -0.464     0.729     0.101     0
x1_2      0.599    -0.189    -0.389     0.636     0.221     0
--------------------------------------------------------------
x2_1      0.997    -0.078    -0.022     0         0         0
x2_2      0.999    -0.041     0.027     0         0         0
x2_3      0.993     0.119    -0.005     0         0         0
--------------------------------------------------------------
y         0.923    -0.029    -0.058     0.212     0.294     0.116
===============================================================
pc1_1     0.542    -0.157    -0.429     0.687     0.162     0
pc1_2     0.557    -0.300     0.343    -0.424     0.55      0
--------------------------------------------------------------
pc2_1     1         0         0         0         0         0
pc2_2     0         1         0         0         0         0
pc2_3     0         0         1         0         0         0
===============================================================

The beta value for the first PC of the second set of items (in a model without the first set) would be $b_{pc2_1}=0.923$, which is larger than the $b_{pc1_1}=0.722$ for the first PC of the first set of independent items.

Seeing the betas in the joint model again requires only inverting the submatrix of the loadings of the whole set of 5 PC markers and postmultiplying the first 5 columns by that inverse. This gives the "loadings" when the 5 PCs are taken as the axes of the coordinate system. We get

[42] beta = pc2[*,1..5] * inv(pc2[7..11,1..5]) || pc2[*,6]

beta      pc1_1     pc1_2     pc2_1     pc2_2     pc2_3     c6
--------------------------------------------------------------
x1_1      0.994    -0.109     0         0         0         0
x1_2      0.994     0.109     0         0         0         0
--------------------------------------------------------------
x2_1      0         0         0.997    -0.078    -0.022     0
x2_2      0         0         0.999    -0.041     0.027     0
x2_3      0         0         0.993     0.119    -0.005     0
--------------------------------------------------------------
y         0.540     0.375     0.421     0.169     0.045     0.116
===============================================================
pc1_1     1         0         0         0         0         0
pc1_2     0         1         0         0         0         0
--------------------------------------------------------------
pc2_1     0         0         1         0         0         0
pc2_2     0         0         0         1         0         0
pc2_3     0         0         0         0         1         0
===============================================================

In short:

beta      pc1_1     pc1_2  |   pc2_1     pc2_2     pc2_3   |  c6
---------------------------+-------------------------------+--------
y         0.540     0.375  |   0.421     0.169     0.045   |  0.116
===========================+===============================+========

In the joint model the "main direction" of the first set of independents has a beta weight of 0.540 and that of the second set a beta weight of 0.421. The value at $c6$ is given only for completeness: its square, $0.116^2$, is the unexplained variance of the dependent item $y$.
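The same joint computation can be sketched directly from the correlation matrix: build the weights that turn the x-variables into the five standardized group-wise PC scores, then regress y on those scores. Again, with R from the first snippet the numbers will only be in the ballpark of the table above, since the update's tables come from their own simulation and the small PCs are sensitive to rounding.

    import numpy as np

    # R as in the first snippet: x1-group = cols 0..1, x2-group = cols 2..4, y = col 5
    def pc_weights(Rblock):
        """Columns turn standardized raw scores into standardized PC scores of the block."""
        evals, evecs = np.linalg.eigh(Rblock)
        order = np.argsort(evals)[::-1]
        return evecs[:, order] / np.sqrt(evals[order])

    W = np.zeros((5, 5))
    W[0:2, 0:2] = pc_weights(R[0:2, 0:2])   # pc1_1, pc1_2
    W[2:5, 2:5] = pc_weights(R[2:5, 2:5])   # pc2_1, pc2_2, pc2_3

    R_pp = W.T @ R[:5, :5] @ W              # correlations among the five PC scores
    R_py = W.T @ R[:5, 5]                   # correlations of the PC scores with y

    beta = np.linalg.solve(R_pp, R_py)      # betas of y on [pc1_1, pc1_2, pc2_1, pc2_2, pc2_3]
    r2 = R_py @ beta                        # explained variance; 1 - r2 ~ 0.117**2
    print(np.round(np.abs(beta), 3), np.round(1 - r2, 4))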


No. PCA returns mixtures of the (typically correlated) variables that contribute most to the observed variance. Linear regression does not mix the variables in that way; it expresses the output variables as linear combinations of the input variables. (Linear combinations? Yes. Mixtures of input and output variables? No.)

PCA is computed by singular value decomposition; linear regression is not. One consequence: linear regression misbehaves if two or more input variables are strongly correlated (i.e., the system is nearly rank deficient), typically producing coefficients that depend sensitively on noise. PCA merges the (almost) indistinguishable variables and keeps giving sane results. This merging of strongly correlated variables can reduce the apparent dimensionality of the system, generally improving explanatory power and prediction.

The process of using PCA for linear regression is called Principal Component Regression.
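A minimal scikit-learn sketch of Principal Component Regression (assuming scikit-learn is available; the toy data merely imitates the two nearly collinear predictor groups discussed above):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n = 500
    z1, z2 = rng.normal(size=n), rng.normal(size=n)        # two latent group directions
    X = np.column_stack([
        z1, z1 + 0.05 * rng.normal(size=n),                # x1 group: nearly identical
        z2, z2 + 0.02 * rng.normal(size=n), z2 + 0.05 * rng.normal(size=n),  # x2 group
    ])
    y = z1 + 2 * z2 + 0.3 * rng.normal(size=n)

    # PCR: standardize, keep the two dominant PCs, regress on the PC scores
    pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
    pcr.fit(X, y)
    print(pcr.score(X, y))                                  # R^2 of the 2-component PCR
    print(pcr.named_steps["linearregression"].coef_)        # betas on the retained PCs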