Prove that the trace of the matrix product $U'AU$ is maximized by setting $U$'s columns to $A$'s eigenvectors

Note that with your substitution, finding the optimal $B$ with orthonormal columns is equivalent to finding the optimal $U$. What we're trying to maximize, then, is $$ \operatorname{trace}(B^T\Delta B) = \sum_{i=1}^r b_i^T\Delta b_i, $$ where $b_i$ is the $i$th column of $B$. It turns out that the optimum occurs when each $b_i$ is a standard basis vector.

Because it's a result that makes intuitive sense, it's common to mistakenly assume that it's easy to prove. See, for instance, the mistaken proof given here. Why is the result intuitive? I think that's because it's clear that if we perform a greedy optimization one column $b_i$ at a time, then we'd arrive at the correct result. However, I would say there's no (simple) a priori justification that this should give us the right answer. If you look at the original paper deriving the results required for PCA, you'll see that the required proof takes some Lie-group trickery.

The most concise modern proof uses the appropriate trace inequality. I would prove the result as follows: to find an upper bound for the maximum, apply the von Neumann trace inequality. In particular, we have $$ \operatorname{trace}(U^T\Sigma U) = \operatorname{trace}(\Sigma[UU^T]) \leq \sum_{i=1}^n \lambda_i(\Sigma)\lambda_i(UU^T), $$ where $\lambda_i(M)$ denotes the $i$th eigenvalue of a positive semidefinite matrix $M$, with eigenvalues taken in decreasing order. Note that $UU^T$ is a projection matrix of rank $r$, so we have $\lambda_1(UU^T) = \cdots = \lambda_r(UU^T) = 1$ and $\lambda_{r+1}(UU^T) = \cdots = \lambda_n(UU^T) = 0$. Conclude that $\sum_{i=1}^r\lambda_i(\Sigma)$ is an upper bound.

From there, it suffices to show that taking the columns of $U$ to be the eigenvectors of $\Sigma$ (in the correct order) allows you to attain this upper bound.
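
As a quick numerical sanity check (not part of the proof), here is a small NumPy sketch; the dimensions, the rank $r$, and the random seed are arbitrary choices of mine. It draws a random symmetric positive semidefinite $\Sigma$, confirms that random $U$ with orthonormal columns never exceed the bound $\sum_{i=1}^r\lambda_i(\Sigma)$, and confirms that the top-$r$ eigenvectors attain it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 3

# Random symmetric positive semidefinite Sigma.
X = rng.standard_normal((n, n))
Sigma = X @ X.T

# Eigenvalues/eigenvectors, reordered to decreasing eigenvalue order.
eigvals, eigvecs = np.linalg.eigh(Sigma)            # eigh returns increasing order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
upper_bound = eigvals[:r].sum()                     # sum of the r largest eigenvalues

# The bound is attained by the top-r eigenvectors...
U_star = eigvecs[:, :r]
print(np.trace(U_star.T @ Sigma @ U_star), upper_bound)   # equal up to rounding

# ...and no random U with orthonormal columns exceeds it.
for _ in range(1000):
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))      # orthonormal columns
    assert np.trace(U.T @ Sigma @ U) <= upper_bound + 1e-9
```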


To further add to @Omnomnomnom's answer:

The calculations you posted show that the original problem is equivalent to the case where you have a diagonal matrix $\Delta$ instead of a merely symmetric $\Sigma$. Note here that your $B$ again has orthonormal columns. You can now solve this new problem

$$\max \lbrace {\rm tr} ( B^T \Delta B ) | B \in \mathbb{R}^{n,r}, \, B^T B={\rm Id}_r \rbrace,$$

which is a bit easier than the original, and the maximum will be the same. Also, to get the desired matrix you can then set $U=VB$.

It turns out that the maximum is the sum of the $r$ largest eigenvalues and that it is attained at $B=(e_1,...,e_r) \in \mathbb{R}^{n,r}$, which then yields $U = VB= (v_1,...,v_r)$, where the $v_i$ are the eigenvectors of $\Sigma$ ordered by decreasing eigenvalue.
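
As a hedged numerical illustration (the sizes and the seed below are arbitrary), this change of variables can be checked directly: with $\Sigma = V\Delta V^T$ and $B=(e_1,...,e_r)$, the matrix $U=VB$ is just the first $r$ eigenvector columns and attains the claimed maximum.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 5, 2

X = rng.standard_normal((n, n))
Sigma = X @ X.T                                # symmetric PSD test matrix

lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]                 # decreasing eigenvalue order
Delta = np.diag(lam)

B = np.eye(n)[:, :r]                           # B = (e_1, ..., e_r)
U = V @ B                                      # = first r eigenvector columns

print(np.allclose(U, V[:, :r]))                             # True
print(np.trace(B.T @ Delta @ B), np.trace(U.T @ Sigma @ U),
      lam[:r].sum())                                         # all (nearly) equal
```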

The author of the proof you are following suggests that the solution of the new maximization is obvious. However, Omnomnomnom and I do not agree with that. You could therefore take the approach shown in his or her answer, or you could try to finish your proof. This is actually quite tedious, as I will try to show, so I'd suggest using Omnomnomnom's proof or Nick's suggestion for the upper bound here.

In order to finish your proof, start by calculating $${\rm tr} ( B^T \Delta B ) = \sum_{j=1}^r \sum_{i=1}^n b_{i,j}^2 \lambda_i = \sum_{i=1}^n \Big(\sum_{j=1}^r b_{i,j}^2\Big)\lambda_i = \sum_{i=1}^n h_i \lambda_i,$$ where $B=(b_{i,j})$ and $h_i := \sum_{j=1}^r b_{i,j}^2$. You could also have arrived at this point by simply expressing the columns of $U$ as linear combinations of an orthonormal basis of eigenvectors.
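
If it helps, this identity is easy to confirm numerically (a small sketch with arbitrary sizes, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 6, 3

lam = np.sort(rng.random(n))[::-1]                 # lambda_1 >= ... >= lambda_n >= 0
Delta = np.diag(lam)
B, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal columns

h = (B**2).sum(axis=1)                             # h_i = sum_j b_{i,j}^2
print(np.trace(B.T @ Delta @ B), h @ lam)          # the two values agree
```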

Now we'll show two properties of the above linear combination. For the first one, add columns to $B$ to build an orthogonal square matrix $C=[B|R]=(c_{i,j})$ and note that for each $i$

$$0 \le h_i = \sum_{j=1}^r b_{i,j}^2 = \sum_{j=1}^r c_{i,j}^2 \le \sum_{j=1}^n c_{i,j}^2 = 1$$

holds because $CC^T={\rm Id}_n$. Secondly, for the sum of the coefficients we have

$$ \sum_{i=1}^n h_i = \sum_{j=1}^r \langle b_j,b_j\rangle = r \cdot 1 = r$$.
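
Both properties can again be checked numerically for a random $B$ with orthonormal columns (a tiny self-contained sketch, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 6, 3

B, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal columns
h = (B**2).sum(axis=1)                              # h_i = sum_j b_{i,j}^2

print(np.all((0 <= h) & (h <= 1 + 1e-12)))          # each h_i lies in [0, 1]
print(np.isclose(h.sum(), r))                       # the h_i sum to r
```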

Now we've arrived at exactly this problem from convex optimization (with $a=(\lambda_1,...,\lambda_n)^T$ and $x_{\min}=0$, $x_{\max}=1$). Instead of solving it with those methods, we can find the maximum by hand. We bound

$$\sum_{i=1}^n h_i \lambda_i \le \sum_{i=1}^{r-1} h_i \lambda_i + \lambda_r \sum_{i=r}^n h_i = \sum_{i=1}^{r-1} h_i \lambda_i + \lambda_r \Big(r-\sum_{i=1}^{r-1} h_i\Big) = r \lambda_r + \sum_{i=1}^{r-1} h_i (\lambda_i - \lambda_r)$$, where the inequality uses $\lambda_i \le \lambda_r$ for $i \ge r$, and the first equality uses $\sum_{i=1}^n h_i = r$.

Now we can establish an upper bound for our maximization problem:

$$\max \lbrace {\rm tr} ( B^T \Delta B ) | B \in \mathbb{R}^{n,r}, \, B^T B={\rm Id}_r \rbrace \le \max \lbrace \sum_{i=1}^n h_i \lambda_i | 0 \le h_i \le 1, \sum_{i=1}^n h_i = r \rbrace \\ \le \max \lbrace r \lambda_r + \sum_{i=1}^{r-1} h_i (\lambda_i -\lambda_r) | 0 \le h_i \le 1\rbrace \le \sum_{i=1}^r \lambda_i$$.

For the last inequality we've set $h_i=1$, since the coefficients $\lambda_i - \lambda_r$ are nonnegative. Since this bound is attained with $h_1 = ... = h_r = 1$ and $h_{r+1} = ... = h_n = 0$, i.e. with $B=(e_1,...,e_r)$, we have found our maximum and with it the desired $B$ or $U$.
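
If you do want to see the relaxed problem over the $h_i$ treated with convex-optimization machinery, here is a short sketch using SciPy's linear-programming solver (the random instance is an arbitrary choice of mine); its optimum matches the sum of the $r$ largest $\lambda_i$, and the optimal $h$ is ones followed by zeros.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, r = 7, 3

lam = np.sort(rng.random(n))[::-1]             # lambda_1 >= ... >= lambda_n

# maximize sum_i h_i * lambda_i  subject to  0 <= h_i <= 1 and sum_i h_i = r
res = linprog(c=-lam,                          # linprog minimizes, so negate
              A_eq=np.ones((1, n)), b_eq=[r],
              bounds=[(0, 1)] * n)

print(-res.fun, lam[:r].sum())                 # both equal the top-r eigenvalue sum
print(np.round(res.x, 6))                      # optimal h: r ones, then zeros
```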