Does gradient descent converge to a minimum-norm solution in least-squares problems?

From the paper [0] in question:

When optimizing underdetermined problems with multiple global minima, the choice of optimization algorithm can play a crucial role in biasing us toward a specific global minima, even though this bias is not explicitly specified in the objective or problem formulation. For example, using gradient descent to optimize an unregularized, underdetermined least squares problem would yield the minimum Euclidean norm solution, while using coordinate descent or preconditioned gradient descent might yield a different solution. Such implicit bias, which can also be viewed as a form of regularization, can play an important role in learning.
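As a rough numerical illustration of that claim (a minimal NumPy sketch, not taken from the paper; the problem size, the diagonal preconditioner `P`, the step size, and the iteration count are arbitrary choices), plain gradient descent started at zero recovers the minimum Euclidean norm solution, while a preconditioned variant typically lands on a different solution of the same system:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 6                                  # underdetermined: more unknowns than equations
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

P = np.diag(rng.uniform(0.5, 2.0, size=n))   # arbitrary SPD (diagonal) preconditioner
mu = 0.01                                    # small enough for both iterations to converge

x_gd = np.zeros(n)                           # plain gradient descent, started at zero
x_pgd = np.zeros(n)                          # preconditioned gradient descent, started at zero
for _ in range(200_000):
    x_gd = x_gd - mu * A.T @ (A @ x_gd - b)
    x_pgd = x_pgd - mu * P @ (A.T @ (A @ x_pgd - b))

x_ln = np.linalg.pinv(A) @ b                 # minimum Euclidean norm solution

print(np.allclose(A @ x_gd, b), np.allclose(A @ x_pgd, b))   # both solve A x = b
print(np.allclose(x_gd, x_ln))                               # plain GD finds the min-norm solution
print(np.allclose(x_pgd, x_ln), np.linalg.norm(x_pgd) >= np.linalg.norm(x_ln))
# preconditioned GD typically lands on a different (and hence longer) solution
```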

Given fat matrix $\mathrm A \in \mathbb R^{m \times n}$ ($m < n$) and vector $\mathrm b \in \mathbb R^m$, consider the following linear system in $\mathrm x \in \mathbb R^n$

$$\rm A x = b$$

where $\rm A$ has full row rank. Let the singular value decomposition (SVD) of $\rm A$ be as follows

$$\mathrm A = \mathrm U \Sigma \mathrm V^\top = \mathrm U \begin{bmatrix} \Sigma_1 & \mathrm O \end{bmatrix} \begin{bmatrix} \mathrm V_1^\top \\ \mathrm V_2^\top \end{bmatrix} = \mathrm U \Sigma_1 \mathrm V_1^\top$$

The least-norm solution of $\rm A x = b$ is given by

$$\mathrm x_{\text{LN}} := \mathrm A^\top \left( \mathrm A \mathrm A^\top \right)^{-1} \mathrm b = \cdots = \mathrm V_1 \Sigma_1^{-1} \mathrm U^\top \mathrm b$$

where the inverse of $\mathrm A \mathrm A^\top$ exists because $\rm A$ has full row rank.
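As a quick numerical sanity check of the two expressions for $\mathrm x_{\text{LN}}$ (a sketch on a random instance; note that for a full-row-rank fat matrix the Moore–Penrose pseudoinverse computed by `np.linalg.pinv` equals $\mathrm A^\top (\mathrm A \mathrm A^\top)^{-1}$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 7
A = rng.standard_normal((m, n))          # full row rank with probability 1
b = rng.standard_normal(m)

# Normal-equations form of the least-norm solution
x_ln = A.T @ np.linalg.solve(A @ A.T, b)

# SVD form: A = U [Sigma_1  O] [V_1  V_2]^T
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma1 = np.diag(s)                      # m x m and invertible (full row rank)
V1 = Vt[:m].T                            # n x m, orthonormal basis of the row space of A
x_ln_svd = V1 @ np.linalg.solve(Sigma1, U.T @ b)

print(np.allclose(x_ln, x_ln_svd))               # the two expressions agree
print(np.allclose(x_ln, np.linalg.pinv(A) @ b))  # and both equal pinv(A) @ b
print(np.allclose(A @ x_ln, b))                  # x_LN indeed solves A x = b
```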


Gradient descent

Let cost function $f : \mathbb R^n \to \mathbb R$ be defined by

$$f (\mathrm x) := \frac12 \left\| \rm{A x - b} \right\|_2^2$$

whose gradient is

$$\nabla f (\mathrm x) = \rm A^\top \left( A x - b \right)$$
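A quick finite-difference check of this gradient formula (a sketch on an arbitrary random instance; the step $h$ and the tolerance are ad hoc choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 7
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

f = lambda z: 0.5 * np.linalg.norm(A @ z - b) ** 2
grad = A.T @ (A @ x - b)                             # closed-form gradient at x

h = 1e-6
num_grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(grad, num_grad, atol=1e-4))        # central differences agree
```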

Using gradient descent with step $\mu > 0$,

$$\begin{aligned} {\rm x}_{k+1} &= {\rm x}_k - \mu \nabla f ({\rm x}_k)\\ &= \left( {\rm I} - \mu {\rm A^\top A} \right) {\rm x}_k + \mu {\rm A^\top b}\end{aligned}$$

Hence,

$${\rm x}_k = \left( {\rm I} - \mu {\rm A^\top A} \right)^k {\rm x}_0 + \mu \sum_{\ell = 0}^{k-1} \left( {\rm I} - \mu {\rm A^\top A} \right)^{\ell} {\rm A^\top b}$$
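This unrolled form can be checked against the plain iteration numerically (a sketch; the sizes, the number of steps $k$, and the step size $\mu$ are arbitrary, with $\mu$ chosen below $2 / \sigma_{\max}^2(\mathrm A)$ for stability):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x0 = rng.standard_normal(n)

mu = 1.0 / np.linalg.norm(A, 2) ** 2     # safely below 2 / sigma_max(A)^2
k = 50

# Run the recursion x_{k+1} = (I - mu A^T A) x_k + mu A^T b
x = x0.copy()
for _ in range(k):
    x = x - mu * A.T @ (A @ x - b)

# Closed form: (I - mu A^T A)^k x0 + mu * sum_{l=0}^{k-1} (I - mu A^T A)^l A^T b
M = np.eye(n) - mu * A.T @ A
x_closed = np.linalg.matrix_power(M, k) @ x0 \
    + mu * sum(np.linalg.matrix_power(M, l) for l in range(k)) @ (A.T @ b)

print(np.allclose(x, x_closed))          # the iteration matches its closed form
```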

Letting $\rm y := V^\top x$, we rewrite

$$\begin{aligned} {\rm y}_k &= \left( {\rm I} - \mu \Sigma^\top \Sigma \right)^k {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \left( {\rm I} - \mu \Sigma^\top \Sigma \right)^{\ell} \Sigma^\top {\rm U^\top b}\\ &= \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^k & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} \begin{bmatrix} \Sigma_1\\ \mathrm O \end{bmatrix} {\rm U^\top b}\\ &= \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^k & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1\\ \mathrm O\end{bmatrix} {\rm U^\top b} \end{aligned}$$
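Numerically, the change of variables does exactly this block-diagonal decoupling (a sketch verifying that $\mathrm V^\top \left( \mathrm I - \mu \mathrm A^\top \mathrm A \right) \mathrm V$ has the block structure above and that $\mathrm V^\top \mathrm A^\top = \Sigma^\top \mathrm U^\top$; the instance and $\mu$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 5
A = rng.standard_normal((m, n))
mu = 0.05

U, s, Vt = np.linalg.svd(A, full_matrices=True)
V = Vt.T
Sigma1 = np.diag(s)

# V^T (I - mu A^T A) V  ==  blkdiag(I - mu Sigma_1^2, I)
lhs = V.T @ (np.eye(n) - mu * A.T @ A) @ V
rhs = np.eye(n)
rhs[:m, :m] = np.eye(m) - mu * Sigma1 @ Sigma1
print(np.allclose(lhs, rhs))

# V^T A^T  ==  Sigma^T U^T, i.e. Sigma_1 U^T stacked on top of an (n-m) x m zero block
print(np.allclose(V.T @ A.T, np.vstack([Sigma1 @ U.T, np.zeros((n - m, m))])))
```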

Choosing $\mu > 0$ such that all eigenvalues of ${\rm I} - \mu \Sigma_1^2$ lie strictly inside the unit circle, i.e. $0 < \mu < 2 / \sigma_{\max}^2 (\mathrm A)$, we have ${\rm y}_k \to {\rm y}_{\infty}$, where

$${\rm y}_{\infty} = \begin{bmatrix} \mathrm O & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{\infty} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1\\ \mathrm O\end{bmatrix} {\rm U^\top b}$$

where, summing the geometric (Neumann) series,

$$\mu \sum_{\ell = 0}^{\infty} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1 = \mu \left( {\rm I} - \left( {\rm I} - \mu \Sigma_1^2 \right) \right)^{-1} \Sigma_1 = \mu \left( \mu \Sigma_1^2 \right)^{-1} \Sigma_1 = \Sigma_1^{-1}$$

and, thus,

$${\rm y}_{\infty} = \begin{bmatrix} \mathrm O & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \begin{bmatrix} \Sigma_1^{-1} \\ \mathrm O\end{bmatrix} {\rm U^\top b}$$

Since ${\rm x} = \rm V y$,

$$\boxed{\; {\rm x}_{\infty} = {\rm V}_2 {\rm V}_2^\top {\rm x}_0 + \underbrace{{\rm V}_1 \Sigma_1^{-1}{\rm U^\top b}}_{= \mathrm x_{\text{LN}}} \;}$$

Therefore, we conclude that if ${\rm x}_0$ is orthogonal to the null space of $\rm A$ (in particular, if ${\rm x}_0 = \mathrm 0$), then ${\rm V}_2^\top {\rm x}_0 = \mathrm 0$ and gradient descent converges to the least-norm solution $\mathrm x_{\text{LN}}$.
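The boxed limit is easy to confirm numerically (a sketch; the random instance, the step size, and the large-but-finite iteration count standing in for the exact limit are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x0 = rng.standard_normal(n)                  # generic start, with a null-space component

U, s, Vt = np.linalg.svd(A, full_matrices=True)
V1, V2 = Vt[:m].T, Vt[m:].T                  # bases of the row space and null space of A
x_ln = V1 @ np.diag(1.0 / s) @ U.T @ b       # least-norm solution

def run_gd(x, mu, iters=200_000):
    for _ in range(iters):
        x = x - mu * A.T @ (A @ x - b)
    return x

mu = 1.0 / s[0] ** 2                         # below 2 / sigma_max(A)^2, so GD converges
x_inf = run_gd(x0, mu)

print(np.allclose(x_inf, V2 @ V2.T @ x0 + x_ln))   # matches the boxed formula
print(np.allclose(run_gd(np.zeros(n), mu), x_ln))  # x0 = 0 yields the least-norm solution
```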


[0] Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro, *Implicit Regularization in Matrix Factorization*, May 2017.


optimization numerical-optimization convex-optimization quadratic-programming gradient-descent least-squares least-norm matrices svd


If you initialize gradient descent at a point $x_0$ that is a minimizer of the objective function but not a least-norm minimizer, then $\nabla f(x_0) = A^\top (A x_0 - b) = 0$, so the gradient descent iteration has $x_k = x_0$ for all $k \geq 0$. We won't move anywhere. So gradient descent does not necessarily converge to a least-norm solution.
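A concrete instance of this observation (a sketch along the lines of this answer; the null-space shift used to make $x_0$ a non-least-norm minimizer is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x_ln = np.linalg.pinv(A) @ b                        # least-norm minimizer
_, _, Vt = np.linalg.svd(A, full_matrices=True)
x0 = x_ln + Vt[m:].T @ rng.standard_normal(n - m)   # still solves A x = b, but with larger norm

mu = 0.01
x = x0.copy()
for _ in range(1_000):
    x = x - mu * A.T @ (A @ x - b)                  # the gradient is (numerically) zero at x0

print(np.allclose(x, x0))                           # gradient descent never moves
print(np.linalg.norm(x0) > np.linalg.norm(x_ln))    # yet x0 is not the least-norm solution
```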