Proximal Mapping of Least Squares with $ {L}_{1} $ and $ {L}_{2} $ Norm Regularization Terms (Similar to Elastic Net)

Analytic Solution

Remark
This derivation is an extension of dohmatob's solution (adding details not given in the linked PDF).

Defining:

$$ \hat{x} = \operatorname{prox}_{{\lambda}_{1} {\left\| \cdot \right\|}_{1} + {\lambda}_{2} {\left\| \cdot \right\|}_{2}} \left( b \right) = \arg \min_{x} \left\{ \frac{1}{2} {\left\| x - b \right\|}_{2}^{2} + {\lambda}_{1} {\left\| x \right\|}_{1} + {\lambda}_{2} {\left\| x \right\|}_{2} \right\} $$

This implies:

$$ 0 \in \hat{x} - b + {\lambda}_{1} \partial {\left\| \hat{x} \right\|}_{1} + {\lambda}_{2} \partial {\left\| \hat{x} \right\|}_{2} $$

Where:

$$ {u}_{i} \in \partial \left| \cdot \right| \left( \hat{x}_{i} \right) = \begin{cases} \left[-1, 1 \right] & \text{ if } \hat{x}_{i} = 0 \\ \operatorname{sgn}\left( \hat{x}_{i} \right) & \text{ if } \hat{x}_{i} \neq 0 \end{cases} , \; v \in \partial {\left\| \cdot \right\|}_{2} \left( \hat{x} \right) = \begin{cases} \left\{ z \mid \left\| z \right\|_{2} \leq 1 \right\} & \text{ if } \hat{x} = \boldsymbol{0} \\ \frac{ \hat{x} }{ \left\| \hat{x} \right\|_{2} } & \text{ if } \hat{x} \neq \boldsymbol{0} \end{cases} $$

Notes

  • The optimization problem balances minimizing the norms of $ \hat{x} $ against keeping it close to $ b $.
  • For any element of $ \hat{x} $ which is not zero, its sign is identical to that of the corresponding element of $ b $. Namely $ \forall i \in \left\{ j \mid \hat{x}_{j} \neq 0 \right\}, \, \operatorname{sgn} \left( \hat{x}_{i} \right) = \operatorname{sgn} \left( {b}_{i} \right) $. The reason is simple: if $ \operatorname{sgn} \left( \hat{x}_{i} \right) \neq \operatorname{sgn} \left( {b}_{i} \right) $, then flipping the sign of $ \hat{x}_{i} $ would decrease the distance to $ b $ while keeping both norms unchanged, contradicting the optimality of $ \hat{x} $.

Case $ \hat{x} = \boldsymbol{0} $

In this case the above suggests:

$$ b = {\lambda}_{1} u + {\lambda}_{2} v \iff b - {\lambda}_{1} u = {\lambda}_{2} v $$

Since $ {u}_{i} \in \left[ -1, 1 \right] $ and $ \left\| v \right\|_{2} \leq 1 $, one can set $ \hat{x} = \boldsymbol{0} $ as long as there exists a feasible $ u $ with $ \left\| b - {\lambda}_{1} u \right\|_{2} \leq {\lambda}_{2} $, since then $ v = \left( b - {\lambda}_{1} u \right) / {\lambda}_{2} $ satisfies its constraint. Finding the edge case (with regard to $ b $) is simple since the minimization of $ \left\| b - {\lambda}_{1} u \right\|_{2} $ over $ u $ can be done element-wise between $ b $ and $ u $: the minimum is attained at $ {u}_{i} = \operatorname{sign} \left( {b}_{i} \right) $ where $ \left| {b}_{i} \right| > {\lambda}_{1} $ and at $ {u}_{i} = {b}_{i} / {\lambda}_{1} $ otherwise, which yields:

$$ \hat{x} = \boldsymbol{0} \iff \min_{u} \left\| b - {\lambda}_{1} u \right\|_{2} \leq {\lambda}_{2} \iff \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} \leq {\lambda}_{2} $$

Where $ \mathcal{S}_{ \lambda } \left( \cdot \right) $ is the Soft Threshold function with parameter $ \lambda $.
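For concreteness, here is a minimal NumPy sketch of the soft-threshold operator and of the zero test above (the function names are illustrative, not taken from any referenced code):

```python
import numpy as np

def soft_threshold(b, lam):
    """Element-wise soft threshold: S_lam(b) = sign(b) * max(|b| - lam, 0)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def prox_is_zero(b, lam1, lam2):
    """True when the prox of lam1 * ||.||_1 + lam2 * ||.||_2 at b is the zero vector."""
    return np.linalg.norm(soft_threshold(b, lam1)) <= lam2
```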

Case $ \hat{x} \neq \boldsymbol{0} $

In this case the above suggests:

$$ \begin{align*} 0 & = \hat{x} - b + {\lambda}_{1} u + {\lambda}_{2} \frac{ \hat{x} }{ \left\| \hat{x} \right\|_{2} } \\ & \iff b - {\lambda}_{1} u = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \hat{x} \end{align*} $$

For elements where $ \hat{x}_{i} = 0 $ the equation implies $ \left| {b}_{i} \right| \leq {\lambda}_{1} $. Namely $ \forall i \in \left\{ j \mid \hat{x}_{j} = 0 \right\}, \, {b}_{i} - {\lambda}_{1} {u}_{i} = 0 $ for some $ {u}_{i} \in \left[ -1, 1 \right] $, which is possible if and only if $ \left| {b}_{i} \right| \leq {\lambda}_{1} $.

This makes the left-hand side of the equation a soft-thresholding operator: for the zero elements of $ \hat{x} $ it vanishes (since $ \left| {b}_{i} \right| \leq {\lambda}_{1} $), while for the non-zero elements the sign observation from the notes above gives $ {u}_{i} = \operatorname{sign} \left( \hat{x}_{i} \right) = \operatorname{sign} \left( {b}_{i} \right) $. Hence the above becomes:

$$ \mathcal{S}_{ {\lambda}_{1} } \left( b \right) = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \hat{x} $$

Taking the $ {L}_{2} $ norm of both sides of the equation yields:

$$ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \left\| \hat{x} \right\|_{2} \Rightarrow \left\| \hat{x} \right\|_{2} = \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} $$

Plugging this into the above yields:

$$ \hat{x} = \frac{ \mathcal{S}_{ {\lambda}_{1} } \left( b \right) }{ 1 + \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} } } = \left( 1 - \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) $$
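For completeness, the second equality uses

$$ 1 + \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} } = \frac{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} } $$

so dividing $ \mathcal{S}_{ {\lambda}_{1} } \left( b \right) $ by this factor scales it by $ \frac{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} } = 1 - \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} } $.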

Remembering that in this case it is guaranteed that $ {\lambda}_{2} < \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} $, the term in the parentheses is positive as required.

Summary

The solution is given by:

$$ \begin{align*} \hat{x} = \operatorname{prox}_{{\lambda}_{1} {\left\| \cdot \right\|}_{1} + {\lambda}_{2} {\left\| \cdot \right\|}_{2}} \left( b \right) & = \begin{cases} \boldsymbol{0} & \text{ if } \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} \leq {\lambda}_{2} \\ \left( 1 - \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) & \text{ if } \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} > {\lambda}_{2} \end{cases} \\ & = \left( 1 - \frac{ {\lambda}_{2} }{ \max \left\{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2}, {\lambda}_{2} \right\} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \\ & = \operatorname{prox}_{ {\lambda}_{2} {\left\| \cdot \right\|}_{2} } \left( \operatorname{prox}_{ {\lambda}_{1} {\left\| \cdot \right\|}_{1} } \left( b\right)\right) \end{align*} $$

This matches the derivation in the paper On Decomposing the Proximal Map (See Lecture Video - On Decomposing the Proximal Map) mentioned by @littleO.
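For reference, a direct NumPy implementation of this closed form might look like the following sketch (the function name is mine and only illustrative):

```python
import numpy as np

def prox_l1_l2(b, lam1, lam2):
    """Prox of lam1 * ||.||_1 + lam2 * ||.||_2 at b, using the closed form above."""
    s = np.sign(b) * np.maximum(np.abs(b) - lam1, 0.0)   # soft threshold S_lam1(b)
    s_norm = np.linalg.norm(s)
    if s_norm <= lam2:                                    # the hat{x} = 0 case
        return np.zeros_like(b)
    return (1.0 - lam2 / s_norm) * s                      # shrink by 1 - lam2 / ||S_lam1(b)||_2
```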

Solving as an Optimization Problem

This section illustrates 3 different methods for solving the above problem (which is very similar to Elastic Net Regularization).

Sub Gradient Method

A Sub Gradient of the objective is given by:

$$ \begin{cases} x - b + {\lambda}_{1} \operatorname{sgn} \left( x \right ) & \text{ if } x = \boldsymbol{0} \\ x - b + {\lambda}_{1} \operatorname{sgn} \left( x \right ) + {\lambda}_{2} \frac{x}{ {\left\| x \right\|}_{2} } & \text{ if } x \neq \boldsymbol{0} \end{cases} \in \partial \left\{ \frac{1}{2} {\left\| x - b \right\|}_{2}^{2} + {\lambda}_{1} {\left\| x \right\|}_{1} + {\lambda}_{2} {\left\| x \right\|}_{2} \right\} $$

The Sub Gradient iterations then follow directly.
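A minimal sketch of such an iteration (the diminishing $ 1 / k $ step size is my choice; it is not dictated by the derivation above):

```python
import numpy as np

def sub_gradient_method(b, lam1, lam2, num_iterations=5000):
    """Subgradient descent on 0.5 * ||x - b||_2^2 + lam1 * ||x||_1 + lam2 * ||x||_2."""
    x = np.zeros_like(b)
    for k in range(1, num_iterations + 1):
        g = x - b + lam1 * np.sign(x)      # np.sign(0) = 0 is a valid subgradient choice
        x_norm = np.linalg.norm(x)
        if x_norm > 0:                     # the L2 norm is differentiable away from 0
            g = g + lam2 * x / x_norm
        x = x - (1.0 / k) * g              # diminishing step size
    return x
```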

The Split Method

This is based on A Primal Dual Splitting Method for Convex Optimization Involving Lipschitzian, Proximable and Linear Composite Terms.
The algorithm used is Algorithm 3.2 on page 5, with $ L = I $ (the Identity Operator), $ F \left( x \right) = \frac{1}{2} \left\| x - b \right\|_{2}^{2} $, $ g \left( x \right) = {\lambda}_{1} \left\| x \right\|_{1} $ and $ h \left( x \right) = {\lambda}_{2} \left\| x \right\|_{2} $.
The Prox Operators are given by the $ {L}_{1} $ and $ {L}_{2} $ Threshold Operators.
One must pay attention to correctly scale the parameters of the Prox, since Moreau's Identity is used for the prox of $ {h}^{\ast} $.
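The following is a rough sketch of such an iteration for this instance (the step sizes `tau` and `sigma` are my own choice and should be checked against the step-size condition in the paper; the updates follow the generic primal-dual scheme rather than the paper verbatim):

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2_norm(v, lam):
    return (1.0 - lam / max(np.linalg.norm(v), lam)) * v

def primal_dual_split(b, lam1, lam2, num_iterations=1000, tau=0.5, sigma=1.0):
    """Primal-dual splitting with L = I, F(x) = 0.5 * ||x - b||_2^2 (smooth term),
    g = lam1 * ||.||_1 and h = lam2 * ||.||_2; the prox of h* is evaluated
    through Moreau's identity."""
    x = np.zeros_like(b)
    y = np.zeros_like(b)                                        # dual variable
    for _ in range(num_iterations):
        x_next = soft_threshold(x - tau * ((x - b) + y), tau * lam1)
        w = y + sigma * (2.0 * x_next - x)
        y = w - sigma * prox_l2_norm(w / sigma, lam2 / sigma)   # Moreau's identity
        x = x_next
    return x
```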

The ADMM with 3 Blocks Method

The Scaled Form is used, as in Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, page 15.
The ADMM for 3 Blocks is based on Global Convergence of Unmodified 3 Block ADMM for a Class of Convex Minimization Problems.
The split is made with 3 variables which obey $ A x - B y - C z = 0 $, where $ A $ is just the identity matrix stacked twice (namely it replicates the vector, $ A x = \left[ {x}^{T}, {x}^{T} \right]^{T} $). Then, using $ B, C $, one can enforce $ x = y = z $ as required.
Since each variable is multiplied by a matrix, each step is, in general, solved using an auxiliary algorithm (it is not a "Vanilla Prox"). Yet one can extract a Prox step by utilizing the specific form of the matrices (extracting only the relevant part of the vector), as in the sketch below.
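A consensus-style sketch of this split, with the Prox steps already extracted as described (the penalty parameter `rho` and the fixed iteration count are arbitrary choices of mine):

```python
import numpy as np

def admm_3_blocks(b, lam1, lam2, rho=1.0, num_iterations=500):
    """Scaled-form 3-block ADMM for 0.5 * ||x - b||_2^2 + lam1 * ||y||_1 + lam2 * ||z||_2
    subject to x = y and x = z (the stacked constraint described above)."""
    x = np.zeros_like(b); y = np.zeros_like(b); z = np.zeros_like(b)
    u1 = np.zeros_like(b); u2 = np.zeros_like(b)          # scaled dual variables
    for _ in range(num_iterations):
        # x update: unconstrained quadratic, solved in closed form
        x = (b + rho * (y - u1) + rho * (z - u2)) / (1.0 + 2.0 * rho)
        # y update: prox of (lam1 / rho) * ||.||_1 (soft threshold)
        v = x + u1
        y = np.sign(v) * np.maximum(np.abs(v) - lam1 / rho, 0.0)
        # z update: prox of (lam2 / rho) * ||.||_2 (block soft threshold)
        w = x + u2
        z = (1.0 - (lam2 / rho) / max(np.linalg.norm(w), lam2 / rho)) * w
        # dual updates for the constraints x - y = 0 and x - z = 0
        u1 = u1 + x - y
        u2 = u2 + x - z
    return x
```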

Results

(Results figure omitted.)

Code

The code is available (Including validation by CVX) at my StackExchange Mathematics Q2595199 GitHub Repository.


Edit: Closed-form solution for prox of L1 + L2 norm

By first-order optimality conditions,

$$ \begin{split} p = \operatorname{prox}_{\lambda_1\|.\|_1 + \lambda_2\|.\|_2}(a) &\iff a - p \in \partial (\lambda_1\|.\|_1 + \lambda_2\|.\|_2)(p)\\ &\iff a - p = \lambda_1 u + \lambda_2 v,\;\text{for some }(u,v) \in \partial \|.\|_1(p) \times \partial \|.\|_2(p), \end{split} $$

Now, it's a classical computation that for any norm $\|.\|$, one has $$\partial \|.\|(p) = \{z | \|z\|_* \le 1,\; p^Tz = \|p\|\}, $$ where $\|z\|_* := \max_{w, \|w\| \le 1} z^Tw$ defines the dual norm. In particular, one has

$$ \partial \|.\|_2(p) = \begin{cases}\{z | \|z\|_2 \le 1\},&\mbox{ if }p = 0,\\ p/\|p\|_2,&\mbox {else}, \end{cases} $$

and by separability, $\partial \|.\|_1(p) = \times_{i=1}^n \partial |.|(p_i)$, with $$ \partial |.|(p_i) = \begin{cases}[-1,1],&\mbox{ if }p_i = 0,\\ \operatorname{sign}(p_i),&\mbox {else}. \end{cases} $$

The rest of the computation is basic algebra and can be read off page 6 of the paper A Sparse Group Lasso:

$$ p = (1 - \lambda_2 / \|\operatorname{ST}_{\lambda_1}(a)\|_2)_+\operatorname{ST}_{\lambda_1}(a) = \operatorname{prox}_{\lambda_2 \|.\|_2}\left(\operatorname{prox}_{\lambda _1\|.\|_1}(a)\right), $$

where ST is the element-wise soft-thresholding operator.
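A quick numerical sanity check of this formula (using CVXPY here as a Python stand-in for the CVX validation mentioned earlier; the random data and tolerance are arbitrary):

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
a = np.random.randn(10)
lam1, lam2 = 0.3, 0.5

# Reference: solve the prox problem directly as a convex program
x = cp.Variable(a.shape[0])
objective = cp.Minimize(0.5 * cp.sum_squares(x - a) + lam1 * cp.norm1(x) + lam2 * cp.norm(x, 2))
cp.Problem(objective).solve()

# Closed form: prox of the L2 norm applied to the soft-thresholded vector
st = np.sign(a) * np.maximum(np.abs(a) - lam1, 0.0)
p = (1.0 - lam2 / max(np.linalg.norm(st), lam2)) * st

print(np.allclose(x.value, p, atol=1e-4))   # expected: True
```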

Bonus

OK, now if you're used to playing with proximal operators you should fall off your chair at this point! The prox of the sum of these two norms is just the composition of the respective proximal operators, in a particular order (the prox of the L2 norm is applied last). Behold, the following lemma gives a sufficient condition for such a phenomenon to occur.

Lemma [Theorem 1 of the paper On Decomposing the Proximal Map]. Let $f$ and $g$ be convex l.s.c. functions on a Hilbert space $\mathcal H$. A sufficient condition for (A) $\operatorname{prox}_{f + g} = \operatorname{prox}_f \circ \operatorname{prox}_g$ is that (B) $\partial g(p) \subseteq \partial g(\operatorname{prox}_f(p))\; \forall p \in \mathcal H$.

The OP's problem is then a special case with $f = \lambda_2\|.\|_2$ and $g = \lambda_1\|.\|_1$.
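Indeed, condition (B) is easy to check here: $\operatorname{prox}_{\lambda_2 \|.\|_2}(p) = \left(1 - \lambda_2 / \max\left(\|p\|_2, \lambda_2\right)\right) p$ only scales $p$ toward the origin, so each component either keeps its sign or becomes $0$, and in both cases

$$ \partial |.|(p_i) \subseteq \partial |.|\left(\left(\operatorname{prox}_{\lambda_2 \|.\|_2}(p)\right)_i\right), $$

hence $\partial g(p) \subseteq \partial g\left(\operatorname{prox}_f(p)\right)$ for all $p \in \mathcal{H}$.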