Why does a small L1 norm mean sparsity?

It took me an hour yesterday to finally understand this, and I wrote a very detailed blog post to explain it.

https://medium.com/@shiyan/l1-norm-regularization-and-sparsity-explained-for-dummies-5b0e4be3938a#.nhy58osj5

I’m posting a simple version here.

Yesterday, when I first thought about this, I used two example vectors: [0.1, 0.1] and [1000, 0]. The first vector is obviously not sparse, yet it has the smaller L1 norm. That is why I was confused: looking at the L1 norm alone doesn't explain sparsity. You have to consider the entire loss function as a whole.
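Here is a quick NumPy sketch of that comparison:

```python
import numpy as np

# The two example vectors from the paragraph above
not_sparse = np.array([0.1, 0.1])    # no zero entries, but tiny values
sparse = np.array([1000.0, 0.0])     # one zero entry, one huge value

# L1 norm = sum of absolute values
print(np.abs(not_sparse).sum())  # 0.2
print(np.abs(sparse).sum())      # 1000.0  -> the sparse vector has the LARGER norm
```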

When you are solving for a large vector x with too little training data, there can be many possible solutions.

Ax = b

Here A is a matrix that contains all the training data. x is the solution vector you are looking for. b is the label vector.

When there is not enough data and your model has many parameters, the matrix A is not "tall" enough and x is very long. So the system above looks like this:

[Figure: a short, wide matrix A multiplied by a long vector x, equal to a short vector b]
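To see what "many solutions" means concretely, here is a minimal NumPy sketch with a made-up 2 x 4 system; the numbers are arbitrary, the point is only that two different vectors satisfy the same equations:

```python
import numpy as np

# A made-up underdetermined system: 2 equations, 4 unknowns ("short and wide" A)
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 2.0]])
b = np.array([4.0, 6.0])

# One particular solution (the minimum-L2-norm one)
x1, *_ = np.linalg.lstsq(A, b, rcond=None)

# Adding any null-space vector of A gives another, different solution
_, _, Vt = np.linalg.svd(A)
x2 = x1 + 5.0 * Vt[-1]           # the last rows of Vt span the null space of A

print(np.allclose(A @ x1, b))    # True
print(np.allclose(A @ x2, b))    # True -> infinitely many solutions
```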

Let's use a simple, concrete example. Suppose we want to find a line that fits a set of points in 2D space. We all know that you need at least two points to fix a line. But what if the training data contains only one point? Then there are infinitely many solutions: every line that passes through the point is a solution. Suppose the point is [10, 5], and a line is defined by the function y = a * x + b. Then the problem is finding a solution to this equation:

5 = 10 * a + b

Since b = 5 - 10 * a, every point on the line b = 5 - 10 * a is a solution:

[Figure: the solution line b = 5 - 10 * a plotted in the (a, b) plane]
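A quick check: any (a, b) on that line sends the fitted line through the training point (the values of a below are just sample choices):

```python
# Every (a, b) with b = 5 - 10 * a puts the line y = a * x + b through (10, 5)
x0, y0 = 10.0, 5.0

for a in [0.0, 0.5, 1.0, -2.0]:
    b = 5.0 - 10.0 * a
    print(a, b, a * x0 + b)   # the last value is always 5.0
```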

But how do we find the sparse one using the L1 norm?

The L1 norm is defined as the sum of the absolute values of all of a vector's components. For example, if a vector is [x, y], its L1 norm is |x| + |y|.
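In code, that definition is one line (a small sketch):

```python
def l1_norm(v):
    """Sum of the absolute values of all components of v."""
    return sum(abs(component) for component in v)

print(l1_norm([0.1, 0.1]))   # 0.2   (small norm, but not sparse)
print(l1_norm([1000, 0]))    # 1000  (sparse, but large norm)
```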

Now if we draw all the points whose L1 norm equals a constant c, those points form a shape (in red) like this:

[Figure: the set of points with |x| + |y| = c, a diamond (tilted square) centered at the origin]

This shape is a tilted square. In three dimensions it would be an octahedron (a cross-polytope in higher dimensions). Notice that not all points on this red shape are sparse. Only at the tips are the points sparse, that is, either the x or the y component is zero. Now, the way to find a sparse solution is to grow this red shape from the origin, by letting c increase, until it "touches" the blue solution line. The intuition is that the touch point is most likely at a tip of the shape. Since a tip is a sparse point, the solution defined by the touch point is also a sparse solution.

[Figure: the red diamond grown from the origin until it touches the blue solution line at the tip (0.5, 0)]

As an example, in this graph the red shape grows three times until it touches the blue line b = 5 - 10 * a. The touch point, as you can see, is at a tip of the red shape. The touch point [0.5, 0] is a sparse vector. Therefore we say that, by finding the solution point with the smallest L1 norm (0.5) among all possible solutions (points on the blue line), we find a sparse solution [0.5, 0] to our problem. At the touch point, the constant c is the smallest L1 norm you can find among all possible solutions.
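A minimal sketch of this "grow until it touches" idea: scan points on the solution line and keep the one with the smallest L1 norm, a brute-force stand-in for the geometric argument:

```python
import numpy as np

# Scan the solution line b = 5 - 10 * a and pick the point with the smallest L1 norm
a = np.linspace(-2.0, 2.0, 400001)   # step 1e-5, so a = 0.5 lies exactly on the grid
b = 5.0 - 10.0 * a
l1 = np.abs(a) + np.abs(b)

i = np.argmin(l1)
print(a[i], b[i], l1[i])             # ~ 0.5, 0.0, 0.5 -> the tip of the diamond
```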

The intuition behind using the L1 norm is that the shape formed by all points whose L1 norm equals a constant c has many tips (spikes) that happen to be sparse (they lie on one of the axes of the coordinate system). Now we grow this shape until it touches the set of solutions to our problem (usually a surface or a cross-section in high-dimensional space). The probability that the touch point of the two shapes lands on one of the "tips" or "spikes" of the L1 norm shape is very high. That is why you put the L1 norm into your loss function: the optimization keeps looking for a solution with a smaller c, which sits at a "sparse" tip of the L1 shape. (So in the real loss-function case, you are essentially shrinking the red shape to find the touch point, not growing it from the origin.)
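As a rough illustration of this loss-function view, here is a sketch using scikit-learn's Lasso (L1 penalty) versus Ridge (L2 penalty) on made-up, underdetermined data; the exact nonzero counts depend on the random data and the penalty strength alpha, but the qualitative contrast is the point:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge  # assumes scikit-learn is installed

rng = np.random.default_rng(0)

# Underdetermined setting: 50 parameters, only 20 training examples
n_samples, n_features = 20, 50
A = rng.normal(size=(n_samples, n_features))
true_x = np.zeros(n_features)
true_x[:3] = [2.0, -3.0, 1.5]            # only 3 truly useful parameters
b = A @ true_x

# L1 penalty in the loss -> most learned coefficients end up exactly zero
lasso = Lasso(alpha=0.1).fit(A, b)
print("nonzeros with L1 penalty:", int(np.sum(lasso.coef_ != 0)))

# L2 penalty for comparison -> coefficients shrink, but almost none become zero
ridge = Ridge(alpha=0.1).fit(A, b)
print("nonzeros with L2 penalty:", int(np.sum(ridge.coef_ != 0)))
```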

Does the L1 norm always touch the solution set at a tip and hand us a sparse solution? Not necessarily. Suppose we still want to fit a line to 2D points, but this time the only training data is the point [1, 1000]. In this case, the solution line b = 1000 - a is parallel to one of the edges of the L1 norm shape:

[Figure: the solution line b = 1000 - a running parallel to one edge of the red diamond]

Eventually they touch along an edge, not at a tip. Not only is there no unique solution this time, most of the regularized solutions are still not sparse (all except the two endpoints at the tips).
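A quick numeric check of that tie (sample values of a along the edge):

```python
# Training point (1, 1000): every (a, b) with a + b = 1000 fits it exactly.
# Along the edge 0 <= a <= 1000 the L1 norm is identical, so there is no unique minimizer.
for a in [0.0, 1.0, 250.0, 999.0, 1000.0]:
    b = 1000.0 - a
    print(a, b, abs(a) + abs(b))   # always 1000.0; only (0, 1000) and (1000, 0) are sparse
```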

But again, the probability of touching a tip is very high. I suspect this is even more true for high-dimensional, real-world problems: when your coordinate system has more axes, the L1 norm shape has more spikes or tips. It must look like a cactus or a hedgehog! I can't quite imagine it.


But is the L1 norm the best norm for finding sparse solutions? Well, it turns out that the Lp norm with 0 <= p < 1 gives the best results. This can be explained by looking at the shapes of the different norms:

[Figure: unit balls of the Lp norm for several values of p; for p < 1 the sides cave inward and the tips sharpen, while for p = 2 the shape is a smooth ball]

As you can see, when p < 1 the shape is more "scary", with sharper, more protruding spikes, whereas when p = 2 the shape becomes a smooth, non-threatening ball. Then why not use p < 1? Because for p < 1 the penalty is no longer convex, which makes the optimization much harder.
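A small comparison shows the same effect numerically, using the penalty "sum of |x_i|^p" on two vectors that solve the same constraint x + y = 1:

```python
import numpy as np

# Two solutions of x + y = 1: a sparse one and a dense one
sparse = np.array([1.0, 0.0])
dense = np.array([0.5, 0.5])

# Penalty: sum of |x_i| ** p
for p in [0.5, 1.0, 2.0]:
    print(p, np.sum(np.abs(sparse) ** p), np.sum(np.abs(dense) ** p))

# p = 0.5 -> the sparse vector is cheaper (1.0 < ~1.414), so it gets picked
# p = 1.0 -> tie (1.0 == 1.0): exactly the "parallel edge" situation above
# p = 2.0 -> the dense vector is cheaper (0.5 < 1.0), so L2 prefers spreading weight out
```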


Perhaps this is discussing a situation where the possible parameter values form a discrete set. If every nonzero parameter has absolute value at least $\epsilon$, the number of nonzero parameters is at most $1/\epsilon$ times the $\ell^1$ norm of the parameter vector.
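Written out, that bound is a one-line calculation:

```latex
\|x\|_1 \;=\; \sum_{i:\,x_i \neq 0} |x_i| \;\ge\; \epsilon \cdot \#\{i : x_i \neq 0\}
\quad\Longrightarrow\quad
\#\{i : x_i \neq 0\} \;\le\; \frac{\|x\|_1}{\epsilon}.
```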