Gradient descent with constraints

There's no need for penalty methods in this case. Compute the gradient, $g_k=\nabla_xp(x)$, project it onto the tangent plane, $h_k=g_k-(g_k\cdot x_k)x_k$, and normalize it, $n_k=h_k/|h_k|$. Now you can use $x_{k+1}=x_k\cos\phi_k + n_k\sin\phi_k $ and perform a one-dimensional search for $\phi_k$, just like in an unconstrained gradient search, and it stays on the sphere and locally follows the direction of maximal change in the standard metric on the sphere.

By the way, this can be generalized to the case where you're optimizing a set of $n$ vectors under the constraint that they're orthonormal. Then you compute all the gradients, project the resulting search vector onto the tangent surface by orthogonalizing all the gradients to all the vectors, and then diagonalize the matrix of scalar products between pairs of the gradients to find a coordinate system in which the gradients pair up with the vectors to form $n$ hyperplanes in which you can rotate while exactly satisfying the constraints and still travelling in the direction of maximal change to first order; the angles by which you rotate in the hyperplanes are multiples of a single parameter (with the multipliers determined by the eigenvalues), so this is again reduced to a one-dimensional search.

[Edit:] (In response to the suggestion in the comment to use $x_{k+1}=\frac{x_k-\alpha_kg_k}{\|x_k-\alpha_kg_k\|}$ instead)

This is slightly disadvantageous compared to the great-circle solution. You're effectively searching on the same great circle, except in this approach you can only generate one half of it. If the optimum lies in the other half, then the optimal value of your parameter $\alpha_k$ will be $\pm\infty$, whereas in my formulation the compact search space $\phi_k\in[0,2\pi]$ maps to the entire great circle. Also, this formulation doesn't generalize to the case of $n$ orthonormal vectors, since you'd have to perform an orthonormalization at each step.

How to determine $\alpha_k$ is not a (new) problem; in every gradient search you need to have some way to determine the optimum in a one-dimensional search.


The sphere is a particular example of a (very nice) Riemannian manifold.

Most classical nonlinear optimization methods designed for unconstrained optimization of smooth functions (such as gradient descent which you mentioned, nonlinear conjugate gradients, BFGS, Newton, trust-regions, etc.) work just as well when the search space is a Riemannian manifold (a smooth manifold with a metric) rather than (classically) a Euclidean space.

The branch of optimization concerned with that topic is called Riemannian optimization or optimization on manifolds. There is a great reference book on the topic that you can access online for free:

Optimization algorithms on matrix manifolds, P.-A. Absil, R. Mahony, R. Sepulchre, Princeton university press, 2008.

https://press.princeton.edu/absil

Some of the theory presented in that book (and some more) is implemented in a Matlab toolbox called Manopt (which I develop with a number of collaborators), also available freely online:

http://www.manopt.org

The tutorial happens to start with an example on the sphere, which could be a handy starting point for the question raised by the OP (assuming Matlab is an option).

EDIT:

I wrote an introduction to optimization on manifolds, available here:

http://www.nicolasboumal.net/book


What we usually do in theory (not in practice) is that to ensure convergence of such algorithms, for a constraint $g : \mathbb R^n \to \mathbb R$ and $f : \mathbb R^n \to \mathbb R$ that we wish to minimize under the condition $g(x) = 0$, we define $$ \overline f : \mathbb R^n \to \overline{\mathbb R} \cup \{ + \infty\}, \qquad \overline f(x) = \begin{cases} + \infty & \text{ if } g(x) = 0 \\ f(x) & \text{ if } g(x) \neq 0. \\ \end{cases} $$ and we minimize $\overline f$ under no constraints, and clearly $f$ has the same minimum as $\overline f$. Now what you want to do in your numerical methods is defining a similar $\overline f$ by giving a very big upper bound on the values of $f$, to make sure your algorithm doesn't go in those directions. You might lose differentiability with this method, which can be quite annoying since you are using differentiation in your steepest descent method.

What you might want to try is using something like a mollifier : the idea is very simple. Instead of making a huge gap between the surface of the sphere and the outside of the surface (I mean outside by "not on the surface, which can mean inside or outside the ball... anyway), you make the gap smooth by using a bump function (see http://en.wikipedia.org/wiki/Bump_function for images) which is $0$ on the sphere and has a very high value when you're at a distance $\delta$ from the sphere. If you call this bump function $B(x)$ that has the property that $B(x) = 0$ whenever $\| x \| = 1$, then define $$ \overline f(x) = f(x) + B(x). $$ Since $B = 0$ on the sphere, $\overline f$ and $f$ will have the same minimum with respect to your constraint. What you're hoping is that the minimum you will find will not be outside your constraint. If that happens, make $\delta$ smaller. This technique might not work if your function behaves badly close to the frontier of the sphere (i.e. goes to very low values when close to the sphere in a weird manner even though there is a minimum on the sphere). If it has nice behavior though I don't expect any problem.

One example of such a bump function would be to define $$ B(x) = \begin{cases} \sin^2 \left( \frac{\| x \| - 1}{\delta} \right) & \text{ if } \left| \| x \| - 1 \right| < \delta \frac {\pi}2 \\ 1 & \text{ if not. } \\ \end{cases} $$ Make $\delta$ small enough and multiply the bump function by any big factor you need to make it practical. I don't know if it's the best idea, but its one I have. Note that my "bump function" is not a bump function in the mathematical sense (it's not compactly supported), but it's just a practical name in this post.

EDIT : Better choices of $B$ are to be expected, as one of my comments mention $B(x) = c(\|x\|-1)^2$, with $c$ large enough. I guess J.M.'s suggestion (reading non-linear programming references) might give further indications.

Hope that helps,