Intuition Behind Accelerated First Order Methods

I asked myself this question while trying to understand accelerated methods, asked around, and a professor pointed me to this paper to help build some intuition: http://statweb.stanford.edu/~candes/papers/NIPS2014.pdf

To summarize: Su et al. take Nesterov's accelerated gradient method and let the step size tend to zero to derive the following ODE $$\ddot{X} + \frac{3}{t} \dot{X} + \nabla f(X) = 0$$ with initial conditions $X(0) = x_0$ and $\dot{X}(0) = 0$. By analyzing this ODE, we can get a better idea of what Nesterov's accelerated gradient method is doing. Hope this resource is helpful!
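To make the correspondence concrete, here is a minimal sketch (not from the paper) that runs Nesterov's scheme in the form Su et al. analyze, $x_k = y_{k-1} - s\nabla f(y_{k-1})$, $y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1})$, on a hypothetical quadratic and compares it against a crude numerical integration of the ODE under their time identification $t \approx k\sqrt{s}$. The quadratic, the step size, and the integrator are all my own illustrative choices:

```python
import numpy as np

# Hypothetical quadratic objective f(x) = 0.5 * x^T A x (my choice for
# illustration; Su et al.'s result holds for general smooth convex f).
A = np.diag([1.0, 10.0])

def grad_f(x):
    return A @ x

def nesterov(x0, s, n):
    """Nesterov's scheme in the form Su et al. analyze:
       x_k = y_{k-1} - s * grad f(y_{k-1})
       y_k = x_k + (k-1)/(k+2) * (x_k - x_{k-1}),  with y_0 = x_0."""
    x_prev, y = x0.copy(), x0.copy()
    for k in range(1, n + 1):
        x = y - s * grad_f(y)
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_prev = x
    return x_prev

def ode_final(x0, dt, T):
    """Semi-implicit Euler integration of X'' + (3/t) X' + grad f(X) = 0
       with X(0) = x0, X'(0) = 0, started at t = dt to dodge the 3/t
       singularity at t = 0."""
    X, V, t = x0.copy(), np.zeros_like(x0), dt
    while t < T:
        V = V + dt * (-(3.0 / t) * V - grad_f(X))
        X = X + dt * V
        t += dt
    return X

x0 = np.array([1.0, 1.0])
s, n = 1e-4, 5000
# Su et al. identify iterate k with ODE time t ~ k * sqrt(s), so the two
# end states below should land close to each other (and near x* = 0).
print("Nesterov x_n :", nesterov(x0, s, n))
print("ODE X(T)     :", ode_final(x0, dt=np.sqrt(s), T=n * np.sqrt(s)))
```

Shrinking $s$ further makes the two end states agree more closely, which is exactly the sense in which the ODE is the small-step limit of the method.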


I think the best intuition so far is the geometric interpretation of the algorithm by Sébastien Bubeck. It is based on the idea that, from the information available at the current point $x_k$, we know the optimum $x^*$ resides in the intersection of two balls, both of which can be identified from $x_k$. To see how this works, see Convex Optimization: Algorithms and Complexity. He also gives a simpler description of the algorithm on his blog, I'm a bandit.
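To make the two-ball picture slightly more concrete, here is a sketch of where one of the balls comes from, under the standard assumption that $f$ is $\alpha$-strongly convex. Strong convexity at the current point $x$ gives $$f(x^*) \ge f(x) + \langle \nabla f(x), x^* - x \rangle + \frac{\alpha}{2}\|x^* - x\|^2,$$ and completing the square shows that $x^*$ lies in a ball centered at a rescaled gradient step: $$\left\|x^* - \Big(x - \frac{\nabla f(x)}{\alpha}\Big)\right\|^2 \le \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\big(f(x) - f(x^*)\big).$$ The second ball is carried over from earlier iterates, and, as I understand Bubeck's argument, the crux is that the intersection of the two balls fits inside a new ball whose squared radius shrinks by a factor of roughly $1 - 1/\sqrt{\kappa}$ per iteration (with $\kappa$ the condition number), which is exactly the accelerated rate.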