Should we do learning rate decay for the Adam optimizer?

It depends. Adam updates each parameter with its own individual learning rate, so every parameter in the network has a specific learning rate associated with it.

But each per-parameter learning rate is computed using lambda (the initial learning rate) as an upper limit. This means that every individual learning rate can vary from 0 (no update) to lambda (maximum update).
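Concretely, in the notation of the Adam paper (which calls the lambda above the step size), the update applied to each parameter at step $t$ is

$$\Delta\theta_t = -\,\lambda\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of the first and second moments of that parameter's gradient; the paper notes that the magnitude of this step is approximately bounded by the step size.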

It's true that the learning rates adapt themselves during training, but if you want to be sure that every update step stays below lambda, you can then lower lambda over time, for example with exponential decay. This can help to reduce the loss during the last stages of training, when the loss computed with the previous value of lambda has stopped decreasing.
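As a rough sketch of that idea in PyTorch (the model, data, and decay factor below are placeholders for illustration, not a recommendation): the lr passed to Adam plays the role of lambda, and ExponentialLR multiplies it by a fixed factor after every epoch.

```python
import torch

model = torch.nn.Linear(10, 1)                                # toy model, just for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # lr is the "lambda" above
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(20):
    x, y = torch.randn(32, 10), torch.randn(32, 1)            # dummy batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                          # lambda <- 0.95 * lambda after each epoch
```

Whether to decay per epoch or per step, and how fast, is itself something to tune on your problem.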


In my experience it is usually not necessary to do learning rate decay with the Adam optimizer.

The theory is that Adam already handles learning rate adaptation on its own (see the reference):

"We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation."

As with any deep learning problem, YMMV: one size does not fit all, so you should try different approaches and see what works for you.