For a ConvNet I trained recently, these are the learning curves when using the Adam optimizer with initial learning rate = 0.01:
When using traditional SGD with initial learning rate = 0.01, momentum = 0.9, and a learning rate decayed every 3 epochs by a factor of 0.96, the learning curves become:
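For reference, here is a minimal sketch of that setup in plain Python (the function names are mine, not from the actual training code): a staircase decay schedule plus the classic momentum update for a single parameter.

```python
def decayed_lr(initial_lr, epoch, decay_epochs=3, decay_rate=0.96):
    """Staircase schedule: multiply the learning rate by `decay_rate`
    once every `decay_epochs` epochs."""
    return initial_lr * decay_rate ** (epoch // decay_epochs)

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """Classic SGD-with-momentum update for one parameter.
    The velocity accumulates past gradients, damped by `momentum`."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# After 30 epochs the rate has been decayed 10 times:
print(decayed_lr(0.01, 30))  # 0.01 * 0.96**10, roughly 0.0066
```

Frameworks ship this schedule directly (e.g. step/exponential decay schedulers), but the arithmetic is just the two lines above.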
I hope you see the drastic difference. With momentum, we reached a 10% error rate after 5 epochs, while with Adam we were still at ~30% error after 20 epochs.
Now it might happen that Adam would work better if we added fancy stuff like Batch Norm and the like to the network, which I didn't try. However, with everything else being equal, it feels to me that Adam was a bit aggressive in decreasing the effective learning rate, which makes learning progress slow after a while.
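To make the "Adam regulates the step size" point concrete, here is a minimal single-parameter Adam update (a sketch of the standard algorithm, not the exact code behind these curves). The effective step is lr * m_hat / (sqrt(v_hat) + eps), so it stays on the order of lr no matter how large or small the raw gradient is.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter.
    t is the 1-based timestep, needed for bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# On the very first step, m_hat / sqrt(v_hat) = grad / |grad| = ±1,
# so the step is ~lr whether the gradient is 100 or 0.001:
w, m, v = adam_step(0.0, 100.0, 0.0, 0.0, t=1, lr=0.01)
print(w)  # approximately -0.01
```

This normalization is exactly why the choice of lr matters so much for Adam: it directly sets the scale of every step.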
Since the learning rate is more strongly regulated in Adam, perhaps we can be more lax in setting the initial learning rate? These are the learning curves for Adam with initial learning rate = 0.1:
It went wild eventually.
But with initial learning rate = 0.001, Adam gives this:
It is much better now.
Over the years, momentum with a decaying learning rate has been my first choice for tuning the learning rate. I sometimes use Adagrad/RMSProp/Adam for cross-checking, but the best results are usually found with momentum, often in fewer training epochs.
The take-away message is that you should really tune your learning rate hard. It is still one of the most important hyper-parameters. Although methods like Adam/Adagrad might give the impression that tuning the learning rate is easy, in fact it is very problem-dependent. Momentum has more knobs to tune (initial rate, momentum, decay interval, decay factor), but used wisely, it is very flexible and powerful. Often you will end up in the same ballpark with any of those optimizers.