The optimizers in PyTorch


*My post explains Batch, Mini-Batch and Stochastic Gradient Descent in PyTorch.

An optimizer is a gradient descent algorithm that can find the minimum (or maximum) of a function by using gradients (slopes) to update (adjust) a model’s parameters (weight and bias), minimizing the mean (average) of the sum of the losses (differences) between the model’s predictions and the true values (train data) during training.

CGD(Classic Gradient Descent)(1847) explained at (1).
Momentum(1964) explained at (2).
Nesterov’s Momentum(1983) explained at (3).
AdaGrad(2011).
RMSprop(2012) explained at (4).
AdaDelta(2012).
Adam(2014) explained at (5).
AdaMax(2015).
Nadam(2016).
AMSGrad(2018).
AdaBound(2019) explained at (6).
AMSBound(2019).
AdamW(2019).

(1) CGD(Classic Gradient Descent)(1847):

is the optimizer that does basic gradient descent with no special features. *The learning rate is fixed.
is SGD() in PyTorch. *SGD() in PyTorch is Classic Gradient Descent (CGD) rather than true Stochastic Gradient Descent (SGD); the stochasticity comes from how the data is fed (e.g. in mini-batches or single samples), not from the optimizer itself.
can also be called Vanilla Gradient Descent(VGD).
‘s pros:

It’s simple.
Other optimizers are based on it.

‘s cons:

It has no special features.
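
A minimal sketch of CGD with SGD() in PyTorch (the model, data, and hyperparameter values below are just for illustration); with only lr set, the learning rate stays fixed for the whole run:

```python
import torch
from torch import nn

# Tiny illustrative model and dummy data.
model = nn.Linear(in_features=4, out_features=1)
x = torch.randn(8, 4)  # 8 samples, 4 features
y = torch.randn(8, 1)  # 8 targets

loss_fn = nn.MSELoss()  # mean of the squared differences

# SGD() with only lr set behaves as classic (vanilla) gradient descent.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()        # clear old gradients
    loss = loss_fn(model(x), y)  # loss between predictions and true values
    loss.backward()              # compute gradients
    optimizer.step()             # update the weight and bias
```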

(2) Momentum(1964) (Add-on):

is an add-on to other optimizers that accelerates (speeds up) convergence by mitigating fluctuation, considering the past and current gradients and giving more importance to newer gradients with EWA.
*Memos:

EWA (Exponentially Weighted Average) is an algorithm that smooths a trend (mitigates its fluctuation) by considering the past and current values and giving more importance to newer values, as in the sketch below.
EWA is also called EWMA (Exponentially Weighted Moving Average).
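
A minimal sketch of EWA itself (the beta value and the sample trend are arbitrary, just for illustration):

```python
# EWA recurrence: v_t = beta * v_(t-1) + (1 - beta) * x_t
# A larger beta keeps more of the past; newer values still get the most weight.
values = [3.0, 5.0, 4.0, 8.0, 6.0]  # an arbitrary fluctuating trend
beta = 0.9
v = 0.0
smoothed = []
for x in values:
    v = beta * v + (1 - beta) * x
    smoothed.append(v)
print(smoothed)  # each entry mixes all past values, weighted toward recent ones
```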

is added to SGD() and Adam() in PyTorch.

‘s pros:

It uses EWA.
It helps escape local minima and saddle points.
It helps produce a more accurate model.
It reduces fluctuation.
It reduces overshooting.
It accelerates convergence.

‘s cons:
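
A minimal sketch of Momentum in PyTorch (the model and hyperparameter values are just for illustration): it is enabled through the momentum argument of SGD(), and Adam() includes a momentum-like term via its first beta:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # tiny illustrative model

# SGD() with momentum: past update directions are kept with an EWA-style decay
# (momentum=0.9 keeps 90% of the previous velocity at each step).
optimizer_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam() already contains a momentum-like term; betas[0] plays that role.
optimizer_adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
```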

(3) Nesterov’s Momentum(1983) (Add-on):

is Momentum(1964) with an additional step that calculates the gradient at a position slightly ahead (a look-ahead position), which accelerates convergence more than Momentum(1964).
is also called Nesterov Accelerated Gradient(NAG).
is added to SGD() and NAdam() in PyTorch.
‘s pros:

It uses EWA.
It escapes local minima and saddle points more easily than Momentum(1964).
It helps produce a more accurate model than Momentum(1964).
It reduces fluctuation more than Momentum(1964).
It reduces overshooting more than Momentum(1964).
It accelerates convergence more than Momentum(1964).

‘s cons:
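
A minimal sketch of Nesterov’s Momentum in PyTorch (the model and hyperparameter values are illustrative): SGD() takes a nesterov flag, which requires momentum > 0, and NAdam() is Adam() combined with Nesterov’s momentum:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # tiny illustrative model

# Nesterov's momentum in SGD(): nesterov=True requires momentum > 0 (and dampening=0).
optimizer_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# NAdam() combines Adam() with Nesterov's momentum.
optimizer_nadam = torch.optim.NAdam(model.parameters(), lr=0.002)
```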

(4) RMSProp(2012):

is the optimizer that does gradient descent by automatically adapting the learning rate to each parameter, considering the past and current gradients with EWA and giving even more importance to newer gradients than Momentum(1964) does, to accelerate convergence by mitigating fluctuation. *The learning rate is not fixed.
‘s learning rate decreases as it approaches a global minimum, so the optimal solution can be found more precisely.
‘s EWA is a little bit different from Momentum(1964)’s.
is the improved version of AdaGrad(2011), which does gradient descent by adapting the learning rate to each parameter, considering the past and current gradients, to accelerate convergence by mitigating fluctuation. *The learning rate is not fixed.
‘s pros:

It automatically adapts the learning rate to each parameter.
It uses EWA.

‘s cons:

is RMSprop() in PyTorch.
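
A minimal sketch of RMSProp in PyTorch (the model and values are illustrative); alpha is the EWA smoothing constant for the squared gradients, and AdaGrad(2011) is shown alongside for comparison:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # tiny illustrative model

# RMSprop(): the per-parameter step size shrinks as the EWA of squared
# gradients grows; alpha is the EWA smoothing constant.
optimizer_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)

# AdaGrad(2011), which RMSProp improves on, accumulates all past squared
# gradients instead of using EWA, so its learning rate only ever shrinks.
optimizer_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
```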

(5) Adam(Adaptive Moment Estimation)(2014):

is the combination of Momentum(1964) and RMSProp(2012).
uses Momentum(1964)’s EWA instead of RMSProp(2012)’s.
‘s pros:

It automatically adapts the learning rate to each parameter.
It uses EWA.

‘s cons:

is Adam() in PyTorch.
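
A minimal sketch of Adam in PyTorch (the model and values are illustrative); betas[0] is the EWA decay for the gradients (the Momentum part) and betas[1] is the EWA decay for the squared gradients (the RMSProp part):

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # tiny illustrative model

# Adam(): betas=(0.9, 0.999) are the EWA decay rates for the first moment
# (gradients, Momentum-like) and the second moment (squared gradients, RMSProp-like).
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
```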

(6) AdaBound(2019):

is Adam(2014) with dynamic bounds (a dynamic upper and lower limit on the learning rate) which stabilize the convergence and accelerate it more than Adam(2014).
‘s pros:

It automatically adapts the learning rate to each parameter.
It uses EWA.
It uses dynamic bounds (a dynamic upper and lower limit on the learning rate).

‘s cons:

is not in PyTorch yet, so you can use AdaBound() from the third-party adabound package instead.
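
A minimal sketch, assuming the third-party adabound package is installed (pip install adabound); final_lr is the value the dynamic bounds converge to:

```python
import torch
from torch import nn
import adabound  # third-party package, not part of PyTorch (assumed installed)

model = nn.Linear(4, 1)  # tiny illustrative model

# AdaBound(): behaves like Adam() early in training, then the dynamic bounds
# tighten around final_lr so it ends up behaving like SGD with that rate.
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
```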
