An overview of gradient descent optimization algorithms

Gradient descent variants

Batch gradient descent

for i in range(nb_epochs):
    # One update per epoch: the gradient is computed over the entire training set.
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad

Stochastic gradient descent

import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)  # reshuffle the training examples every epoch
    for example in data:
        # One update per training example.
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

Mini-batch gradient descent

import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)
    # get_batches is a placeholder that yields mini-batches of 50 examples each.
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad

Challenges

Gradient descent optimization algorithms

Momentum
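Momentum dampens SGD's oscillations by adding a fraction gamma of the previous update vector to the current one. A minimal sketch, reusing the placeholders from the snippets above; gamma (the momentum term, commonly around 0.9) is an assumed hyperparameter:

v = 0  # accumulated update vector (velocity)
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    v = gamma * v + learning_rate * params_grad  # v_t = gamma * v_(t-1) + eta * g_t
    params = params - v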

Nesterov accelerated gradient
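Nesterov accelerated gradient evaluates the gradient not at the current parameters but at the approximate future position params - gamma * v, giving the momentum term a corrective look-ahead. A sketch under the same assumptions as the momentum snippet:

v = 0
for i in range(nb_epochs):
    # Look ahead to where the momentum term is about to take the parameters.
    params_grad = evaluate_gradient(loss_function, data, params - gamma * v)
    v = gamma * v + learning_rate * params_grad
    params = params - v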

Adagrad
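Adagrad adapts the learning rate per parameter by dividing it by the square root of the sum of all past squared gradients, so frequently updated parameters receive smaller steps. A minimal sketch; eps is an assumed small smoothing constant (around 1e-8) to avoid division by zero:

import numpy as np

G = 0  # running sum of squared gradients, one entry per parameter
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    G = G + params_grad ** 2
    params = params - learning_rate / (np.sqrt(G) + eps) * params_grad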

Adadelta
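Adadelta replaces Adagrad's ever-growing sum with an exponentially decaying average of squared gradients, and rescales each step by a decaying average of past squared updates, which removes the need for a manually chosen learning rate. A sketch with the decay rate rho (commonly around 0.9) and smoothing constant eps as assumed hyperparameters:

import numpy as np

E_g2, E_dx2 = 0, 0  # decaying averages of squared gradients and squared updates
for i in range(nb_epochs):
    g = evaluate_gradient(loss_function, data, params)
    E_g2 = rho * E_g2 + (1 - rho) * g ** 2
    # The ratio of the two RMS terms takes the place of the learning rate.
    delta = -np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * g
    E_dx2 = rho * E_dx2 + (1 - rho) * delta ** 2
    params = params + delta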

RMSprop
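RMSprop likewise divides the learning rate by the root of an exponentially decaying average of squared gradients; Hinton suggests a decay rate of 0.9 and a learning rate around 0.001. A minimal sketch:

import numpy as np

E_g2 = 0  # decaying average of squared gradients
for i in range(nb_epochs):
    g = evaluate_gradient(loss_function, data, params)
    E_g2 = 0.9 * E_g2 + 0.1 * g ** 2
    params = params - learning_rate / (np.sqrt(E_g2) + eps) * g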

Adam
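Adam keeps decaying averages of both past gradients (first moment m) and past squared gradients (second moment v), corrects their initialisation bias, and scales the step by m / sqrt(v). A sketch with the commonly used defaults beta1 = 0.9, beta2 = 0.999, eps = 1e-8 as assumed hyperparameters:

import numpy as np

m, v = 0, 0
for t in range(1, nb_epochs + 1):
    g = evaluate_gradient(loss_function, data, params)
    m = beta1 * m + (1 - beta1) * g        # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - learning_rate * m_hat / (np.sqrt(v_hat) + eps)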

AdaMax
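AdaMax replaces the L2 norm in Adam's second-moment term with the infinity norm: it tracks u, an exponentially weighted maximum of past gradient magnitudes, which gives a more stable denominator. A sketch under the same assumptions as the Adam snippet:

import numpy as np

m, u = 0, 0
for t in range(1, nb_epochs + 1):
    g = evaluate_gradient(loss_function, data, params)
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))   # infinity-norm based second moment
    params = params - (learning_rate / (1 - beta1 ** t)) * m / u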

Nadam
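Nadam combines Adam with Nesterov momentum by applying the look-ahead directly in the update: the bias-corrected first moment is mixed with the bias-corrected current gradient. A sketch under the same assumptions as the Adam snippet:

import numpy as np

m, v = 0, 0
for t in range(1, nb_epochs + 1):
    g = evaluate_gradient(loss_function, data, params)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov-style mix of the momentum estimate and the current gradient.
    params = params - learning_rate / (np.sqrt(v_hat) + eps) * (
        beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t))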

AMSGrad
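AMSGrad addresses cases where Adam's decaying second-moment average forgets large past gradients too quickly: it keeps the element-wise maximum of all second-moment estimates seen so far and uses that maximum in the denominator (without bias correction in the original formulation). A minimal sketch:

import numpy as np

m, v, v_hat = 0, 0, 0
for t in range(1, nb_epochs + 1):
    g = evaluate_gradient(loss_function, data, params)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)   # never let the denominator shrink
    params = params - learning_rate * m / (np.sqrt(v_hat) + eps)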

Visualization of algorithms

Which optimizer to use?

Parallelizing and distributing SGD

Hogwild!

Downpour SGD

Delay-tolerant Algorithms for SGD

TensorFlow

Elastic Averaging SGD

Additional strategies for optimizing SGD

Shuffling and Curriculum Learning

Batch normalization
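Batch normalization re-normalises a layer's activations to zero mean and unit variance over the current mini-batch, then rescales and shifts them with learned parameters gamma and beta. A minimal sketch of the forward transform for a hypothetical activation array x of shape (batch_size, features):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalise over the mini-batch dimension, then scale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta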

Early stopping
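Early stopping monitors the error on a held-out validation set during training and halts once it stops improving for a given number of checks (the patience). A minimal sketch, assuming a hypothetical validation_error(params) helper:

best_error, patience, wait = float("inf"), 5, 0
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
    error = validation_error(params)   # hypothetical helper
    if error < best_error:
        best_error, wait = error, 0
    else:
        wait += 1
        if wait >= patience:
            break  # validation error has stopped improving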

Gradient noise
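Adding zero-mean Gaussian noise to each gradient, with a variance annealed as sigma_t^2 = eta / (1 + t)^gamma (Neelakantan et al. suggest gamma = 0.55), can help training of very deep networks. A minimal sketch; eta here is the noise scale, not the learning rate:

import numpy as np

for t in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    sigma2 = eta / (1 + t) ** 0.55   # annealed noise variance
    noise = np.random.normal(0.0, np.sqrt(sigma2), size=np.shape(params_grad))
    params = params - learning_rate * (params_grad + noise)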

Conclusion
