Introduction
Optimization algorithms form the backbone of machine learning. Whether training a simple linear regression model or a deep neural network with millions of parameters, the choice of optimization algorithm significantly impacts training speed, convergence quality, and final model performance. Among these algorithms, gradient descent and its variants dominate practical applications.
In 2026, understanding optimization algorithms remains essential for machine learning practitioners. This article provides a comprehensive exploration of gradient descent variants, from basic batch gradient descent to sophisticated adaptive methods like Adam. We examine the mathematical foundations, implementation details, and practical guidance for selecting and tuning optimizers.
Fundamentals of Gradient Descent
The Core Idea
Gradient descent optimizes a differentiable objective function by iteratively moving in the direction of the negative gradient. Given an objective function J(θ) with parameters θ, the update rule is:
θ = θ - η × ∇J(θ)
where η is the learning rate (step size) and ∇J(θ) is the gradient of the objective function with respect to parameters.
The gradient points in the direction of steepest ascent, so moving in the opposite direction decreases the function value. This simple principle underlies virtually all neural network training.
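To make the update rule concrete, here is a minimal sketch on the toy objective J(θ) = θ², whose gradient is 2θ and whose minimum sits at θ = 0 (the function name and objective are illustrative, not from any library):

```python
def gradient_descent_1d(theta0=5.0, lr=0.1, steps=50):
    theta = theta0
    for _ in range(steps):
        grad = 2 * theta      # gradient of J(theta) = theta**2
        theta -= lr * grad    # the update rule: theta = theta - lr * grad
    return theta

print(gradient_descent_1d())  # decays toward 0
```

Each step multiplies θ by (1 - 2η), so with η = 0.1 the parameter shrinks geometrically toward the minimum.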
Learning Rate
The learning rate η is perhaps the most critical hyperparameter:
- Too large: Updates overshoot minima, causing divergence or oscillation
- Too small: Convergence is painfully slow, potentially getting stuck in local minima
- Just right: Efficient convergence to a good solution
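The three regimes are easy to observe on the same toy quadratic J(θ) = θ², where each update multiplies θ by (1 - 2η); the helper below is illustrative only:

```python
def run_gd(lr, theta0=5.0, steps=100):
    """Plain gradient descent on J(theta) = theta**2."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * 2 * theta  # each step multiplies theta by (1 - 2*lr)
    return theta

print(run_gd(0.1))    # just right: |theta| shrinks rapidly toward 0
print(run_gd(1e-4))   # too small: theta barely moves from 5.0
print(run_gd(1.05))   # too large: updates overshoot and |theta| blows up
```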
Variants of Gradient Descent
Three primary variants differ in how much data is used per gradient computation:
Batch Gradient Descent (BGD) computes the gradient using the entire dataset:
θ = θ - η × ∇J(θ; all training samples)
Pros: Stable convergence, smooth gradient estimates
Cons: Slow for large datasets, requires entire dataset in memory
Stochastic Gradient Descent (SGD) computes the gradient using a single sample:
θ = θ - η × ∇J(θ; x_i, y_i)
Pros: Fast iterations, can escape local minima, online learning
Cons: Noisy gradients, requires many iterations
Mini-batch Gradient Descent balances both approaches:
θ = θ - η × ∇J(θ; batch of b samples)
This is the standard approach in deep learning, typically using batch sizes of 32, 64, 128, or 256.
```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Full-dataset gradient of the mean squared error
        gradient = (1/m) * X.T @ (X @ theta - y)
        theta -= lr * gradient
    return theta
```
```python
def stochastic_gradient_descent(X, y, lr=0.01, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            xi = X[i:i+1]
            yi = y[i:i+1]
            # Gradient estimated from a single sample
            gradient = xi.T @ (xi @ theta - yi)
            theta -= lr * gradient
    return theta
```
```python
def mini_batch_gradient_descent(X, y, lr=0.01, epochs=100, batch_size=32):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Shuffle each epoch so batches differ between passes
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, m, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            gradient = (1/len(X_batch)) * X_batch.T @ (X_batch @ theta - y_batch)
            theta -= lr * gradient
    return theta
```
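As a quick sanity check, the mini-batch variant can be run on synthetic least-squares data; the snippet inlines a compact copy of the implementation above so it runs standalone (names like `mini_batch_gd` and the synthetic `true_theta` are illustrative):

```python
import numpy as np

def mini_batch_gd(X, y, lr=0.05, epochs=200, batch_size=32):
    # Same logic as mini_batch_gradient_descent above
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = np.random.permutation(m)
        for i in range(0, m, batch_size):
            Xb, yb = X[idx[i:i+batch_size]], y[idx[i:i+batch_size]]
            theta -= lr * (1 / len(Xb)) * Xb.T @ (Xb @ theta - yb)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=500)

theta = mini_batch_gd(X, y)
print(np.round(theta, 2))  # recovers roughly [1.5, -2.0, 0.5]
```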
Momentum-Based Methods
The Problem with Vanilla Gradient Descent
In directions with small gradients, vanilla gradient descent makes slow progress. Additionally, oscillations occur when gradients change direction rapidly, as in ravines—elongated curved structures common in optimization landscapes.
Momentum
Momentum simulates a ball rolling down a hill, accumulating velocity in consistent directions:
v = γ × v + η × ∇J(θ)
θ = θ - v
where γ (typically 0.9) is the momentum coefficient. Momentum accelerates in consistent directions and dampens oscillations.
```python
def gradient_descent_with_momentum(X, y, lr=0.01, gamma=0.9, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)  # velocity accumulator
    for _ in range(epochs):
        gradient = (1/m) * X.T @ (X @ theta - y)
        v = gamma * v + lr * gradient
        theta -= v
    return theta
```
Nesterov Accelerated Gradient (NAG)
NAG looks ahead before computing the gradient:
v = γ × v + η × ∇J(θ - γ × v)
θ = θ - v
This “look-ahead” mechanism provides better convergence than standard momentum.
```python
def nesterov_accelerated_gradient(X, y, lr=0.01, gamma=0.9, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)
    for _ in range(epochs):
        # Evaluate the gradient at the "look-ahead" position
        theta_ahead = theta - gamma * v
        gradient = (1/m) * X.T @ (X @ theta_ahead - y)
        v = gamma * v + lr * gradient
        theta -= v
    return theta
```
Adaptive Learning Rate Methods
Adaptive methods adjust learning rates per parameter based on gradient history, handling sparse features and varying gradient scales.
Adagrad
Adagrad scales learning rates inversely to the square root of accumulated gradients:
G = G + g²
θ = θ - η/√(G + ε) × g
where g is the gradient and ε prevents division by zero.
```python
def adagrad(X, y, lr=0.1, eps=1e-8, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    G = np.zeros(n)  # per-parameter sum of squared gradients
    for _ in range(epochs):
        gradient = (1/m) * X.T @ (X @ theta - y)
        G += gradient ** 2
        theta -= (lr / np.sqrt(G + eps)) * gradient
    return theta
```
Adagrad adapts well to sparse features, but because G only grows, the effective learning rate decreases monotonically and can eventually stall training.
RMSprop
RMSprop divides by exponential moving average of squared gradients:
E[g²] = γ × E[g²] + (1-γ) × g²
θ = θ - η/√(E[g²] + ε) × g
Because the exponential average forgets old gradients, the effective learning rate no longer shrinks monotonically as in Adagrad, while still adapting to gradient magnitudes.
```python
def rmsprop(X, y, lr=0.01, gamma=0.9, eps=1e-8, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    Eg2 = np.zeros(n)  # exponential moving average of squared gradients
    for _ in range(epochs):
        gradient = (1/m) * X.T @ (X @ theta - y)
        Eg2 = gamma * Eg2 + (1 - gamma) * (gradient ** 2)
        theta -= (lr / np.sqrt(Eg2 + eps)) * gradient
    return theta
```
Adam
Adam (Adaptive Moment Estimation) combines momentum and RMSprop:
m = β₁ × m + (1-β₁) × g    (first moment)
v = β₂ × v + (1-β₂) × g²   (second moment)
m̂ = m/(1-β₁ᵗ)
v̂ = v/(1-β₂ᵗ)
θ = θ - η × m̂/(√v̂ + ε)
```python
def adam(X, y, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, epochs=100):
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    m = np.zeros(n_features)  # first moment (mean of gradients)
    v = np.zeros(n_features)  # second moment (uncentered variance)
    for t in range(1, epochs + 1):
        gradient = (1/n_samples) * X.T @ (X @ theta - y)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * (gradient ** 2)
        # Bias correction compensates for zero initialization of m and v
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= (lr * m_hat) / (np.sqrt(v_hat) + eps)
    return theta
```
AdamW (Adam with Weight Decay)
AdamW decouples weight decay from gradient-based updates:
θ = θ - η × (m̂/(√v̂ + ε) + λ × θ)
This provides better generalization than L2 regularization in Adam.
```python
def adamw(X, y, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
          weight_decay=0.01, epochs=100):
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    m = np.zeros(n_features)
    v = np.zeros(n_features)
    for t in range(1, epochs + 1):
        gradient = (1/n_samples) * X.T @ (X @ theta - y)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * (gradient ** 2)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Weight decay is applied directly to the parameters,
        # not folded into the gradient
        theta -= lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta
```
Learning Rate Scheduling
Why Schedule Learning Rates?
Gradually reducing the learning rate improves convergence:
- Large rates early for fast progress
- Smaller rates later for fine-tuning
Types of Schedules
Step Decay: Reduce learning rate by a factor every N epochs
```python
def step_decay_schedule(initial_lr, drop_rate=0.5, epochs_drop=10):
    def schedule(epoch):
        return initial_lr * (drop_rate ** (epoch // epochs_drop))
    return schedule
```
Exponential Decay: Continuously reduce learning rate exponentially
```python
def exponential_decay(initial_lr, decay_rate=0.95):
    def schedule(epoch):
        return initial_lr * (decay_rate ** epoch)
    return schedule
```
Cosine Annealing: Smooth reduction following cosine curve
```python
def cosine_annealing(initial_lr, T_max, eta_min=0):
    def schedule(epoch):
        return eta_min + (initial_lr - eta_min) * \
               (1 + np.cos(np.pi * epoch / T_max)) / 2
    return schedule
```
Warm-up: Start with a small learning rate and ramp it up linearly; in practice a decay schedule typically takes over after the warm-up period
```python
def warmup_schedule(initial_lr, warmup_epochs, target_lr):
    def schedule(epoch):
        if epoch < warmup_epochs:
            # Linear ramp from initial_lr up to target_lr
            return initial_lr + (target_lr - initial_lr) * epoch / warmup_epochs
        return target_lr
    return schedule
```
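To see what a schedule actually produces, the cosine schedule can be evaluated at a few epochs; the snippet inlines a copy of `cosine_annealing` so it runs standalone:

```python
import numpy as np

def cosine_annealing(initial_lr, T_max, eta_min=0.0):
    # Same form as the schedule defined above
    def schedule(epoch):
        return eta_min + (initial_lr - eta_min) * \
               (1 + np.cos(np.pi * epoch / T_max)) / 2
    return schedule

sched = cosine_annealing(initial_lr=0.1, T_max=100)
print(sched(0))    # 0.1: full learning rate at the start of the cycle
print(sched(50))   # ~0.05: halfway through, since cos(pi/2) = 0
print(sched(100))  # ~0.0: fully annealed at the end of the cycle
```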
Second-Order Methods
Newton’s Method
Newton’s method uses second-order information:
θ = θ - H⁻¹ × ∇J(θ)
where H is the Hessian matrix of second derivatives. Newton's method can converge in far fewer iterations than first-order methods, but computing, storing, and inverting the Hessian is prohibitively expensive for models with many parameters.
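For a quadratic objective the Hessian is constant, so a single Newton step lands exactly on the minimizer. A minimal sketch (the matrix A and vector b below are made up for illustration):

```python
import numpy as np

# J(theta) = 0.5 * theta^T A theta - b^T theta
# gradient = A @ theta - b, Hessian = A (constant for a quadratic)
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])

theta = np.zeros(2)
gradient = A @ theta - b
# Solve H x = gradient rather than forming H^{-1} explicitly
theta = theta - np.linalg.solve(A, gradient)

print(np.allclose(A @ theta, b))  # True: one step reaches the optimum
```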
L-BFGS
Limited-memory Broyden-Fletcher-Goldfarb-Shanno approximates the Hessian using gradient history. Effective for small-to-medium problems.
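In practice L-BFGS is rarely implemented by hand; SciPy exposes it through `scipy.optimize.minimize`. A sketch on the Rosenbrock test function, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a classic curved-valley test problem
# with its global minimum at (1, 1)
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

result = minimize(rosenbrock, x0=np.array([-1.0, 2.0]), method="L-BFGS-B")
print(result.x)  # close to [1.0, 1.0]
```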
Choosing an Optimizer
Practical Guidelines
For most deep learning: Adam or AdamW
- Works well out of the box
- Handles varying gradient scales
- Good default hyperparameters
For large models, limited compute: SGD with momentum
- More computationally efficient than Adam
- Often achieves better generalization
- Requires careful learning rate tuning
For RNNs/LSTMs: Adam or RMSprop
- Handles gradient scale variations
- Adapts to different parameter magnitudes
For sparse features: Adagrad or Adam
- Adapts learning rate per parameter
- Handles sparse data well
Hyperparameter Defaults
| Optimizer | Learning Rate | Momentum | Beta1 | Beta2 |
|---|---|---|---|---|
| SGD | 0.01 | 0.9 | - | - |
| Adam | 0.001 | - | 0.9 | 0.999 |
| RMSprop | 0.001 | - | - | 0.99 |
| Adagrad | 0.01 | - | - | - |
Implementation in PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()

# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (most commonly used)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW (with weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Or cosine annealing
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Training loop
for epoch in range(100):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # step the scheduler once per epoch
```
Common Pitfalls
Gradient Exploding
Symptoms: Loss becomes NaN, weights become very large
Solutions:
- Gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
- Reduce learning rate
- Check for numerical issues in data
Gradient Vanishing
Symptoms: Loss doesn’t decrease, gradients are near zero
Solutions:
- Use ReLU activation
- Proper weight initialization
- Use batch normalization
- Consider LSTM/GRU for RNNs
Learning Rate Too Large
Symptoms: Loss oscillates or diverges
Solutions:
- Reduce learning rate
- Use learning rate finder
- Enable gradient clipping
Advanced Techniques
Gradient Clipping
```python
# Clip the total gradient norm across all parameters
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Or clip each gradient element individually
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```
Gradient Accumulation
For effective batch sizes too large to fit in memory, accumulate gradients over several smaller batches before each optimizer step:
```python
accumulation_steps = 4

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # Scale the loss so the accumulated gradient matches a full batch
    (loss / accumulation_steps).backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Mixed Precision Training
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    # Scale the loss to avoid underflow in float16 gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Conclusion
Understanding optimization algorithms is fundamental to machine learning success. While Adam has become the default choice for most deep learning applications, understanding the alternatives—and when they excel—enables better model development.
The field continues to evolve. New algorithms like Sharpness-Aware Minimization (SAM) improve generalization, while research into understanding why Adam sometimes underperforms SGD continues. The key is experimentation: no single optimizer works best for all problems.
Start with Adam, tune the learning rate, and consider alternatives when results are unsatisfactory. Understanding the fundamentals enables informed decisions about optimization strategy.
Resources
- An Overview of Gradient Descent Optimization Algorithms - Sebastian Ruder
- Adam: A Method for Stochastic Optimization
- PyTorch Optimizers Documentation
- Loss Landscape Visualization