Introduction
Optimization algorithms form the backbone of machine learning. Whether training a simple linear regression model or a deep neural network with millions of parameters, the choice of optimization algorithm significantly impacts training speed, convergence quality, and final model performance. Among these algorithms, gradient descent and its variants dominate practical applications.
In 2026, understanding optimization algorithms remains essential for machine learning practitioners. This article provides a comprehensive exploration of gradient descent variants, from basic batch gradient descent to sophisticated adaptive methods like Adam. We examine the mathematical foundations, implementation details, and practical guidance for selecting and tuning optimizers.
Fundamentals of Gradient Descent
The Core Idea
Gradient descent optimizes a differentiable objective function by iteratively moving in the direction of the negative gradient. Given an objective function J(θ) with parameters θ, the update rule is:
θ = θ - η × ∇J(θ)
where η is the learning rate (step size) and ∇J(θ) is the gradient of the objective function with respect to parameters.
The gradient points in the direction of steepest ascent, so moving in the opposite direction decreases the function value. This simple principle underlies virtually all neural network training.
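To make the update rule concrete, here is a minimal sketch on the toy objective J(θ) = θ², whose gradient is 2θ and whose minimum sits at θ = 0 (the function name and objective are illustrative, not from any library):

```python
def gradient_descent_1d(theta0=5.0, lr=0.1, steps=50):
    theta = theta0
    for _ in range(steps):
        grad = 2 * theta      # gradient of J(theta) = theta**2
        theta -= lr * grad    # the update rule: theta = theta - lr * grad
    return theta

print(gradient_descent_1d())  # decays toward 0
```

Each step multiplies θ by (1 - 2η), so with η = 0.1 the parameter shrinks geometrically toward the minimum.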
Learning Rate
The learning rate η is perhaps the most critical hyperparameter:
- Too large: Updates overshoot minima, causing divergence or oscillation
- Too small: Convergence is painfully slow, potentially getting stuck in local minima
- Just right: Efficient convergence to a good solution
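The three regimes are easy to observe on the same toy quadratic J(θ) = θ², where each update multiplies θ by (1 - 2η); the helper below is illustrative only:

```python
def run_gd(lr, theta0=5.0, steps=100):
    """Plain gradient descent on J(theta) = theta**2."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * 2 * theta  # each step multiplies theta by (1 - 2*lr)
    return theta

print(run_gd(0.1))    # just right: |theta| shrinks rapidly toward 0
print(run_gd(1e-4))   # too small: theta barely moves from 5.0
print(run_gd(1.05))   # too large: updates overshoot and |theta| blows up
```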
Variants of Gradient Descent
Three primary variants differ in how much data is used per gradient computation:
Batch Gradient Descent (BGD) computes the gradient using the entire dataset:
θ = θ - η × ∇J(θ; all training samples)
Pros: Stable convergence, smooth gradient estimates
Cons: Slow for large datasets, requires entire dataset in memory
Stochastic Gradient Descent (SGD) computes the gradient using a single sample:
θ = θ - η × ∇J(θ; x_i, y_i)
Pros: Fast iterations, can escape local minima, online learning
Cons: Noisy gradients, requires many iterations
Mini-batch Gradient Descent balances both approaches:
θ = θ - η × ∇J(θ; batch of b samples)
This is the standard approach in deep learning, typically using batch sizes of 32, 64, 128, or 256.
```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Full-dataset gradient of the mean squared error
        gradient = (1/m) * X.T @ (X @ theta - y)
        theta -= lr * gradient
    return theta
```
```python
def stochastic_gradient_descent(X, y, lr=0.01, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            xi = X[i:i+1]
            yi = y[i:i+1]
            # Gradient estimated from a single sample
            gradient = xi.T @ (xi @ theta - yi)
            theta -= lr * gradient
    return theta
```
```python
def mini_batch_gradient_descent(X, y, lr=0.01, epochs=100, batch_size=32):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Shuffle each epoch so batches differ between passes
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, m, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            gradient = (1/len(X_batch)) * X_batch.T @ (X_batch @ theta - y_batch)
            theta -= lr * gradient
    return theta
```
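As a quick sanity check, the mini-batch variant can be run on synthetic least-squares data; the snippet inlines a compact copy of the implementation above so it runs standalone (names like `mini_batch_gd` and the synthetic `true_theta` are illustrative):

```python
import numpy as np

def mini_batch_gd(X, y, lr=0.05, epochs=200, batch_size=32):
    # Same logic as mini_batch_gradient_descent above
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = np.random.permutation(m)
        for i in range(0, m, batch_size):
            Xb, yb = X[idx[i:i+batch_size]], y[idx[i:i+batch_size]]
            theta -= lr * (1 / len(Xb)) * Xb.T @ (Xb @ theta - yb)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=500)

theta = mini_batch_gd(X, y)
print(np.round(theta, 2))  # recovers roughly [1.5, -2.0, 0.5]
```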
Momentum-Based Methods
The Problem with Vanilla Gradient Descent
In directions with small gradients, vanilla gradient descent makes slow progress. Additionally, oscillations occur when gradients change direction rapidly, as in ravines—elongated curved structures common in optimization landscapes.
Momentum
Momentum simulates a ball rolling down a hill, accumulating velocity in consistent directions:
v = γ × v + η × ∇J(θ)
θ = θ - v
where γ (typically 0.9) is the momentum coefficient. Momentum accelerates in consistent directions and dampens oscillations.
```python
def gradient_descent_with_momentum(X, y, lr=0.01, gamma=0.9, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)  # velocity accumulator
    for _ in range(epochs):
        gradient = (1/m) * X.T @ (X @ theta - y)
        v = gamma * v + lr * gradient
        theta -= v
    return theta
```
Nesterov Accelerated Gradient (NAG)
NAG looks ahead before computing the gradient:
v = γ × v + η × ∇J(θ - γ × v)
θ = θ - v
This “look-ahead” mechanism provides better convergence than standard momentum.
```python
def nesterov_accelerated_gradient(X, y, lr=0.01, gamma=0.9, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)
    for _ in range(epochs):
        # Evaluate the gradient at the "look-ahead" position
        theta_ahead = theta - gamma * v
        gradient = (1/m) * X.T @ (X @ theta_ahead - y)
        v = gamma * v + lr * gradient
        theta -= v
    return theta
```
Adaptive Learning Rate Methods
Adaptive methods adjust learning rates per parameter based on gradient history, handling sparse features and varying gradient scales.
Adagrad
Adagrad scales learning rates inversely to the square root of accumulated gradients:
G = G + g²
θ = θ - η/√(G + ε) × g
where g is the gradient and ε prevents division by zero.
```python
def adagrad(X, y, lr=0.1, eps=1e-8, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    G = np.zeros(n)  # per-parameter sum of squared gradients
    for _ in range(epochs):
        gradient = (1/m) * X.T @ (X @ theta - y)
        G += gradient ** 2
        theta -= (lr / np.sqrt(G + eps)) * gradient
    return theta
```
Adagrad adapts well to sparse features, but because G only grows, the effective learning rate decreases monotonically and can eventually stall training.
RMSprop
RMSprop divides by exponential moving average of squared gradients:
E[g²] = γ × E[g²] + (1-γ) × g²
θ = θ - η/√(E[g²] + ε) × g
Because the exponential average forgets old gradients, the effective learning rate no longer shrinks monotonically as in Adagrad, while still adapting to gradient magnitudes.
```python
def rmsprop(X, y, lr=0.01, gamma=0.9, eps=1e-8, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    Eg2 = np.zeros(n)  # exponential moving average of squared gradients
    for _ in range(epochs):
        gradient = (1/m) * X.T @ (X @ theta - y)
        Eg2 = gamma * Eg2 + (1 - gamma) * (gradient ** 2)
        theta -= (lr / np.sqrt(Eg2 + eps)) * gradient
    return theta
```
Adam
Adam (Adaptive Moment Estimation) combines momentum and RMSprop:
m = β₁ × m + (1-β₁) × g    (first moment)
v = β₂ × v + (1-β₂) × g²   (second moment)
m̂ = m/(1-β₁ᵗ)
v̂ = v/(1-β₂ᵗ)
θ = θ - η × m̂/(√v̂ + ε)
```python
def adam(X, y, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, epochs=100):
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    m = np.zeros(n_features)  # first moment (mean of gradients)
    v = np.zeros(n_features)  # second moment (uncentered variance)
    for t in range(1, epochs + 1):
        gradient = (1/n_samples) * X.T @ (X @ theta - y)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * (gradient ** 2)
        # Bias correction compensates for zero initialization of m and v
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= (lr * m_hat) / (np.sqrt(v_hat) + eps)
    return theta
```
AdamW (Adam with Weight Decay)
AdamW decouples weight decay from gradient-based updates:
θ = θ - η × (m̂/(√v̂ + ε) + λ × θ)
This provides better generalization than L2 regularization in Adam.
```python
def adamw(X, y, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
          weight_decay=0.01, epochs=100):
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    m = np.zeros(n_features)
    v = np.zeros(n_features)
    for t in range(1, epochs + 1):
        gradient = (1/n_samples) * X.T @ (X @ theta - y)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * (gradient ** 2)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Weight decay is applied directly to the parameters,
        # not folded into the gradient
        theta -= lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta
```
Learning Rate Scheduling
Why Schedule Learning Rates?
Gradually reducing the learning rate improves convergence:
- Large rates early for fast progress
- Smaller rates later for fine-tuning
Types of Schedules
Step Decay: Reduce learning rate by a factor every N epochs
```python
def step_decay_schedule(initial_lr, drop_rate=0.5, epochs_drop=10):
    def schedule(epoch):
        return initial_lr * (drop_rate ** (epoch // epochs_drop))
    return schedule
```
Exponential Decay: Continuously reduce learning rate exponentially
```python
def exponential_decay(initial_lr, decay_rate=0.95):
    def schedule(epoch):
        return initial_lr * (decay_rate ** epoch)
    return schedule
```
Cosine Annealing: Smooth reduction following cosine curve
```python
def cosine_annealing(initial_lr, T_max, eta_min=0):
    def schedule(epoch):
        return eta_min + (initial_lr - eta_min) * \
               (1 + np.cos(np.pi * epoch / T_max)) / 2
    return schedule
```
Warm-up: Start with a small learning rate and ramp it up linearly; in practice a decay schedule typically takes over after the warm-up period
```python
def warmup_schedule(initial_lr, warmup_epochs, target_lr):
    def schedule(epoch):
        if epoch < warmup_epochs:
            # Linear ramp from initial_lr up to target_lr
            return initial_lr + (target_lr - initial_lr) * epoch / warmup_epochs
        return target_lr
    return schedule
```
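To see what a schedule actually produces, the cosine schedule can be evaluated at a few epochs; the snippet inlines a copy of `cosine_annealing` so it runs standalone:

```python
import numpy as np

def cosine_annealing(initial_lr, T_max, eta_min=0.0):
    # Same form as the schedule defined above
    def schedule(epoch):
        return eta_min + (initial_lr - eta_min) * \
               (1 + np.cos(np.pi * epoch / T_max)) / 2
    return schedule

sched = cosine_annealing(initial_lr=0.1, T_max=100)
print(sched(0))    # 0.1: full learning rate at the start of the cycle
print(sched(50))   # ~0.05: halfway through, since cos(pi/2) = 0
print(sched(100))  # ~0.0: fully annealed at the end of the cycle
```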
Second-Order Methods
Newton’s Method
Newton’s method uses second-order information:
θ = θ - H⁻¹ × ∇J(θ)
where H is the Hessian matrix of second derivatives. Newton's method can converge in far fewer iterations than first-order methods, but computing, storing, and inverting the Hessian is prohibitively expensive for models with many parameters.
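For a quadratic objective the Hessian is constant, so a single Newton step lands exactly on the minimizer. A minimal sketch (the matrix A and vector b below are made up for illustration):

```python
import numpy as np

# J(theta) = 0.5 * theta^T A theta - b^T theta
# gradient = A @ theta - b, Hessian = A (constant for a quadratic)
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])

theta = np.zeros(2)
gradient = A @ theta - b
# Solve H x = gradient rather than forming H^{-1} explicitly
theta = theta - np.linalg.solve(A, gradient)

print(np.allclose(A @ theta, b))  # True: one step reaches the optimum
```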
L-BFGS
Limited-memory Broyden-Fletcher-Goldfarb-Shanno approximates the Hessian using gradient history. Effective for small-to-medium problems.
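In practice L-BFGS is rarely implemented by hand; SciPy exposes it through `scipy.optimize.minimize`. A sketch on the Rosenbrock test function, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a classic curved-valley test problem
# with its global minimum at (1, 1)
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

result = minimize(rosenbrock, x0=np.array([-1.0, 2.0]), method="L-BFGS-B")
print(result.x)  # close to [1.0, 1.0]
```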
Choosing an Optimizer
Practical Guidelines
For most deep learning: Adam or AdamW
- Works well out of the box
- Handles varying gradient scales
- Good default hyperparameters
For large models, limited compute: SGD with momentum
- More computationally efficient than Adam
- Often achieves better generalization
- Requires careful learning rate tuning
For RNNs/LSTMs: Adam or RMSprop
- Handles gradient scale variations
- Adapts to different parameter magnitudes
For sparse features: Adagrad or Adam
- Adapts learning rate per parameter
- Handles sparse data well
Hyperparameter Defaults
| Optimizer | Learning Rate | Momentum | Beta1 | Beta2 |
|---|---|---|---|---|
| SGD | 0.01 | 0.9 | - | - |
| Adam | 0.001 | - | 0.9 | 0.999 |
| RMSprop | 0.001 | - | - | 0.99 |
| Adagrad | 0.01 | - | - | - |
Implementation in PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()

# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (most commonly used)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW (with weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Or cosine annealing
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Training loop
for epoch in range(100):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # step the scheduler once per epoch
```
Common Pitfalls
Gradient Exploding
Symptoms: Loss becomes NaN, weights become very large
Solutions:
- Gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
- Reduce learning rate
- Check for numerical issues in data
Gradient Vanishing
Symptoms: Loss doesn’t decrease, gradients are near zero
Solutions:
- Use ReLU activation
- Proper weight initialization
- Use batch normalization
- Consider LSTM/GRU for RNNs
Learning Rate Too Large
Symptoms: Loss oscillates or diverges
Solutions:
- Reduce learning rate
- Use learning rate finder
- Enable gradient clipping
Advanced Techniques
Gradient Clipping
```python
# Clip the total gradient norm across all parameters
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Or clip each gradient element individually
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```
Gradient Accumulation
For effective batch sizes too large to fit in memory, accumulate gradients over several smaller batches before each optimizer step:
```python
accumulation_steps = 4

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # Scale the loss so the accumulated gradient matches a full batch
    (loss / accumulation_steps).backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Mixed Precision Training
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    # Scale the loss to avoid underflow in float16 gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Conclusion
Understanding optimization algorithms is fundamental to machine learning success. While Adam has become the default choice for most deep learning applications, understanding the alternatives—and when they excel—enables better model development.
The field continues to evolve. New algorithms like Sharpness-Aware Minimization (SAM) improve generalization, while research into understanding why Adam sometimes underperforms SGD continues. The key is experimentation: no single optimizer works best for all problems.
Start with Adam, tune the learning rate, and consider alternatives when results are unsatisfactory. Understanding the fundamentals enables informed decisions about optimization strategy.
Resources
- An Overview of Gradient Descent Optimization Algorithms - Sebastian Ruder
- Adam: A Method for Stochastic Optimization
- PyTorch Optimizers Documentation
- Loss Landscape Visualization