Diffusion Models: The Mathematics of Generative AI

Published: March 16, 2026 Updated: May 25, 2026 Larry Qu 11 min read

Introduction

Diffusion models have become the dominant approach to generative image synthesis, powering systems like DALL-E 2, Stable Diffusion, and Midjourney. These models have transformed AI-generated content, producing high-quality images, video, and audio. In 2026, diffusion models continue to evolve, with advances in efficiency, control, and multi-modal generation.

The core intuition behind diffusion models is elegantly simple: start with pure noise and gradually denoise it to produce structured output. This process mirrors physical diffusion—hence the name—where particles spread from high to low concentration. By learning to reverse this diffusion process, the model learns to generate realistic data from randomness.

The Diffusion Process

Forward Diffusion: Adding Noise

The forward diffusion process progressively destroys structure in data by adding Gaussian noise over T timesteps. Starting from a clean data point x_0 (an image), we iteratively add noise:

q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I)

Where β_t is a variance schedule that increases from small to large values. After enough steps, x_T approximates isotropic Gaussian noise—the data structure is completely destroyed.

The key mathematical convenience is that we can directly sample x_t at any timestep t without iterating through all previous steps:

q(x_t | x_0) = N(x_t; √(ᾱ_t) x_0, (1 - ᾱ_t) I)

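Where ᾱ_t = ∏_{s=1}^{t} (1 - β_s) is the cumulative product of the per-step signal retention factors. This allows us to create a noisy version of any training image at any timestep in a single step.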
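The closed form can be checked numerically. The sketch below uses plain Python and assumes a linear schedule (β from 1e-4 to 0.02 over 1000 steps). It applies the one-step rule repeatedly, tracking the mean coefficient and noise variance of x_t given x_0, and compares the result with √ᾱ_t and 1 - ᾱ_t. The helper names are illustrative, not part of any library.

```python
import math

def linear_betas(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variance schedule (an assumed, commonly used choice)."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]

def iterate_forward_moments(betas):
    """Apply q(x_t | x_{t-1}) step by step, tracking x_t = m * x_0 + noise of variance v."""
    m, v = 1.0, 0.0
    for beta in betas:
        m = math.sqrt(1.0 - beta) * m   # the mean coefficient shrinks each step
        v = (1.0 - beta) * v + beta     # old noise is scaled down, new noise is added
    return m, v

def alpha_bar(betas):
    """Closed form: product of (1 - beta_s) over all steps."""
    prod = 1.0
    for beta in betas:
        prod *= 1.0 - beta
    return prod

betas = linear_betas()
m, v = iterate_forward_moments(betas)
a_bar = alpha_bar(betas)
print(f"iterated mean coeff {m:.6f} vs sqrt(alpha_bar) {math.sqrt(a_bar):.6f}")
print(f"iterated variance   {v:.6f} vs 1 - alpha_bar   {1 - a_bar:.6f}")
```

Both pairs agree to floating-point precision, and ᾱ_T is close to zero, which is why x_T is nearly pure Gaussian noise.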

Reverse Diffusion: Learning to Denoise

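If we could reverse the forward process, going from noise back to clean data, we would have a generative model. The exact reverse conditional q(x_{t-1} | x_t) is intractable because it depends on the unknown data distribution, but for small β_t it is approximately Gaussian, so we can learn to approximate it.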

The reverse process is parameterized as a neural network:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

The model learns to predict the mean μ_θ of the denoised distribution at each timestep. By iteratively applying this reverse process starting from random noise, we generate new samples.

Training Diffusion Models

The Objective Function

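Training maximizes a variational lower bound (ELBO) on the likelihood of the training data. In practice a simplified, reweighted form of that bound is used, which trains the model to predict the noise that was added at a randomly chosen timestep: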

L_simple = E_{t,x_0,ε}[||ε - ε_θ(√(ᾱ_t) x_0 + √(1 - ᾱ_t)ε, t)||²]

This objective is surprisingly simple: given a noisy image and timestep, predict the noise that was added. The model learns to separate signal from noise.
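The step from predicting the mean μ_θ to predicting the noise follows from the closed-form forward sample. The block below states the standard DDPM reparameterization using the symbols already defined, with α_t = 1 - β_t. It is a sketch of the link between the two views, not a full derivation of the variational bound.

```latex
\begin{aligned}
x_t &= \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon
\;\Longrightarrow\;
x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon}{\sqrt{\bar\alpha_t}}, \\
\mu_\theta(x_t, t) &= \frac{1}{\sqrt{\alpha_t}}
\left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t) \right),
\qquad \alpha_t = 1 - \beta_t .
\end{aligned}
```

A network that outputs ε_θ therefore determines the reverse mean, so the two parameterizations carry the same information.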

Architecture: U-Net

The denoising network is typically a U-Net architecture. U-Nets consist of an encoder that progressively downsamples, a decoder that upsamples, and skip connections that preserve spatial information. This architecture is well-suited for image-to-image tasks where fine-grained details matter.

Modern diffusion models add several enhancements: attention layers for global context, timestep embeddings to inform the model of its current noise level, and residual connections for training stability.
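Timestep embeddings are often built with the sinusoidal scheme from Transformers. The sketch below is a plain-Python version under that assumption. The function name, the default `dim`, and `max_period` are illustrative choices, and real models usually pass the result through a small MLP before adding it to the feature maps.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of an integer timestep t into a list of length dim (dim must be even)."""
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(t * f) for f in freqs]
    cosines = [math.cos(t * f) for f in freqs]
    return sines + cosines

emb = timestep_embedding(10)
print(len(emb), [round(x, 3) for x in emb[:2]])
```

Each timestep gets a distinct, smoothly varying vector, which lets one set of network weights behave differently at different noise levels.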

Sampling and Generation

Iterative Refinement

Generating samples requires starting from random noise and iteratively applying the learned reverse process:

x_{t-1} = μ_θ(x_t, t) + σ_t · ε

where ε ~ N(0, I) is sampled fresh at each step (no noise is added at the final step). This is repeated from t = T down to t = 1, gradually transforming noise into a coherent image.

More advanced samplers improve this basic approach. DDIM (Denoising Diffusion Implicit Models) enables deterministic generation with fewer steps. Other methods use guidance—classifiers or classifier-free guidance—to improve sample quality.

Classifier-Free Guidance

Classifier-free guidance dramatically improves sample quality by combining conditional and unconditional predictions:

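ε̃_θ(x_t, t, c) = ε_θ(x_t, t) + w · (ε_θ(x_t, t, c) - ε_θ(x_t, t))

Here ε_θ(x_t, t, c) is the prediction conditioned on c (for example, a text prompt) and ε_θ(x_t, t) is the unconditional prediction. The guidance scale w (typically 7-10 for Stable Diffusion) amplifies the conditioning signal, producing more detailed and faithful outputs at the cost of some diversity.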

Stable Diffusion and Latent Diffusion

The Latent Space Innovation

Running diffusion in pixel space is computationally expensive. Latent diffusion, the approach behind Stable Diffusion, compresses images into a smaller latent space using an encoder, runs diffusion in this compressed space, then decodes back to pixels.

In Stable Diffusion the autoencoder downsamples by 8x along each spatial dimension, so a 512x512x3 image becomes a 64x64x4 latent with roughly 48x fewer values, while maintaining quality. The latent space captures semantic image features, making generation more controllable and efficient.
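As a rough size check, the arithmetic below counts how many values the denoiser has to process in pixel space versus latent space. It assumes the common Stable Diffusion v1 configuration (512x512 RGB input, 8x downsampling per side, 4 latent channels), and the function name is illustrative.

```python
def latent_reduction(height=512, width=512, channels=3, downsample=8, latent_channels=4):
    """Return (pixel values, latent values, ratio) for an assumed autoencoder configuration."""
    pixel_values = height * width * channels
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return pixel_values, latent_values, pixel_values / latent_values

pixels, latents, ratio = latent_reduction()
print(pixels, latents, ratio)  # 786432 16384 48.0
```

The ratio counts values only. Actual compute savings also depend on the network, since attention cost grows faster than linearly with the number of spatial positions.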

Text-to-Image Generation

Text conditioning is achieved through cross-attention layers that modulate the diffusion process. A text encoder (typically CLIP) converts prompts into embeddings that guide generation. The model learns to align its outputs with both the noisy image and text condition.
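The cross-attention step can be sketched in a few lines of plain Python. In this toy version the image positions supply the queries and the text tokens supply the keys and values. The learned projection matrices that real models apply to all three are omitted, and every name and number here is illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head attention: each query (image position) mixes the values (text tokens)."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product similarity between this image position and every token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the token value vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

image_queries = [[1.0, 0.0], [0.0, 1.0]]          # two image positions
text_keys = [[4.0, 0.0], [0.0, 4.0]]              # two text tokens
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # what each token contributes
for row in cross_attention(image_queries, text_keys, text_values):
    print([round(x, 3) for x in row])
```

Each image position ends up dominated by the token whose key it matches best, which is how different parts of a prompt come to influence different regions of the image.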

Applications Beyond Images

Video Generation

Diffusion models have expanded to video generation. Systems such as OpenAI's Sora and Runway's Gen models generate coherent video clips, from a few seconds up to about a minute, from text prompts. Video diffusion faces unique challenges: maintaining temporal consistency across frames while generating at high resolution.

Audio and Speech

Diffusion models generate high-quality speech and music. Companies use these models for text-to-speech that sounds nearly indistinguishable from human voices. Audio diffusion works directly in the waveform domain or uses spectral representations.

3D and Geometry

Recent work applies diffusion to 3D shape generation and point cloud synthesis. These methods generate meshes or point clouds that can be used in games, CAD, and virtual reality applications.

Advanced Techniques

ControlNet and Spatial Control

ControlNet adds conditioning signals beyond text. It freezes the original diffusion model's weights, trains a copy of its encoder blocks as a control branch, and connects the two through zero-initialized convolutions, so it can condition on edge maps, depth maps, human poses, and other spatial inputs. This enables precise control over the structure of generated content.

DreamBooth and Personalization

DreamBooth fine-tunes diffusion models to learn specific concepts (like a particular pet or product) from just a few images. The model can then generate novel compositions featuring this concept in various contexts.

Inpainting and Editing

Diffusion models excel at image editing tasks. By masking parts of an image and re-diffusing, users can remove objects, replace backgrounds, or add new elements. Inpainting preserves consistency with the unmasked regions.
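One common way to keep the unmasked regions consistent is to overwrite them after every denoising step with the original image noised to the current timestep. The sketch below shows this in plain Python on a flat list of pixel values. The function and argument names are hypothetical, and real pipelines apply the same blend to tensors, often in latent space.

```python
import math
import random

def noise_to_level(x0, alpha_bar_t, rng):
    """Sample from q(x_t | x_0) for a flat list of values."""
    return [math.sqrt(alpha_bar_t) * v + math.sqrt(1 - alpha_bar_t) * rng.gauss(0, 1) for v in x0]

def blend_known_region(denoised, original, mask, alpha_bar_t, rng):
    """Keep model output where mask == 1 (region to repaint); elsewhere use the noised original."""
    known = noise_to_level(original, alpha_bar_t, rng)
    return [m * d + (1 - m) * k for d, k, m in zip(denoised, known, mask)]

rng = random.Random(0)
original = [0.2, 0.4, 0.6, 0.8]
denoised = [0.0, 0.0, 0.0, 0.0]   # stand-in for the model's current estimate
mask = [0, 0, 1, 1]               # repaint only the last two values
print(blend_known_region(denoised, original, mask, alpha_bar_t=1.0, rng=rng))
```

With `alpha_bar_t=1.0` no noise is added, so the first two values are exactly the original pixels and the last two come from the model.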

Implementing Diffusion Models

Basic DDPM Implementation

```python
import torch
import torch.nn as nn

class SimpleDiffusion:
    def __init__(self, timesteps=1000, beta_start=0.0001, beta_end=0.02):
        self.timesteps = timesteps
        self.beta = torch.linspace(beta_start, beta_end, timesteps)
        self.alpha = 1 - self.beta
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)

    def forward_diffusion(self, x0, t, device):
        """Add noise to a batch of images x0 at integer timesteps t (shape: [batch])."""
        noise = torch.randn_like(x0)
        # Move the schedule to the target device before indexing with t
        alpha_bar_t = self.alpha_bar.to(device)[t]
        sqrt_alpha_bar = torch.sqrt(alpha_bar_t).reshape(-1, 1, 1, 1)
        sqrt_one_minus_alpha_bar = torch.sqrt(1 - alpha_bar_t).reshape(-1, 1, 1, 1)
        return sqrt_alpha_bar * x0 + sqrt_one_minus_alpha_bar * noise, noise

class UNet(nn.Module):
    def __init__(self, in_channels=3, out_channels=3):
        super().__init__()
        # Simplified U-Net architecture (no skip connections or attention)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1)
        )

    def forward(self, x, t):
        # t is unused here; in practice a timestep embedding would be added
        return self.decoder(self.encoder(x))
```

Using Hugging Face Diffusers

```python
from diffusers import StableDiffusionPipeline
import torch

# Checkpoint ID as given in the original article; substitute any
# Stable Diffusion 1.x checkpoint you have access to.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt, guidance_scale=7.5).images[0]
```

Challenges and Future Directions

Computational Cost

Diffusion models require significant compute for training and sampling. Research focuses on reducing steps needed (down from 1000 to under 10), model distillation, and efficient architectures.

Consistency and Coherence

Generating long videos and maintaining global consistency remains challenging. Newer architectures explore recurrence, hierarchical generation, and world models to improve coherence.

Understanding Model Behavior

As diffusion models become more capable, understanding their failures, biases, and limitations becomes crucial. Research on interpretability, safety, and alignment continues to be important.

Forward Diffusion Process in Detail

Variance Schedule Design

The variance schedule β_t determines how quickly noise is added. Common choices are linear, cosine, and learned schedules; the first two are shown here:

import torch
import numpy as np

def cosine_beta_schedule(timesteps, s=0.008):
    """Cosine schedule from improved DDPM."""
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos((x / timesteps + s) / (1 + s) * np.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clamp(betas, 0.0001, 0.9999)

def linear_beta_schedule(timesteps, beta_start=0.0001, beta_end=0.02):
    return torch.linspace(beta_start, beta_end, timesteps)

The cosine schedule adds noise more gradually, preserving information for longer and improving sample quality at fewer sampling steps compared to linear schedules.
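To make "more gradually" concrete, the sketch below compares the remaining signal fraction ᾱ_t under both schedules at a few timesteps. It is plain Python and assumes T = 1000 with the default parameters used above; the function names are illustrative.

```python
import math

def linear_alpha_bars(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar_t under a linear beta schedule."""
    bars, prod = [], 1.0
    for i in range(timesteps):
        beta = beta_start + (beta_end - beta_start) * i / (timesteps - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def cosine_alpha_bars(timesteps=1000, s=0.008):
    """alpha_bar_t defined directly by the cosine rule, normalized so alpha_bar_0 = 1."""
    def f(t):
        return math.cos((t / timesteps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, timesteps + 1)]

linear, cosine = linear_alpha_bars(), cosine_alpha_bars()
for t in (250, 500, 750):
    # Columns: timestep, linear alpha_bar, cosine alpha_bar
    print(t, round(linear[t - 1], 3), round(cosine[t - 1], 3))
```

At each of these intermediate timesteps the cosine schedule retains noticeably more signal than the linear one, so fewer steps are spent on inputs that are already almost pure noise.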

Reparameterization Trick

The forward process admits a closed-form sample at any timestep:

def forward_diffusion_sample(x0, t, betas):
    """Sample from q(x_t | x_0) directly for a batch x0 and integer timesteps t."""
    alphas = 1 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    noise = torch.randn_like(x0)
    # Reshape per-sample coefficients to (batch, 1, 1, 1) so they broadcast over C, H, W
    sqrt_alpha_bar = torch.sqrt(alpha_bars[t]).reshape(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar = torch.sqrt(1 - alpha_bars[t]).reshape(-1, 1, 1, 1)

    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    x_t = sqrt_alpha_bar * x0 + sqrt_one_minus_alpha_bar * noise
    return x_t, noise

This closed-form enables efficient training by sampling random timesteps and noise rather than simulating the full forward chain.

Reverse Diffusion Process in Detail

Parameterizing the Reverse Mean

The reverse process p_θ(x_{t-1} | x_t) is parameterized to predict the noise ε added during forward diffusion:

import torch
import torch.nn.functional as F

def p_losses(model, x0, t, betas):
    """Compute diffusion loss for noise prediction."""
    alpha_bars = torch.cumprod(1 - betas, dim=0)
    noise = torch.randn_like(x0)

    # Forward diffusion; coefficients reshaped to broadcast over (C, H, W)
    sqrt_alpha_bar = torch.sqrt(alpha_bars[t]).reshape(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar = torch.sqrt(1 - alpha_bars[t]).reshape(-1, 1, 1, 1)
    x_t = sqrt_alpha_bar * x0 + sqrt_one_minus_alpha_bar * noise

    # Predict noise
    predicted_noise = model(x_t, t)

    # Simple MSE loss between true and predicted noise
    loss = F.mse_loss(noise, predicted_noise)
    return loss

Sampling Loop

@torch.no_grad()
def sample(model, image_size, timesteps, betas):
    """Generate samples by iteratively denoising."""
    batch_size = 1
    x_t = torch.randn((batch_size, 3, image_size, image_size))
    alphas = 1 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    for t in reversed(range(timesteps)):
        t_tensor = torch.full((batch_size,), t, dtype=torch.long)

        # Predict noise
        predicted_noise = model(x_t, t_tensor)

        # Compute x_{t-1}
        alpha_t = alphas[t]
        alpha_bar_t = alpha_bars[t]
        beta_t = betas[t]

        coeff1 = 1 / torch.sqrt(alpha_t)
        coeff2 = (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t)
        x_t_minus_1 = coeff1 * (x_t - coeff2 * predicted_noise)

        # Add noise (except for final step)
        if t > 0:
            x_t_minus_1 += torch.sqrt(beta_t) * torch.randn_like(x_t)

        x_t = x_t_minus_1

    return x_t

DDPM vs DDIM

DDIM (Denoising Diffusion Implicit Models) modifies the sampling process to be deterministic:

DDPM: x_{t-1} = μ_θ(x_t, t) + σ_t · ε, with fresh noise ε ~ N(0, I) at every step
DDIM: x_{t-1} = √(ᾱ_{t-1}) · x̂_0 + √(1 - ᾱ_{t-1}) · ε_θ(x_t, t), where x̂_0 = (x_t - √(1 - ᾱ_t) · ε_θ(x_t, t)) / √(ᾱ_t) is the clean image implied by the predicted noise

| Aspect | DDPM | DDIM |
|---|---|---|
| Sampling | Stochastic | Deterministic |
| Steps needed | 100-1000 | 10-100 |
| Quality | Excellent | Slightly lower (few steps) |
| Interpolation | Not smooth | Smooth latent interpolation |
| Consistency | Varies per run | Identical for same latent |

DDIM enables latent-space interpolation between images by smoothly varying the initial noise while keeping the sampling process deterministic. This is useful for creating smooth transitions in generated content.
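A single deterministic DDIM update can be written in a few lines. The sketch below works on scalars in plain Python and takes the ᾱ values as arguments. The function name and the toy numbers are illustrative, and the `sigma` and `noise` arguments are included only to show where stochasticity would re-enter.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, sigma=0.0, noise=0.0):
    """One DDIM update on a scalar; sigma = 0 gives the deterministic sampler."""
    # Estimate the clean sample implied by the predicted noise
    x0_pred = (x_t - math.sqrt(1 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the previous (less noisy) level along the predicted direction
    direction = math.sqrt(max(1 - alpha_bar_prev - sigma ** 2, 0.0)) * eps_pred
    return math.sqrt(alpha_bar_prev) * x0_pred + direction + sigma * noise

# Build x_t from a known x_0 and noise, then step with a perfect noise prediction
x0, eps = 0.7, -1.3
alpha_bar_t, alpha_bar_prev = 0.2, 0.6
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
expected = math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1 - alpha_bar_prev) * eps
print(abs(x_prev - expected) < 1e-12)
```

Because the update depends only on ᾱ_t and ᾱ_{t-1}, the sampler can jump between non-adjacent timesteps, which is what allows DDIM to use far fewer steps.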

Latent Diffusion Architecture

Stable Diffusion uses a Variational Autoencoder (VAE) to compress images into a latent space. The VAE downsamples by 8x along each spatial dimension (512x512x3 pixels to a 64x64x4 latent), so the diffusion model processes roughly 48x fewer values:

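import torch

class StableDiffusion:
    """Conceptual latent diffusion pipeline; components are injected, not real library classes."""

    def __init__(self, vae, unet, text_encoder, ddim_step):
        self.vae = vae                    # autoencoder, 8x downsampling per spatial dimension
        self.unet = unet                  # denoiser that works in latent space
        self.text_encoder = text_encoder  # maps a prompt to embeddings (CLIP in Stable Diffusion)
        self.ddim_step = ddim_step        # callable: (latents, noise_pred, t) -> latents

    @torch.no_grad()
    def generate(self, prompt, guidance_scale=7.5, steps=50):
        # Encode the prompt and an empty prompt for classifier-free guidance
        text_embeddings = self.text_encoder(prompt)
        null_embeddings = self.text_encoder("")

        # Start from random latent
        latents = torch.randn(1, 4, 64, 64)  # For 512x512 images

        # Denoise in latent space
        for t in reversed(range(steps)):
            cond_pred = self.unet(latents, t, text_embeddings)
            uncond_pred = self.unet(latents, t, null_embeddings)
            noise_pred = uncond_pred + guidance_scale * (cond_pred - uncond_pred)
            latents = self.ddim_step(latents, noise_pred, t)

        # Decode to pixel space
        return self.vae.decode(latents)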

The latent space captures semantic features while discarding imperceptible high-frequency details. This makes generation more efficient and often produces better results because the model focuses on meaningful image structure.

Classifier-Free Guidance Mathematical Derivation

Classifier-free guidance combines conditional and unconditional predictions to amplify conditioning:

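def guided_prediction(unet, x_t, t, text_embeddings, guidance_scale=7.5, null_embeddings=None):
    """Apply classifier-free guidance."""
    # Predict with conditioning
    cond_pred = unet(x_t, t, text_embeddings)

    # Predict without conditioning. Stable Diffusion passes the embedding of an
    # empty prompt here; zeros are only a stand-in when none is supplied.
    if null_embeddings is None:
        null_embeddings = torch.zeros_like(text_embeddings)
    uncond_pred = unet(x_t, t, null_embeddings)

    # Guided prediction: move from the unconditional output toward the conditional one
    guided = uncond_pred + guidance_scale * (cond_pred - uncond_pred)
    return guided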

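The guidance scale w (guidance_scale in the code) amplifies the conditioning signal. At w=0, the model ignores conditioning entirely; at w=1, the output is the plain conditional prediction; values above 1 extrapolate past it. At w=7-10, conditioning strongly influences the output. Higher values produce more faithful but less diverse samples.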
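The effect of the scale is easy to see on toy numbers. The sketch below applies the same formula as the function above, uncond + w * (cond - uncond), to two-element lists in plain Python. The values are made up for illustration.

```python
def guide(uncond, cond, w):
    """Classifier-free guidance on lists of floats: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.10, -0.20]
cond = [0.30, 0.00]
for w in (0.0, 1.0, 7.5):
    print(w, [round(v, 3) for v in guide(uncond, cond, w)])
```

At w = 0 the result equals the unconditional prediction, at w = 1 it equals the conditional one, and at w = 7.5 it lands far beyond the conditional prediction in the same direction.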

Sampling Speed Techniques

Latent Consistency Models (LCM)

LCM distills the diffusion process into a few steps by training the model to directly predict the solution of the probability flow ODE (PF-ODE) in latent space:

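from diffusers import StableDiffusionPipeline, LCMScheduler

# "model" is a placeholder; the checkpoint must be LCM-distilled
# (or have LCM-LoRA weights loaded) for few-step sampling to work.
pipe = StableDiffusionPipeline.from_pretrained("model")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

prompt = "a photograph of an astronaut riding a horse"

# Generate in 4 steps
image = pipe(prompt, num_inference_steps=4, guidance_scale=2.0).images[0]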

Progressive Distillation

Progressive distillation trains a student model to match, in one step, what its teacher produces in two sampling steps. The student then becomes the teacher for the next round, so each round halves the required steps. Starting from a 1024-step teacher: 1024 -> 512 -> 256 -> 128 -> 64 -> 32 -> 16 -> 8 -> 4.
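The number of distillation rounds follows directly from the halving rule. The small helper below (a hypothetical name, plain Python) lists the step count after each round for the 1024-step teacher and 4-step target mentioned above.

```python
def distillation_schedule(teacher_steps=1024, target_steps=4):
    """List the sampler step counts after each round of step-halving distillation."""
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        schedule.append(schedule[-1] // 2)
    return schedule

schedule = distillation_schedule()
print(schedule)           # [1024, 512, 256, 128, 64, 32, 16, 8, 4]
print(len(schedule) - 1)  # 8 rounds of distillation
```

Each round is a full training run, so reaching 4 steps from 1024 costs eight successive student trainings.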

Consistency Models

Consistency models learn a mapping from any noisy point on a sampling trajectory directly to the clean sample at its start, enabling single-step generation. This trades some quality for dramatically faster sampling, making diffusion viable for real-time applications.

Training Diffusion Models: Full Pipeline

import torch
import torch.nn as nn
import torch.nn.functional as F

def train_diffusion(model, dataloader, timesteps=1000, epochs=100):
    """Full training pipeline for DDPM."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    betas = linear_beta_schedule(timesteps)

    for epoch in range(epochs):
        for batch in dataloader:
            images = batch[0]
            batch_size = images.shape[0]

            # Sample random timesteps
            t = torch.randint(0, timesteps, (batch_size,))

            # Forward diffusion and loss
            loss = p_losses(model, images, t, betas)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

GANs vs VAEs vs Diffusion Models

| Aspect | GANs | VAEs | Diffusion Models |
|---|---|---|---|
| Training stability | Unstable (minimax) | Stable | Very stable |
| Sample quality | Excellent | Blurry | Excellent |
| Mode coverage | Poor (mode collapse) | Complete | Complete |
| Sampling speed | Fast (1 pass) | Fast (1 pass) | Slow (many steps) |
| Likelihood estimation | No | Yes | Approx via ELBO |
| Latent space | Interpretable | Structured | Complex |
| Diversity | Limited | High | High |

Diffusion models excel at quality and diversity but require many sampling steps. Recent advances in distillation and consistency models are narrowing the speed gap, making diffusion increasingly practical for real-time applications.

Conclusion

Diffusion models represent a breakthrough in generative AI, producing images, video, and audio of remarkable quality. The mathematical framework—learning to reverse a diffusion process—provides a stable training objective while enabling diverse conditioning signals. As efficiency improves and applications expand, diffusion models will continue to reshape creative industries and beyond.
