Introduction
In the landscape of machine learning, probabilistic models have long dominated—from Gaussian distributions to Bayesian networks to modern variational autoencoders. However, there’s an elegant alternative approach that frames learning as energy minimization rather than probability maximization: Energy-Based Models (EBMs).
Proposed by Yann LeCun and colleagues in the 2000s, energy-based models offer a unified framework that treats learning as finding low-energy configurations in a system. Rather than explicitly modeling probability distributions, EBMs assign scalar energy values to configurations of variables. Lower energy means higher compatibility—exactly what we want for good solutions.
In 2026, EBMs have experienced a renaissance, particularly in generative modeling, out-of-distribution detection, and building AI systems that understand constraints. This comprehensive guide explores the theory, implementation, and practical applications of energy-based models.
Foundations of Energy-Based Models
The Energy Metaphor
The core idea is beautifully simple: instead of modeling P(x) directly, we learn an energy function E(x) that maps each configuration x to a scalar. The energy can be thought of as a “cost” or “incompatibility measure”—configurations we want are assigned low energy, while undesirable configurations get high energy.
import torch
import torch.nn as nn

class SimpleEBM(nn.Module):
    """
    Basic Energy-Based Model that assigns energy to inputs.
    Lower energy = more likely configuration.
    """
    def __init__(self, input_dim, hidden_dim=256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Single energy value
        )

    def forward(self, x):
        """Compute energy for input configuration."""
        return self.network(x)

    def energy(self, x):
        """Alias for forward pass."""
        return self.forward(x)
From Energy to Probability
While EBMs don’t directly output probabilities, we can convert energies to probabilities using the Gibbs distribution (also called Boltzmann distribution):
$$P(x) = \frac{1}{Z} e^{-E(x)}$$
where Z is the partition function:
$$Z = \sum_{x'} e^{-E(x')}$$
The partition function normalizes the energies into a proper probability distribution by summing over all possible configurations—a computationally intractable problem for high-dimensional spaces.
def gibbs_distribution(energy, temperature=1.0):
    """
    Convert energy to an unnormalized probability using the Gibbs distribution.

    Args:
        energy: Energy values (lower = more probable)
        temperature: Temperature parameter (higher = more uniform)
    """
    # P(x) ∝ exp(-E(x)/T)
    return torch.exp(-energy / temperature)

def partition_function_estimate(energies):
    """
    Estimate the partition function from a set of sampled energies.
    Z = Σ exp(-E(x))
    """
    log_weights = -energies
    max_log = log_weights.max()
    # Log-sum-exp trick for numerical stability
    log_z = max_log + torch.log(
        torch.exp(log_weights - max_log).sum() + 1e-10
    )
    return torch.exp(log_z)
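On a small discrete space the partition function is exactly computable, so we can sanity-check the conversion above. A minimal sketch with hand-picked energies (the five-point space is illustrative):

```python
import torch

# Toy discrete space: 5 configurations with hand-picked energies
energies = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])

# Unnormalized Gibbs weights: exp(-E(x))
weights = torch.exp(-energies)

# On a discrete space, the exact partition function is just a sum
Z = weights.sum()

# Normalized probabilities: lower energy -> higher probability
probs = weights / Z
print(probs)
```

Note that `probs` is exactly `torch.softmax(-energies, dim=0)`: the Gibbs distribution over a finite space is a softmax over negative energies.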
Why Choose Energy-Based Models?
EBMs offer several advantages over explicit probabilistic models:
- No Normalization Required During Training: Unlike VAEs or normalizing flows, we don’t need to compute Z during training.
- Flexible Architecture: Any differentiable network that outputs a scalar can define an energy function.
- Natural Handling of Constraints: We can incorporate hard constraints by assigning very high (effectively infinite) energy to invalid configurations.
- Interpretable Energy Values: Energy scores directly indicate how “good” or “bad” a configuration is.
- Compositionality: Energy functions can be combined through addition, enabling complex systems from simple components.
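The compositionality point is easy to demonstrate: summing two energy functions yields a model whose low-energy region satisfies both components (equivalently, multiplying the unnormalized densities). A hypothetical sketch with two toy quadratic energies:

```python
import torch
import torch.nn as nn

class ComposedEBM(nn.Module):
    """Combine independently defined energies by addition."""
    def __init__(self, *ebms):
        super().__init__()
        self.ebms = nn.ModuleList(ebms)

    def forward(self, x):
        # Total energy is the sum of component energies,
        # i.e. the product of the (unnormalized) densities.
        return sum(ebm(x) for ebm in self.ebms)

class QuadraticEnergy(nn.Module):
    """Toy energy preferring points near `center`."""
    def __init__(self, center):
        super().__init__()
        self.center = center

    def forward(self, x):
        return ((x - self.center) ** 2).sum(dim=1, keepdim=True)

# One component prefers x near 0, the other x near 1
combined = ComposedEBM(QuadraticEnergy(0.0), QuadraticEnergy(1.0))
x = torch.tensor([[0.5], [5.0]])
e = combined(x)
print(e)  # the compromise point 0.5 gets much lower energy than 5.0
```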
Architecture Types
1. Boltzmann Machines
The classic energy-based model—the Boltzmann Machine—defines energy over binary random variables. In its restricted form (no intra-layer connections), the energy is:
$$E(v, h) = -b^\top v - c^\top h - v^\top W h$$
where:
- v = visible units
- h = hidden units
- b, c = biases
- W = weight matrix
class RestrictedBoltzmannMachine(nn.Module):
    """
    Restricted Boltzmann Machine with visible and hidden layers.
    RBMs have no intra-layer connections, making inference tractable.
    """
    def __init__(self, n_visible, n_hidden, k=1):
        super().__init__()
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        self.k = k  # Number of Gibbs sampling steps

        # Weight matrix (n_visible x n_hidden)
        self.W = nn.Parameter(
            torch.randn(n_visible, n_hidden) * 0.01
        )
        # Biases
        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))

    def energy(self, v, h):
        """Compute energy of a batch of (v, h) configurations."""
        return -(
            ((v @ self.W) * h).sum(dim=1) +
            (self.v_bias * v).sum(dim=1) +
            (self.h_bias * h).sum(dim=1)
        )

    def sample_h_given_v(self, v):
        """Sample hidden units given visible units."""
        # P(h=1|v) = sigmoid(v @ W + c)
        h_prob = torch.sigmoid(v @ self.W + self.h_bias)
        h_sample = torch.bernoulli(h_prob)
        return h_prob, h_sample

    def sample_v_given_h(self, h):
        """Sample visible units given hidden units."""
        # P(v=1|h) = sigmoid(h @ W^T + b)
        v_prob = torch.sigmoid(h @ self.W.t() + self.v_bias)
        v_sample = torch.bernoulli(v_prob)
        return v_prob, v_sample

    def gibbs_step(self, v):
        """Perform one Gibbs sampling step."""
        h_prob, h_sample = self.sample_h_given_v(v)
        v_prob, v_sample = self.sample_v_given_h(h_sample)
        return v_prob, v_sample

    def contrastive_divergence(self, v0, lr=0.01):
        """
        Train RBM using Contrastive Divergence (CD-k).
        CD approximates the gradient of the log-likelihood
        using k steps of Gibbs sampling.
        """
        # Positive phase: hidden statistics driven by the data
        h0_prob, h0_sample = self.sample_h_given_v(v0)

        # Negative phase: run the Gibbs chain for k steps
        hk_sample = h0_sample
        for _ in range(self.k):
            vk_prob, vk = self.sample_v_given_h(hk_sample)
            hk_prob, hk_sample = self.sample_h_given_v(vk)

        # Compute gradients
        positive_grad = v0.t() @ h0_prob
        negative_grad = vk.t() @ hk_prob

        # Update parameters
        self.W.data += lr * (positive_grad - negative_grad) / v0.size(0)
        self.v_bias.data += lr * (v0 - vk).mean(dim=0)
        self.h_bias.data += lr * (h0_prob - hk_prob).mean(dim=0)
2. Neural Energy Models
Modern EBMs use deep networks as energy functions:
class ConvEBM(nn.Module):
    """
    Convolutional Energy-Based Model for images.
    """
    def __init__(self, channels=3, height=32, width=32):
        super().__init__()
        # Convolutional feature extractor; each stride-2 conv halves
        # the spatial resolution, so 32x32 inputs end at 4x4
        self.features = nn.Sequential(
            nn.Conv2d(channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )
        # Energy computation
        self.energy_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1)
        )

    def forward(self, x):
        features = self.features(x)
        energy = self.energy_head(features)
        return energy
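A quick smoke test of this architecture: 32×32 RGB inputs reduce to a 4×4×256 feature map, which is why the energy head takes `256 * 4 * 4` features. The sequential version below mirrors the class above:

```python
import torch
import torch.nn as nn

# Mirrors ConvEBM: three stride-2 convs halve 32x32 three times -> 4x4
conv_ebm = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 16x16 -> 8x8
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), # 8x8 -> 4x4
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(256 * 4 * 4, 512),
    nn.LeakyReLU(0.2),
    nn.Linear(512, 1),
)

x = torch.randn(8, 3, 32, 32)  # a batch of 8 random "images"
energy = conv_ebm(x)
print(energy.shape)            # one scalar energy per image
```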
3. Joint Energy Models
JEM (Joint Energy Models) combines classification and generative modeling:
class JointEnergyModel(nn.Module):
    """
    JEM: Combines classifier and EBM in a single framework.
    Can both classify and detect out-of-distribution samples.
    """
    def __init__(self, num_classes, backbone=None, feature_dim=256):
        super().__init__()
        # Use provided backbone or create default (assumes 32x32 RGB inputs)
        if backbone is None:
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(128, 256, 4, stride=2, padding=1),
                nn.ReLU(),
                nn.Flatten(),
                nn.Linear(256 * 4 * 4, feature_dim),
            )
        else:
            self.backbone = backbone
        # Classification head. In JEM the energy is derived directly
        # from the classifier logits, so no separate energy head is needed.
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, x, return_energy=False):
        """
        Forward pass returns classification logits and,
        optionally, the marginal energy E(x).
        """
        features = self.backbone(x)
        logits = self.classifier(features)
        if return_energy:
            return logits, self.marginal_energy(logits)
        return logits

    def marginal_energy(self, logits):
        """
        Compute E(x) = -logsumexp_y f(x)[y].
        Low energy corresponds to inputs to which the classifier
        assigns high total unnormalized likelihood.
        """
        return -logits.logsumexp(dim=1)

    def classify(self, x):
        """Standard classification."""
        return self.forward(x).argmax(dim=1)

    def detect_ood(self, x, threshold=None):
        """
        Detect out-of-distribution samples.
        OOD samples tend to have higher energy.
        """
        _, energy = self.forward(x, return_energy=True)
        if threshold is None:
            # Return raw energy scores
            return energy
        return energy > threshold
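The key JEM identity is that any classifier's logits already define an energy, E(x) = -LogSumExp_y f(x)[y]. A standalone sketch of that scoring rule on made-up logits:

```python
import torch

def jem_energy(logits):
    """JEM marginal energy: E(x) = -logsumexp_y f(x)[y]."""
    return -logits.logsumexp(dim=1)

# A confidently classified input vs. one producing flat, uncertain logits
confident = torch.tensor([[10.0, 0.0, 0.0]])
uncertain = torch.tensor([[0.1, 0.0, 0.05]])

print(jem_energy(confident), jem_energy(uncertain))
# The confident sample gets lower (better) energy, which is why
# thresholding this score is useful for OOD detection.
```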
Training Methods
1. Contrastive Divergence
The classic training algorithm for RBMs:
def train_rbm(rbm, data_loader, epochs=10, lr=0.01):
    """Train RBM using Contrastive Divergence (k is set on the RBM)."""
    for epoch in range(epochs):
        total_error = 0
        for batch in data_loader:
            v0 = batch[0]  # Visible units

            # CD-k update
            rbm.contrastive_divergence(v0, lr=lr)

            # Track reconstruction error as a training proxy
            h0_prob, _ = rbm.sample_h_given_v(v0)
            v1_prob, _ = rbm.sample_v_given_h(
                torch.bernoulli(h0_prob)
            )
            error = ((v0 - v1_prob) ** 2).sum(dim=1).mean()
            total_error += error.item()
        print(f"Epoch {epoch}: Error = {total_error / len(data_loader):.4f}")
2. Maximum Likelihood Estimation
For modern EBMs, we maximize log-likelihood:
$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}[\nabla_\theta E_\theta(x')]$$
The gradient has two terms:
- Positive phase: lower the energy of data samples
- Negative phase: raise the energy of model samples
import math

class EnergyBasedModelTrainer:
    """
    Trainer for modern EBMs using approximate MLE.
    """
    def __init__(self, model, lr=0.0001):
        self.model = model
        self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    def train_step(self, real_data, num_negative_samples=100):
        """
        Single training step on the energy-difference objective.
        """
        self.model.train()
        self.optimizer.zero_grad()

        # Positive phase: push down energy of real data
        real_energy = self.model(real_data)

        # Negative phase: push up energy of model samples.
        # Negative samples can come from:
        #   1. Langevin dynamics sampling
        #   2. Other MCMC methods
        #   3. Auxiliary generator networks
        negative_samples = self.generate_samples(
            real_data.shape,
            num_samples=num_negative_samples
        )
        fake_energy = self.model(negative_samples.detach())

        # Energy difference loss (surrogate for the MLE gradient)
        loss = real_energy.mean() - fake_energy.mean()
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def generate_samples(self, shape, num_samples=100,
                         steps=10, step_size=0.01):
        """
        Generate negative samples using short-run Langevin dynamics.
        """
        device = next(self.model.parameters()).device
        samples = torch.randn(num_samples, *shape[1:], device=device)

        for _ in range(steps):
            # Re-leaf the tensor each step (Markov chain, small graph)
            samples = samples.detach().requires_grad_(True)
            energy = self.model(samples)
            grad = torch.autograd.grad(energy.sum(), samples)[0]
            # Langevin update
            noise = torch.randn_like(samples) * math.sqrt(2 * step_size)
            samples = samples - step_size * grad + noise
        return samples.detach()
3. Score Matching
An alternative objective that avoids the partition function entirely:
def score_matching_loss(model, x, sigma=1.0):
    """
    Denoising score matching loss for EBMs.

    With y = x + σε, ε ~ N(0, I), minimizing
        E[|| ∇_y E(y) - (y - x)/σ² ||²]
    matches the model score -∇_y E(y) to the score of the
    noise-perturbed data distribution.
    """
    # Add Gaussian noise
    noise = torch.randn_like(x) * sigma
    y = (x + noise).detach().requires_grad_(True)

    # Compute energy and its gradient
    energy = model(y)
    grad = torch.autograd.grad(
        energy.sum(), y, create_graph=True
    )[0]

    # Target for the energy gradient: (y - x)/σ² = ε/σ
    target = noise / (sigma ** 2)
    loss = ((grad - target) ** 2).sum(dim=1).mean()
    return loss
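A minimal end-to-end check of this loss, using a small MLP energy on 2-D toy data (the architecture and training hyperparameters here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def score_matching_loss(model, x, sigma=1.0):
    noise = torch.randn_like(x) * sigma
    y = (x + noise).detach().requires_grad_(True)
    energy = model(y)
    grad = torch.autograd.grad(energy.sum(), y, create_graph=True)[0]
    # Gradient of E should match (y - x) / sigma^2
    target = noise / (sigma ** 2)
    return ((grad - target) ** 2).sum(dim=1).mean()

x = torch.randn(128, 2)  # toy "data"
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

loss0 = score_matching_loss(model, x)
for _ in range(300):
    opt.zero_grad()
    loss = score_matching_loss(model, x)
    loss.backward()
    opt.step()
print(loss0.item(), loss.item())  # the loss should decrease
```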
Sampling Methods
Langevin Dynamics
The foundation of modern EBM sampling:
import math

def langevin_dynamics(energy_fn, initial_samples,
                      num_steps=100, step_size=0.01,
                      noise_scale=None):
    """
    Sample from an EBM using Langevin dynamics.

    x_{t+1} = x_t - step_size * ∇_x E(x_t) + noise

    The noise term (proportional to sqrt(2 * step_size))
    ensures proper exploration.
    """
    if noise_scale is None:
        noise_scale = math.sqrt(2 * step_size)

    samples = initial_samples.clone()
    for _ in range(num_steps):
        # Detach so each step starts a fresh graph (Markov chain)
        samples = samples.detach().requires_grad_(True)
        # Compute energy gradient
        energy = energy_fn(samples)
        grad = torch.autograd.grad(energy.sum(), samples)[0]
        # Gradient step plus injected noise
        noise = torch.randn_like(samples) * noise_scale
        samples = samples - step_size * grad + noise
    return samples.detach()
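As a correctness check, sampling from the quadratic energy E(x) = ½‖x‖² should recover a standard normal, since p(x) ∝ exp(-E(x)). A self-contained run of the same update rule:

```python
import math
import torch

torch.manual_seed(0)

def quadratic_energy(x):
    # E(x) = 0.5 * ||x||^2  =>  p(x) ∝ exp(-E(x)) is N(0, I)
    return 0.5 * (x ** 2).sum(dim=1)

samples = torch.zeros(2000, 2)
step_size = 0.01
noise_scale = math.sqrt(2 * step_size)

for _ in range(500):
    samples = samples.detach().requires_grad_(True)
    grad = torch.autograd.grad(quadratic_energy(samples).sum(), samples)[0]
    samples = samples - step_size * grad + noise_scale * torch.randn_like(samples)

samples = samples.detach()
print(samples.mean().item(), samples.std().item())  # ≈ 0 and ≈ 1
```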
class ModernEBMGenerator:
    """
    Complete sampling pipeline for modern EBMs.
    """
    def __init__(self, model, num_steps=60,
                 step_size=0.01, noise_scale=0.007):
        self.model = model
        self.num_steps = num_steps
        self.step_size = step_size
        self.noise_scale = noise_scale

    def generate(self, batch_size, shape, device='cuda'):
        """Generate samples from the EBM.

        Note: this must NOT run under torch.no_grad(), because
        Langevin dynamics needs gradients with respect to the input.
        """
        # Initialize from noise
        samples = torch.randn(batch_size, *shape, device=device)

        # Langevin dynamics
        for _ in range(self.num_steps):
            samples = samples.detach().requires_grad_(True)
            energy = self.model(samples)
            grad = torch.autograd.grad(energy.sum(), samples)[0]
            # Update with noise
            noise = torch.randn_like(samples) * self.noise_scale
            samples = samples - self.step_size * grad + noise
        return samples.detach().clamp(-1, 1)
Hamiltonian Monte Carlo
More efficient for high-dimensional spaces:
def hmc_sampler(energy_fn, initial_state, step_size=0.01, num_leapfrog=10):
    """
    One step of Hamiltonian Monte Carlo sampling.
    Uses auxiliary momentum variables to propose distant samples.
    """
    q = initial_state.detach().clone().requires_grad_(True)
    # Sample momentum
    p = torch.randn_like(q)
    # Current Hamiltonian: potential + kinetic energy
    current_energy = (energy_fn(q).sum() + 0.5 * (p ** 2).sum()).detach()

    def grad_energy(q):
        return torch.autograd.grad(energy_fn(q).sum(), q)[0]

    # Leapfrog integration: half step for momentum,
    # alternating full steps for position and momentum
    p = p - 0.5 * step_size * grad_energy(q)
    for i in range(num_leapfrog):
        q = (q + step_size * p).detach().requires_grad_(True)
        if i < num_leapfrog - 1:
            p = p - step_size * grad_energy(q)
    p = p - 0.5 * step_size * grad_energy(q)

    # Negate momentum for reversibility
    p = -p
    # Proposed Hamiltonian
    new_energy = (energy_fn(q).sum() + 0.5 * (p ** 2).sum()).detach()

    # Metropolis acceptance
    if torch.rand(1) < torch.exp(current_energy - new_energy):
        return q.detach()
    return initial_state
Practical Applications
1. Out-of-Distribution Detection
EBMs naturally excel at detecting OOD samples:
class OODDetector:
    """
    Out-of-distribution detection using EBM energy.
    """
    def __init__(self, model):
        self.model = model
        self.model.eval()

    def compute_energy(self, x):
        """Compute energy scores for inputs."""
        with torch.no_grad():
            energy = self.model(x)
        return energy

    def detect_ood(self, test_loader, ood_loader,
                   threshold_percentile=95):
        """
        Detect OOD samples based on an energy threshold.
        """
        # Get in-distribution energies
        id_energies = []
        for batch in test_loader:
            id_energies.append(self.compute_energy(batch[0]))
        id_energies = torch.cat(id_energies)

        # Get OOD energies
        ood_energies = []
        for batch in ood_loader:
            ood_energies.append(self.compute_energy(batch[0]))
        ood_energies = torch.cat(ood_energies)

        # Set threshold at the given ID percentile
        threshold = torch.quantile(id_energies,
                                   threshold_percentile / 100)

        # Compute detection metrics
        id_below = (id_energies < threshold).float().mean()
        ood_below = (ood_energies < threshold).float().mean()
        # Fraction of OOD mistaken for ID at ~95% ID retention
        fpr95 = ood_below.item()

        return {
            'threshold': threshold.item(),
            'id_detection_rate': id_below.item(),
            'fpr95': fpr95,
            'ood_detection_rate': 1 - ood_below.item(),
        }
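The thresholding logic is worth seeing in isolation. With synthetic ID and OOD energy scores (the Gaussian means here are made up purely for illustration):

```python
import torch

torch.manual_seed(0)

# Synthetic energy scores: in-distribution lower, OOD higher
id_energies = torch.randn(1000)        # centered at 0
ood_energies = torch.randn(1000) + 4.0 # centered at 4

# Threshold at the 95th percentile of ID energies
threshold = torch.quantile(id_energies, 0.95)

tpr = (id_energies < threshold).float().mean()   # ID correctly kept
fpr = (ood_energies < threshold).float().mean()  # OOD mistaken for ID
print(threshold.item(), tpr.item(), fpr.item())
```

By construction roughly 95% of ID samples fall below the threshold, while well-separated OOD samples rarely do.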
2. Constrained Generation
EBMs naturally incorporate constraints:
class ConstrainedEBM(nn.Module):
    """
    EBM with constraints encoded as energy penalties.
    """
    def __init__(self, base_model, constraints):
        super().__init__()
        self.base_model = base_model
        self.constraints = constraints

    def forward(self, x):
        base_energy = self.base_model(x)
        # Add constraint energy: a weighted linear penalty on each
        # violation. Large weights approximate hard constraints.
        constraint_energy = 0
        for constraint_fn, weight in self.constraints:
            violation = constraint_fn(x)
            constraint_energy = constraint_energy + weight * torch.clamp(violation, min=0)
        return base_energy + constraint_energy

def create_constraints():
    """Example per-sample constraint functions."""
    constraints = [
        # Constraint 1: components sum to 1 (probability simplex)
        (lambda x: (x.sum(dim=1) - 1).abs(), 10.0),
        # Constraint 2: non-negative values
        (lambda x: torch.clamp(-x, min=0).sum(dim=1), 10.0),
        # Constraint 3: L2 norm <= 1
        (lambda x: torch.clamp(x.norm(dim=1) - 1, min=0), 5.0),
    ]
    return constraints
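A quick check that penalties of this shape behave as intended: a point satisfying every constraint should receive zero extra energy, while a violating point is penalized. A self-contained sketch using the same three constraints:

```python
import torch

constraints = [
    # Components sum to 1
    (lambda x: (x.sum(dim=1) - 1).abs(), 10.0),
    # Non-negative components
    (lambda x: torch.clamp(-x, min=0).sum(dim=1), 10.0),
    # L2 norm at most 1
    (lambda x: torch.clamp(x.norm(dim=1) - 1, min=0), 5.0),
]

def constraint_energy(x):
    """Total weighted penalty for constraint violations."""
    total = torch.zeros(x.size(0))
    for fn, weight in constraints:
        total = total + weight * torch.clamp(fn(x), min=0)
    return total

valid = torch.tensor([[0.5, 0.5]])     # sums to 1, non-negative, norm <= 1
invalid = torch.tensor([[2.0, -1.5]])  # violates all three constraints
print(constraint_energy(valid), constraint_energy(invalid))
```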
3. Image Generation
Modern EBMs compete with GANs and diffusion models:
class ResidualBlock(nn.Module):
    """Simple residual MLP block (PyTorch has no built-in residual layer)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)

class ImageEBM(nn.Module):
    """
    EBM over flattened images (shown here for 32x32 RGB).
    """
    def __init__(self, input_dim=32 * 32 * 3):
        super().__init__()
        # Residual network for energy computation
        self.network = nn.Sequential(
            nn.Linear(input_dim, 512),
            ResidualBlock(512),
            ResidualBlock(512),
            ResidualBlock(512),
            nn.Linear(512, 1)
        )

    def forward(self, x):
        return self.network(x)

def train_image_ebm(dataloader, epochs=100):
    """Training pipeline for an image EBM with an amortized generator."""
    ebm = ImageEBM().cuda()
    gen = nn.Sequential(
        nn.Linear(256, 512),
        ResidualBlock(512),
        nn.Linear(512, 32 * 32 * 3),
    ).cuda()
    # The EBM and the generator have opposing objectives,
    # so each gets its own optimizer.
    ebm_opt = torch.optim.Adam(ebm.parameters(), lr=1e-4)
    gen_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)

    for epoch in range(epochs):
        for batch in dataloader:
            real_images = batch.cuda().flatten(start_dim=1)
            batch_size = real_images.size(0)

            # Generate fake images
            z = torch.randn(batch_size, 256).cuda()
            fake_images = gen(z).tanh()

            # EBM step: lower energy on real data, raise it on fakes
            real_energy = ebm(real_images)
            fake_energy = ebm(fake_images.detach())
            ebm_loss = (real_energy - fake_energy).mean()
            # Alternative: hinge loss
            # ebm_loss = real_energy.mean() + torch.relu(1.0 - fake_energy).mean()
            ebm_opt.zero_grad()
            ebm_loss.backward()
            ebm_opt.step()

            # Generator step: produce samples the EBM assigns low energy
            gen_loss = ebm(fake_images).mean()
            gen_opt.zero_grad()
            gen_loss.backward()
            gen_opt.step()
    return ebm, gen
Advanced Topics
1. Equivariant EBMs
Incorporating symmetries into energy functions:
class EquivariantEBM(nn.Module):
    """
    EBM made invariant by averaging over a symmetry group.
    """
    def __init__(self, group, hidden_dim=64):
        super().__init__()
        self.group = group  # e.g., a discrete rotation group exposing .elements()
        # Underlying (non-invariant) energy network over 2D points
        self.base_energy = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def apply_group_action(self, x, g):
        """Apply group transformation g to input x."""
        # For the rotation group SO(2), g is an angle:
        angle = g
        cos_a, sin_a = torch.cos(angle), torch.sin(angle)
        # Rotate 2D coordinates
        x_rot = x[..., 0] * cos_a + x[..., 1] * sin_a
        y_rot = -x[..., 0] * sin_a + x[..., 1] * cos_a
        return torch.stack([x_rot, y_rot], dim=-1)

    def invariant_energy(self, x):
        """Average the base energy over the group orbit."""
        energies = []
        for g in self.group.elements():
            x_g = self.apply_group_action(x, g)
            energies.append(self.base_energy(x_g))
        return torch.stack(energies).mean(dim=0)
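For a discrete group the orbit average is exactly invariant. We can verify this with the four-element rotation group C4 and a deliberately non-invariant base energy (the linear energy here is a hypothetical stand-in for a learned network):

```python
import math
import torch

def rotate(x, angle):
    """Rotate 2-D points (same convention as apply_group_action above)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    x_rot = x[..., 0] * cos_a + x[..., 1] * sin_a
    y_rot = -x[..., 0] * sin_a + x[..., 1] * cos_a
    return torch.stack([x_rot, y_rot], dim=-1)

def base_energy(x):
    # Deliberately NOT rotation-invariant
    return 2.0 * x[..., 0] + 3.0 * x[..., 1]

def invariant_energy(x):
    # Average over the cyclic group C4 = {0, 90, 180, 270 degrees}
    angles = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
    return torch.stack([base_energy(rotate(x, a)) for a in angles]).mean(dim=0)

x = torch.tensor([[1.0, 2.0]])
x_rot = rotate(x, math.pi / 2)
print(invariant_energy(x), invariant_energy(x_rot))  # equal: the average is invariant
```

Because the group is closed under composition, averaging E over the orbit of x gives the same value for every rotation of x.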
2. Latent Variable EBMs
Combining VAE-like latents with EBM framework:
class LatentEBM(nn.Module):
    """
    EBM with learned latent variables.
    """
    def __init__(self, data_dim, latent_dim, hidden_dim=256):
        super().__init__()
        # Encoder: x -> (mean, logvar) of q(z|x)
        self.encoder = nn.Sequential(
            nn.Linear(data_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim * 2)  # mean, logvar
        )
        # Energy function: E(x, z)
        self.energy = nn.Sequential(
            nn.Linear(data_dim + latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def sample_latent(self, x):
        """Sample latent from the encoder (reparameterization trick)."""
        params = self.encoder(x)
        mean, logvar = params.chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        z = mean + std * torch.randn_like(std)
        return z, mean, logvar

    def forward(self, x, z):
        """Compute joint energy E(x, z)."""
        return self.energy(torch.cat([x, z], dim=-1))

    def energy_likelihood(self, x, z):
        """Approximate E(x) = -log ∫ exp(-E(x, z)) dz.

        A single latent sample is a crude estimate; proper inference
        requires marginalizing over z (e.g., importance sampling).
        """
        return self.forward(x, z)
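The marginal (free) energy E(x) = -log ∫ exp(-E(x, z)) dz can be estimated by Monte Carlo over latent samples via log-mean-exp. A sketch using the prior N(0, I) as the proposal (all dimensions and names here are illustrative):

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

data_dim, latent_dim = 4, 2
joint_energy = nn.Sequential(
    nn.Linear(data_dim + latent_dim, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

def marginal_energy(x, num_z=256):
    """E(x) ≈ -log (1/K) Σ_k exp(-E(x, z_k)),  z_k ~ N(0, I).

    Monte Carlo estimate of the free energy with the prior as proposal.
    """
    batch = x.size(0)
    z = torch.randn(num_z, batch, latent_dim)
    x_rep = x.unsqueeze(0).expand(num_z, batch, data_dim)
    e = joint_energy(torch.cat([x_rep, z], dim=-1)).squeeze(-1)  # (K, B)
    # log-mean-exp for numerical stability
    return -(torch.logsumexp(-e, dim=0) - math.log(num_z))

x = torch.randn(3, data_dim)
fe = marginal_energy(x)
print(fe)  # one free-energy value per input
```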
Performance Considerations
Memory-Efficient Training
class MemoryEfficientEBM(nn.Module):
    """
    EBM wrapper with gradient checkpointing for memory efficiency.
    """
    def __init__(self, model):
        super().__init__()
        self.model = model

    def energy_with_checkpoint(self, x):
        """Compute energy with gradient checkpointing."""
        return torch.utils.checkpoint.checkpoint(
            self.model, x, use_reentrant=False
        )

    def langevin_gradients(self, x, num_steps=10,
                           step_size=0.01, noise_scale=0.007):
        """Langevin dynamics that records the sampling trajectory.

        Detaching at each step keeps the autograd graph small,
        so memory use stays constant in the number of steps.
        """
        trajectory = [x.detach().clone()]
        for _ in range(num_steps):
            # Re-leaf the tensor so graphs don't accumulate across steps
            x = x.detach().requires_grad_(True)
            energy = self.model(x)
            grad = torch.autograd.grad(energy.sum(), x)[0]
            x = x - step_size * grad + noise_scale * torch.randn_like(x)
            trajectory.append(x.detach().clone())
        return trajectory
Best Practices
- Start with Simple Architectures: Begin with small models and increase complexity gradually.
- Careful MCMC Tuning: Langevin dynamics parameters (step size, noise scale, number of steps) critically affect performance.
- Monitor Energy Statistics: Even though we don’t compute the partition function directly, track energy statistics (such as the real/fake energy gap) during training.
- Use an Appropriate Loss: Hinge losses often train more stably than raw energy differences.
- Combine with Other Models: EBMs work well as components in larger hybrid systems.
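The loss recommendation above in code: a margin-based hinge loss only pushes fake energies up until they clear a margin, which avoids the unbounded drift that raw energy differences can cause. A sketch (the margin value is illustrative):

```python
import torch

def hinge_ebm_loss(real_energy, fake_energy, margin=1.0):
    """Push real energy down; push fake energy up only until it exceeds `margin`."""
    return real_energy.mean() + torch.relu(margin - fake_energy).mean()

real_e = torch.tensor([0.2, -0.1, 0.0])
fake_low = torch.tensor([0.1, 0.2, 0.0])   # fakes not yet separated: penalized
fake_high = torch.tensor([5.0, 6.0, 7.0])  # fakes already past the margin: no push

print(hinge_ebm_loss(real_e, fake_low), hinge_ebm_loss(real_e, fake_high))
```

Once fake energies exceed the margin, the loss reduces to the real-energy term alone, so the model stops wasting capacity driving fake energies toward infinity.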
Future Directions in 2026
Emerging Applications
Foundation Models as EBMs: Treating large pre-trained models as energy functions for constrained generation.
Neuro-Symbolic AI: Using EBMs to encode logical constraints in neural systems.
Scientific Simulation: Energy-based models for molecular dynamics and physical systems.
Bayesian Deep Learning: EBMs as implicit priors in Bayesian neural networks.
Robotics: Energy functions for planning and control in robotics.
Resources
- Energy-Based Models (LeCun et al., 2006)
- JEM: Joint Energy Models
- Generative Models as Energy-Based Models (Du & Mordatch, 2019)
- EBM GitHub Resources
Conclusion
Energy-based models provide a powerful, flexible framework that complements probabilistic approaches. By focusing on energy landscapes rather than explicit probabilities, EBMs offer unique advantages: natural constraint handling, compositionality, and avoidance of partition function complications during training.
The renaissance of EBMs in recent years—driven by improved sampling techniques and their natural fit for modern deep learning architectures—positions them as a crucial tool in the AI practitioner’s toolkit. Whether for out-of-distribution detection, constrained generation, or building systems that understand physical constraints, energy-based models offer an elegant solution framework.
As research continues, expect to see EBMs increasingly integrated into production AI systems, particularly as we demand more from our models than simple prediction—requiring instead a deep understanding of what makes configurations valid, safe, and useful.