Skip to main content

Reinforcement Learning Algorithms: From Q-Learning to Deep Q-Networks

Published: March 16, 2026 Updated: May 25, 2026 Larry Qu 14 min read

Introduction

Reinforcement Learning (RL) represents a fundamentally different paradigm from supervised and unsupervised learning. Instead of learning from labeled data or finding patterns in unlabeled data, RL agents learn through interaction with an environment, receiving rewards for their actions and learning to maximize cumulative reward over time. In 2026, RL has achieved remarkable successes, from beating world champions in complex games to robots that learn to walk and manipulate objects, to systems that optimize computer networks and financial portfolios.

The key insight of reinforcement learning is that intelligence emerges from the interaction between an agent and its environment. An agent takes actions, the environment responds with new states and rewards, and the agent learns from this feedback loop to improve its behavior. This learning paradigm mirrors how humans and animals learn through trial and error, making RL one of the most general and ambitious forms of machine learning.

The Reinforcement Learning Framework

Agent-Environment Interaction

The RL problem is formalized as a Markov Decision Process (MDP). At each time step, the agent observes the current state s_t from the environment, takes an action a_t, and receives a reward r_t and the next state s_{t+1}. This process continues until a terminal state is reached or for a fixed number of steps.

The agent’s behavior is defined by a policy π(a|s), which maps states to probability distributions over actions. The goal is to find a policy that maximizes the expected cumulative reward, often called the return. The return is typically discounted to prioritize immediate rewards over distant ones: G_t = r_t + γr_{t+1} + γ²r_{t+2} + …

Value Functions and Bellman Equations

Central to RL are value functions that estimate how good states or state-action pairs are. The state-value function V^π(s) represents the expected return when starting from state s and following policy π. The action-value function Q^π(s,a) represents the expected return when taking action a from state s and then following policy π.

The Bellman equations provide recursive relationships for value functions:

V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a)[r + γV^π(s')]
Q^π(s,a) = Σ_{s',r} p(s',r|s,a)[r + γ Σ_{a'} π(a'|s') Q^π(s',a')]

These equations are the foundation for many RL algorithms, expressing how values propagate backward from rewards through the state space.

Tabular Learning Methods

Q-Learning

Q-Learning is one of the most fundamental RL algorithms, learning optimal action values through temporal-difference updates. For each state-action pair, Q-learning maintains an estimate Q(s,a) and updates it based on the observed reward and the best possible future value:

Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]

Where α is the learning rate and γ is the discount factor. This update gradually moves Q values toward the optimal action values, eventually learning the optimal policy regardless of the exploration strategy used.

Q-learning is guaranteed to converge to optimal values given sufficient exploration and appropriate learning rate decay. However, it struggles with large state spaces where representing Q for every state-action pair becomes impractical.

SARSA

SARSA (State-Action-Reward-State-Action) is an on-policy algorithm that learns Q-values for the current policy rather than the optimal policy:

Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]

Where a’ is the action actually taken in state s’. This makes SARSA more conservative than Q-learning, as it accounts for the exploration being performed. SARSA is particularly useful when exploration costs are high or when learning policies that account for exploration noise.

Deep Q-Networks (DQN)

From Tables to Neural Networks

When state spaces are large or continuous, tabular Q-learning becomes impractical. Deep Q-Networks (DQN) address this by using deep neural networks to approximate Q-values. The network takes states as input and outputs Q-values for all actions.

The key innovation of DQN is experience replay and target networks. Experience replay stores transitions (s, a, r, s’) in a replay buffer and samples batches for training, breaking temporal correlation in the data. The target network provides stable targets for the Bellman equation, updated periodically rather than at every step.

The DQN loss is:

L = E[(r + γ max_{a'} Q_target(s',a') - Q_online(s,a))²]

Improvements to DQN

Several improvements have enhanced DQN’s performance and stability. Double DQN addresses overestimation by using two networks to select and evaluate actions. Dueling DQN decomposes Q-values into state-value and advantage functions, allowing better estimation of state values independently of actions. Prioritized Experience Replay samples important transitions more frequently based on TD error.

Rainbow DQN combines multiple improvements: double DQN, prioritized replay, dueling networks, distributional RL, and noisy networks. This combination achieved state-of-the-art results on the Atari benchmark.

Policy Gradient Methods

REINFORCE

Policy gradient methods directly optimize the policy without explicitly estimating value functions. REINFORCE updates policy parameters in the direction of gradient ascent on expected return:

∇J(θ) = E[∇_θ log π_θ(a|s) · G_t]
```sql

Where π_θ is the parameterized policy. This gradient tells us how to adjust parameters to increase the probability of good actions. The algorithm is simple but has high variance; returns can vary widely across episodes, making learning unstable.

### Actor-Critic Methods

Actor-critic methods combine policy gradients with value function approximation to reduce variance. The actor updates the policy parameters, while the critic estimates value functions to provide baselines. Common approaches include A2C/A3C (Advantage Actor-Critic) and PPO (Proximal Policy Optimization).

A2C uses an advantage function A(s,a) = Q(s,a) - V(s) instead of raw returns, reducing variance while maintaining unbiased gradient estimates. The advantage measures how much better an action is compared to the average.

### Proximal Policy Optimization (PPO)

PPO has become one of the most popular RL algorithms due to its simplicity and effectiveness. It constrains policy updates to prevent destructive large changes:

```text
L^CLIP(θ) = E[min(r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t)]

Where r_t(θ) is the probability ratio between new and old policies, and ε is a small hyperparameter (typically 0.2). This clipped objective prevents the policy from changing too dramatically while still allowing learning.

PPO has succeeded in continuous control tasks, robotics, and has been foundational in large-scale RL applications like OpenAI’s ChatGPT training.

Advanced RL Topics

Multi-Agent Reinforcement Learning

When multiple agents interact, the environment becomes non-stationary from any single agent’s perspective. Multi-agent RL (MARL) introduces additional challenges: coordination, competition, and emergent behaviors. Topics include cooperative MARL, competitive MARL (like in game-playing), and mixed settings.

Hierarchical RL

Hierarchical RL decomposes complex tasks into subtasks, learning at multiple temporal scales. Options framework defines temporally extended actions (options) that can be composed. This approach enables faster learning and transfer across tasks.

Meta-Learning in RL

Meta-learning for RL focuses on learning to learn—acquiring knowledge that can quickly adapt to new tasks. Model-Agnostic Meta-Learning (MAML) learns initial parameters that can be quickly fine-tuned with few gradient steps on new tasks. This is crucial for sample-efficient RL in real-world applications.

Offline Reinforcement Learning

Offline (or batch) RL learns from fixed datasets without interaction. This is practical for applications where online exploration is dangerous or expensive. Conservative Q-Learning (CQL) and decision transformers have shown promise in offline settings, learning effective policies from static datasets.

Applications of Reinforcement Learning

Game Playing

RL has achieved superhuman performance in complex games. AlphaGo combined tree search with neural networks to beat world champions in Go. OpenAI Five learned to play Dota 2 through self-play. These systems demonstrate RL’s ability to handle enormous state and action spaces through clever combination of learning and search.

Robotics

RL enables robots to learn complex manipulation skills, locomotion, and navigation without explicit programming. Robots can learn to grasp novel objects, perform surgical maneuvers, and navigate complex environments. Sim-to-real transfer (training in simulation, deploying in reality) has made robotic RL practical.

Resource Management

RL optimizes computing resources, including job scheduling in data centers, network routing, and power grid management. These systems must make sequential decisions under uncertainty, a natural fit for RL.

Finance

In algorithmic trading and portfolio management, RL learns trading strategies that adapt to market conditions. While challenging due to non-stationarity and risk considerations, RL offers a data-driven approach to financial optimization.

Implementing RL Algorithms

Basic Q-Learning Implementation

A simple tabular Q-learning implementation:

import numpy as np

class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.epsilon = epsilon  # exploration rate
    
    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.q_table[state]))
        return np.argmax(self.q_table[state])
    
    def learn(self, state, action, reward, next_state):
        best_next = np.max(self.q_table[next_state])
        td_target = reward + self.gamma * best_next
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error
```python

### DQN with PyTorch

A basic DQN implementation uses a neural network to approximate Q-values:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )
    
    def forward(self, x):
        return self.network(x)

Markov Decision Processes Formalized

MDP Definition

A Markov Decision Process is defined by the tuple (S, A, P, R, gamma) where:

  • S is the set of states
  • A is the set of actions
  • P(s’ | s, a) is the transition probability to state s’ given state s and action a
  • R(s, a) is the immediate reward function
  • gamma in [0, 1] is the discount factor

The Markov property states that the future is independent of the past given the present: P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_1, a_1, …, s_t, a_t). This memoryless property makes MDPs tractable for planning and learning.

Returns and Episodes

The return G_t is the total discounted reward from time t:

G_t = R_{t+1} + gamma R_{t+2} + gamma^2 R_{t+3} + ... = Sum_{k=0}^{inf} gamma^k R_{t+k+1}

For episodic tasks (finite horizon), gamma = 1 works. For continuing tasks, gamma < 1 ensures finite returns and prioritizes near-term rewards.

The Bellman Equations

Bellman Expectation Equation

The state-value function V^pi(s) satisfies a recursive relationship:

V^pi(s) = E_pi[R_{t+1} + gamma V^pi(S_{t+1}) | S_t = s]
       = Sum_a pi(a|s) Sum_{s',r} p(s',r|s,a)[r + gamma V^pi(s')]

Similarly for the action-value function:

Q^pi(s,a) = E_pi[R_{t+1} + gamma Q^pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

Bellman Optimality Equation

The optimal value functions satisfy the Bellman optimality equations:

V*(s) = max_a E[R_{t+1} + gamma V*(S_{t+1}) | S_t = s, A_t = a]
Q*(s,a) = E[R_{t+1} + gamma max_a' Q*(S_{t+1}, a') | S_t = s, A_t = a]

These recursive equations form the foundation of value-based RL algorithms. Q-learning directly approximates the optimal action-value function using sampled transitions.

Q-learning vs SARSA

Algorithm Comparison

The key difference: Q-learning is off-policy (learns the optimal policy regardless of exploration) while SARSA is on-policy (learns the value of the current behavior policy).

# Q-learning update (off-policy)
def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    best_next = np.max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

# SARSA update (on-policy)
def sarsa_update(q, state, action, reward, next_state, next_action, alpha, gamma):
    q[state][action] += alpha * (reward + gamma * q[next_state][next_action] - q[state][action])
Aspect Q-learning SARSA
Policy type Off-policy On-policy
Learned function Optimal Q* Current policy Q^pi
Exploration handling Ignores in update Accounts for exploration
Convergence To optimal (given conditions) To current policy value
Safety May learn risky paths More conservative
Cliff walking May fall off cliff (optimal) Takes safer path

When to Use Each

Q-learning is preferred when exploration can be controlled separately and optimality is the goal. SARSA is better when exploration noise matters, such as in robotics where exploration mistakes have real costs.

Policy Gradient Methods

REINFORCE Implementation

The REINFORCE algorithm updates policy parameters using the full return. While high-variance, it provides unbiased gradient estimates:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """Simple policy network for discrete action spaces."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=-1)

def reinforce(env, policy, optimizer, episodes=1000, gamma=0.99):
    """REINFORCE training loop."""
    for episode in range(episodes):
        states, actions, rewards = [], [], []
        state, done = env.reset(), False

        while not done:
            action_probs = policy(state)
            action = torch.multinomial(action_probs, 1).item()
            next_state, reward, done, _ = env.step(action)
            states.append(state); actions.append(action); rewards.append(reward)
            state = next_state

        # Compute discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)

        # Policy gradient update
        optimizer.zero_grad()
        loss = 0
        for t, (s, a, G_t) in enumerate(zip(states, actions, returns)):
            probs = policy(torch.FloatTensor(s))
            log_prob = torch.log(probs[a])
            loss += -log_prob * G_t
        loss.backward()
        optimizer.step()

Variance Reduction

REINFORCE suffers from high variance because returns vary widely. A baseline reduces variance while maintaining unbiasedness:

# With baseline: subtract state value from return
def reinforce_with_baseline(policy, value_net, optimizer_p, optimizer_v,
                            states, actions, rewards, gamma=0.99):
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Update value network (baseline)
    values = value_net(torch.stack(states)).squeeze()
    v_loss = F.mse_loss(values, returns)
    optimizer_v.zero_grad()
    v_loss.backward()
    optimizer_v.step()

    # Update policy with advantage
    advantages = returns - values.detach()
    policy_loss = 0
    for t, (s, a, adv) in enumerate(zip(states, actions, advantages)):
        probs = policy(s)
        log_prob = torch.log(probs[a])
        policy_loss += -log_prob * adv

    optimizer_p.zero_grad()
    policy_loss.backward()
    optimizer_p.step()

Actor-Critic Methods

A2C Implementation

Actor-Critic combines policy gradients (actor) with value function approximation (critic). The critic provides a low-variance baseline:

class ActorCritic(nn.Module):
    """Shared network for Actor-Critic with dual heads."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.actor = nn.Linear(hidden_dim, action_dim)
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        features = self.shared(state)
        action_probs = F.softmax(self.actor(features), dim=-1)
        value = self.critic(features)
        return action_probs, value

def a2c_train_step(model, optimizer, state, action, reward, next_state, done, gamma=0.99):
    """Single A2C training step."""
    action_probs, value = model(state)
    _, next_value = model(next_state)

    # TD target
    td_target = reward + gamma * next_value * (1 - done)
    advantage = td_target - value

    # Actor loss (policy gradient)
    log_prob = torch.log(action_probs.squeeze()[action])
    actor_loss = -log_prob * advantage.detach()

    # Critic loss (value function)
    critic_loss = F.mse_loss(value, td_target.detach())

    # Entropy bonus (encourages exploration)
    entropy = -(action_probs * torch.log(action_probs + 1e-8)).sum()

    total_loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

A3C (Asynchronous Advantage Actor-Critic) runs multiple workers in parallel, each interacting with its own copy of the environment. Workers asynchronously update a shared model, providing diverse experience and breaking temporal correlations. A2C is the synchronous version, often more sample-efficient due to synchronized updates.

Proximal Policy Optimization (PPO)

PPO constrains policy updates to prevent destructive large changes. The clipped surrogate objective ensures monotonic improvement:

class PPO:
    """Proximal Policy Optimization implementation."""

    def __init__(self, state_dim, action_dim, lr=3e-4, clip_epsilon=0.2):
        self.policy = ActorCritic(state_dim, action_dim)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.clip_epsilon = clip_epsilon

    def update(self, states, actions, advantages, returns, old_log_probs):
        """PPO update with clipped objective."""
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        advantages = torch.FloatTensor(advantages)
        returns = torch.FloatTensor(returns)
        old_log_probs = torch.FloatTensor(old_log_probs)

        for _ in range(10):  # Multiple epochs
            action_probs, values = self.policy(states)
            dist = torch.distributions.Categorical(action_probs)
            new_log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

            # Ratio of new to old policy
            ratio = (new_log_probs - old_log_probs).exp()

            # Clipped surrogate objective
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.clip_epsilon,
                                1 + self.clip_epsilon) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()

            # Combined loss
            critic_loss = F.mse_loss(values.squeeze(), returns)
            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy

            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 0.5)
            self.optimizer.step()

PPO’s clipped objective prevents the policy from changing too drastically, making training stable without complex trust-region optimization. This simplicity has made PPO the default choice for many RL applications.

Exploration vs Exploitation Strategies

Epsilon-Greedy

The simplest exploration strategy: with probability epsilon, take a random action; otherwise, take the greedy action. Common schedules start epsilon high (1.0) and decay to low (0.01).

def epsilon_greedy(q_values, epsilon):
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))
    return np.argmax(q_values)

Upper Confidence Bound (UCB)

UCB selects actions with optimistic exploration bonuses:

A_t = argmax_a [Q_t(a) + c * sqrt(ln t / N_t(a))]

Where N_t(a) counts how many times action a has been selected. The bonus shrinks as actions are explored more, ensuring each action is eventually tried sufficiently.

Thompson Sampling

Thompson sampling maintains a belief distribution over Q-values and samples from it before selecting actions. This probabilistic approach naturally balances exploration and exploitation and often outperforms UCB in practice.

OpenAI Gym Example

import gym
import numpy as np

# Create environment
env = gym.make('CartPole-v1')
agent = QLearningAgent(env.observation_space.n, env.action_space.n)

# Training
for episode in range(500):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state

    if episode % 50 == 0:
        print(f"Episode {episode}, Reward: {total_reward}")

# For continuous state spaces, use DQN
env = gym.make('LunarLander-v2')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
dqn = DQN(state_dim, action_dim)

Resources

Conclusion

Reinforcement learning has matured into a powerful paradigm for sequential decision-making. From fundamental algorithms like Q-learning to modern methods like PPO and offline RL, the field offers tools for solving problems where optimal behavior must be learned through interaction. As RL algorithms become more sample-efficient and reliable, their adoption in robotics, resource management, and AI systems will continue to grow.

Comments

👍 Was this article helpful?