Recurrent Neural Networks and LSTM: Processing Sequential Data

Introduction

Recurrent Neural Networks (RNNs) are a foundational architecture for processing sequential data such as text, time series, audio, and video. Unlike feedforward networks, which process each input independently, RNNs maintain a hidden state that carries information about previous inputs. This allows them to model temporal dependencies and variable-length sequences. In 2026, while transformers dominate many NLP tasks, RNNs and their variants (LSTM, GRU) remain important for applications where step-by-step processing, memory efficiency, or strictly causal (past-only) modeling is critical.

The key challenge with RNNs is learning long-term dependencies. The vanishing gradient problem makes it difficult for standard RNNs to connect information separated by many time steps. Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU) mitigate this through gating mechanisms that control information flow.

Basic Recurrent Neural Networks

The RNN Equation

At each timestep t, an RNN computes:

h_t = tanh(W_ih * x_t + b_ih + W_hh * h_{t-1} + b_hh)
y_t = W_ho * h_t + b_ho

Where h_t is the hidden state, x_t is the input, and the weight matrices connect input-to-hidden, hidden-to-hidden (recurrent), and hidden-to-output. The hidden state acts as a “memory” of the network.
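As a quick check of the update rule above, the following sketch computes one hidden-state update by hand and compares it with PyTorch's `nn.RNNCell`, which stores the same four parameters as `weight_ih`, `weight_hh`, `bias_ih`, and `bias_hh`. The sizes are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, batch = 3, 4, 2   # illustrative sizes
cell = nn.RNNCell(input_size, hidden_size)  # uses tanh by default

x_t = torch.randn(batch, input_size)        # input at timestep t
h_prev = torch.zeros(batch, hidden_size)    # h_{t-1}

# h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh), written out by hand
h_manual = torch.tanh(
    x_t @ cell.weight_ih.T + cell.bias_ih
    + h_prev @ cell.weight_hh.T + cell.bias_hh
)

h_builtin = cell(x_t, h_prev)
print(torch.allclose(h_manual, h_builtin, atol=1e-6))
```

The output projection y_t is not part of `nn.RNNCell`; in practice it is a separate `nn.Linear` applied to h_t.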

Processing Sequences

For a sequence input x = (x_1, x_2, …, x_T), the RNN processes one element at a time, updating the hidden state at each step. This is conceptually similar to processing a tape—the network can only see current input and its memory of previous steps.

Different output configurations suit different tasks: one-to-one (classification), one-to-many (image captioning), many-to-one (sentiment analysis), and many-to-many (machine translation).
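The difference between many-to-many and many-to-one usually comes down to which slice of the RNN output is kept. A minimal sketch with `nn.RNN`, using made-up tensor sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)        # (batch, seq_len, features)

out, h_n = rnn(x)                # out: (4, 10, 16), h_n: (1, 4, 16)

many_to_many = out               # one hidden vector per timestep (e.g. tagging)
many_to_one = out[:, -1, :]      # only the last timestep (e.g. sentiment)

# For a single-layer, unidirectional RNN the last output equals the final hidden state
print(torch.allclose(many_to_one, h_n[-1]))
```

This shortcut assumes all sequences in the batch have the same length; with padded batches the last valid timestep differs per sequence.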

The Vanishing Gradient Problem

Training RNNs is challenging because gradients propagate backward through time. Each timestep multiplies the gradient by the recurrent weight matrix and the activation derivative. Over many timesteps (the unrolled network behaves like a very deep one), gradients can shrink exponentially (vanish) or grow exponentially (explode).

When gradients vanish, earlier timesteps receive tiny updates, making it very hard to learn long-range dependencies. Information from the distant past effectively disappears from the training signal.

Long Short-Term Memory (LSTM)

LSTM Architecture

LSTM introduces gating mechanisms to control information flow. The key components:

Forget Gate: decides what information to discard from the cell state.

f_t = σ(W_f * [h_{t-1}, x_t] + b_f)

Input Gate: decides what new information to store in the cell state.

i_t = σ(W_i * [h_{t-1}, x_t] + b_i)

Cell Update: creates candidate values to add to the cell state.

C̃_t = tanh(W_C * [h_{t-1}, x_t] + b_C)

Cell State: the “long-term memory,” updated by combining the old state (filtered by the forget gate) with the new candidate information. Between two vectors, * denotes element-wise multiplication.

C_t = f_t * C_{t-1} + i_t * C̃_t

Output Gate: decides what parts of the cell state to output.

o_t = σ(W_o * [h_{t-1}, x_t] + b_o)

Hidden State: the cell state, squashed by tanh and filtered by the output gate.

h_t = o_t * tanh(C_t)

Why LSTMs Work

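The cell state acts as a highway for gradient flow. Because C_t depends on C_{t-1} through an element-wise multiplication by the forget gate rather than through a weight matrix and a squashing nonlinearity, gradients pass along it with little attenuation whenever the forget gate stays close to 1. This greatly reduces the vanishing gradient problem and lets LSTMs learn dependencies spanning hundreds of steps in practice.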

The gates control information flow dynamically: some inputs update memory significantly; others leave it unchanged. This selective memory allows LSTMs to remember important information for as long as needed.
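A small arithmetic sketch of the highway effect: along the direct cell-state path, the derivative of C_t with respect to C_{t-1} is just f_t, so the gradient over many steps is a product of forget-gate values. This treats the gates as constants (it ignores their indirect dependence on earlier states), and the helper name `cell_path_gradient` is hypothetical.

```python
def cell_path_gradient(forget_gates):
    """Product of forget-gate activations along the direct C_{t-1} -> C_t path."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

steps = 100
open_gate = cell_path_gradient([0.99] * steps)   # gate almost fully open: about 0.37
half_gate = cell_path_gradient([0.5] * steps)    # gate half closed: about 8e-31

print(open_gate, half_gate)
```

With the gate near 1 a useful fraction of the gradient survives 100 steps; with the gate at 0.5 it is effectively gone, which is the network choosing to forget.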

Gated Recurrent Units (GRU)

Simplified Gating

GRU combines the forget and input gates into a single update gate:

Update Gate: controls how much of the previous hidden state is replaced by the candidate (1 - z_t is the fraction carried forward).

z_t = σ(W_z * [h_{t-1}, x_t])

Reset Gate: determines how much past information to ignore when forming the candidate.

r_t = σ(W_r * [h_{t-1}, x_t])

Candidate Hidden: creates a candidate for the new hidden state. Between two vectors, * denotes element-wise multiplication.

h̃_t = tanh(W * [r_t * h_{t-1}, x_t])

Hidden State: interpolates between the previous and candidate hidden states.

h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t

GRU vs LSTM

GRU has fewer parameters (2 gates vs. 3 in LSTM) and is often faster to train. Performance is similar; either can work better depending on the task. For small datasets, fewer parameters may reduce overfitting.

Training Recurrent Networks

Backpropagation Through Time (BPTT)

Training RNNs uses BPTT: unroll the network through time, compute gradients at each step, then sum. This requires storing all intermediate states, consuming significant memory.

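Truncated BPTT limits how far back gradients propagate: the sequence is processed in fixed-length chunks, the hidden state is carried across chunks, but gradients stop at chunk boundaries. This trades some long-term modeling for efficiency and is the usual choice for very long sequences, where full BPTT would not fit in memory.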
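One common way to implement truncated BPTT in PyTorch is to detach the hidden state between chunks. The sketch below uses random placeholder data and an arbitrary chunk length of 20; the model and sizes are illustrative assumptions, not a recommended configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

long_x = torch.randn(2, 100, 4)   # placeholder inputs: (batch, seq_len, features)
long_y = torch.randn(2, 100, 1)   # placeholder per-step targets
chunk_len = 20

state = None                      # None lets nn.LSTM start from zeros
losses = []
for start in range(0, long_x.size(1), chunk_len):
    x_chunk = long_x[:, start:start + chunk_len]
    y_chunk = long_y[:, start:start + chunk_len]

    out, state = model(x_chunk, state)
    loss = nn.functional.mse_loss(head(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep the state values but cut the graph, so the next chunk's
    # backward pass stops at this boundary
    state = tuple(s.detach() for s in state)
    losses.append(loss.item())
```

Memory use is bounded by the chunk length rather than the full sequence length, at the cost of never propagating gradients across more than 20 steps.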

Handling Variable-Length Sequences

Padding brings sequences of different lengths to a common length so they can be batched together. Masking tells the network to ignore padded positions in loss computation and pooling. Packed sequences (`pack_padded_sequence` in PyTorch) go one step further and let the RNN skip padded positions entirely.
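The sketch below walks through padding, packing, and masking with the utilities in `torch.nn.utils.rnn`. The three random sequences and layer sizes are placeholders.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three sequences of different lengths, each with 3 features per step
seqs = [torch.randn(5, 3), torch.randn(2, 3), torch.randn(4, 3)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)            # (3, 5, 3), zero padded
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)      # no need to pre-sort

lstm = nn.LSTM(3, 6, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Boolean mask of valid positions, useful for masked losses or pooling
mask = torch.arange(out.size(1))[None, :] < lengths[:, None]
```

With packing, `h_n` holds the state at each sequence's last valid step rather than at the padded end, which is usually what a classifier wants.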

Bidirectional Processing

Bidirectional RNNs process sequences in both directions, combining forward and backward hidden states. This provides context from both past and future, improving accuracy for tasks like named entity recognition where complete context matters.

Applications

Natural Language Processing

Before transformers, LSTMs powered NLP: machine translation, language modeling, text generation. While largely superseded by transformers, LSTMs remain relevant for resource-constrained applications and when sequential processing order matters.

Time Series Forecasting

LSTMs predict future values in financial data, sensor readings, and weather. Their ability to model temporal dependencies makes them suitable for forecasting tasks where patterns span multiple timesteps.
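Forecasting models are typically trained on sliding windows: each example is a run of past values paired with the value to predict. A minimal sketch, where `make_windows`, `window`, and `horizon` are hypothetical names and the readings are made up:

```python
def make_windows(series, window, horizon=1):
    """Split a series into (past window, future value) training pairs."""
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        inputs.append(series[start:start + window])
        targets.append(series[start + window + horizon - 1])
    return inputs, targets

readings = [10, 11, 13, 12, 15, 16, 18]
X, y = make_windows(readings, window=3)

print(X[0], y[0])   # [10, 11, 13] 12
print(len(X))       # 4
```

The windows can then be stacked into a tensor of shape (samples, timesteps, features) before being fed to an LSTM.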

Speech Recognition

Deep speech recognition systems used LSTMs for acoustic modeling, converting audio features to text. Connectionist temporal classification (CTC) enabled training without frame-level alignments.

Music Generation

Music generation uses LSTMs to predict notes in sequence. The hierarchical temporal structure of music (notes, measures, phrases) maps naturally to recurrent processing.

Modern RNN Variants

Attention Mechanisms in RNNs

Adding attention to sequence-to-sequence models was a key breakthrough before transformers. Attention lets the decoder access all encoder hidden states, weighting them by relevance. This improved translation and enabled handling longer sequences.

Layer Normalization

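Layer normalization normalizes activations across the feature dimension at each timestep, independently for every example, which stabilizes training. It has largely replaced batch normalization in RNNs because its statistics do not depend on the batch composition or on sequence lengths, which vary across batches.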
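A simplified sketch of layer normalization in a recurrent loop: here `nn.LayerNorm` is applied to the hidden state produced by a `nn.GRUCell` at every step. Published layer-normalized RNNs usually normalize the pre-activations inside the cell instead, so this placement is an assumption made to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 16
cell = nn.GRUCell(8, hidden_size)
norm = nn.LayerNorm(hidden_size)      # normalizes over the feature dimension

x = torch.randn(4, 12, 8)             # (batch, seq_len, features)
h = torch.zeros(4, hidden_size)
for t in range(x.size(1)):
    # Statistics are computed per example and per timestep,
    # so batch size and sequence length do not matter
    h = norm(cell(x[:, t], h))

print(h.mean(dim=-1))                 # close to 0 for every example
```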

Regularization

Dropout applied to recurrent connections helps reduce overfitting. Variational dropout samples one dropout mask per sequence and reuses it at every timestep, so the same units are dropped throughout, maintaining temporal consistency.
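The idea of a shared mask can be sketched as follows: sample one mask per sequence and apply it to the recurrent input at every step. The helper `variational_dropout_mask` is a hypothetical name, and a `nn.GRUCell` stands in for whatever recurrent cell is used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def variational_dropout_mask(batch, hidden_size, p):
    """One Bernoulli keep-mask per sequence, rescaled (inverted dropout)."""
    keep = 1.0 - p
    return torch.bernoulli(torch.full((batch, hidden_size), keep)) / keep

batch, seq_len, hidden_size = 2, 6, 8
cell = nn.GRUCell(4, hidden_size)
x = torch.randn(batch, seq_len, 4)

mask = variational_dropout_mask(batch, hidden_size, p=0.25)  # sampled once
h = torch.zeros(batch, hidden_size)
for t in range(seq_len):
    # The same mask is reused at every timestep of the sequence
    h = cell(x[:, t], h * mask)
```

At evaluation time the mask is simply omitted, as with standard dropout.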

Implementing RNNs

LSTM Implementation in PyTorch

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, 
            hidden_dim, 
            num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.3  # applied between stacked layers; no effect when num_layers == 1
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
    
    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out: (batch, seq_len, hidden_dim * 2)
        
        # Use final hidden states from both directions
        forward_hidden = hidden[-2, :, :]
        backward_hidden = hidden[-1, :, :]
        combined = torch.cat((forward_hidden, backward_hidden), dim=1)
        
        output = self.fc(combined)
        return output

GRU Implementation

class GRUModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super().__init__()
        self.gru = nn.GRU(
            input_dim, 
            hidden_dim, 
            num_layers,
            batch_first=True,
            dropout=0.2
        )
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        output, hidden = self.gru(x)
        # output: (batch, seq_len, hidden_dim)
        # hidden: (num_layers, batch, hidden_dim)
        
        # Use final hidden state
        final_hidden = hidden[-1, :, :]
        prediction = self.fc(final_hidden)
        return prediction

RNNs vs Transformers

Complementary Strengths

Transformers dominate many NLP tasks due to parallel processing (all tokens are attended to simultaneously), stronger modeling of long-range dependencies, and large benefits from scaling. RNNs process tokens sequentially, which is slower to train because timesteps cannot be parallelized, but leaves a single compact state that can be easier to inspect.

RNNs retain advantages: a fixed-size hidden state, so inference memory does not grow with sequence length (self-attention cost grows quadratically with sequence length), inherent handling of causal structure, and sometimes better performance with limited data. For streaming data and real-time processing, RNNs remain practical.
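The streaming property can be checked directly: feeding a sequence one step at a time while carrying the hidden state gives the same outputs as processing it in one call. A minimal sketch with a `nn.GRU` and random placeholder input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

gru = nn.GRU(input_size=5, hidden_size=7, batch_first=True)
gru.eval()

stream = torch.randn(1, 50, 5)            # (batch, seq_len, features)

with torch.no_grad():
    full_out, _ = gru(stream)             # whole sequence at once

    h = None                              # the only state kept between steps
    step_outs = []
    for t in range(stream.size(1)):
        out_t, h = gru(stream[:, t:t + 1], h)
        step_outs.append(out_t)
    streamed_out = torch.cat(step_outs, dim=1)

print(torch.allclose(full_out, streamed_out, atol=1e-5))
```

Only `h`, a tensor of shape (num_layers, batch, hidden_size), has to be stored between steps, regardless of how long the stream runs.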

Modern Use Cases

RNNs are a good fit for online/incremental prediction, resource-constrained deployment, tasks requiring strict ordering, and cases where a compact, inspectable state matters. Some production systems use hybrid approaches: transformer encoders with RNN decoders, or small fine-tuned RNNs for specific tasks.

Vanishing and Exploding Gradients: Mathematical Analysis

Why Gradients Vanish

During backpropagation through time, the gradient reaching step k involves repeated multiplication by the recurrent weight matrix W_hh. Writing a_t = W_ih * x_t + b_ih + W_hh * h_{t-1} + b_hh, so that h_t = tanh(a_t):

∂L/∂h_k = ∂L/∂h_T * ∏_{t=k+1}^{T} diag(tanh'(a_t)) * W_hh

The product of T-k Jacobians decays exponentially if the largest singular value of W_hh is below 1, and can grow exponentially if it is above 1. Because the tanh derivative is at most 1, the activation can only shrink the gradient further at each timestep.
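The decay is easy to see in the scalar case, where the Jacobian product reduces to a product of numbers. The sketch below uses a one-unit RNN h_t = tanh(w * h_{t-1} + x) with an arbitrary weight of 0.9; the function name and default values are illustrative.

```python
import math

def gradient_through_time(w, steps, h0=0.5, x=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = tanh'(a_t) * w, with tanh'(a) = 1 - tanh(a)^2
        grad *= w * (1.0 - h * h)
    return grad

for steps in (5, 20, 50):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

Each factor is at most 0.9, so the gradient after n steps is bounded by 0.9^n and shrinks geometrically.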

Gradient Clipping

# Clip gradients to prevent explosion
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Or clip by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)

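Clipping by norm rescales the whole gradient when its norm exceeds the threshold, preventing extreme updates while preserving direction. Clipping by value clamps each component independently, which can change the direction. Combined with careful initialization of the recurrent weights (orthogonal or identity), clipping makes training deep RNNs feasible.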
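To see norm clipping in action, the sketch below inflates a loss on purpose, then compares the gradient norm before and after the call. `clip_grad_norm_` returns the total norm measured before clipping; the model and the factor of 1000 are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.RNN(4, 8, batch_first=True)
x = torch.randn(2, 30, 4)

out, _ = model(x)
loss = (out ** 2).sum() * 1000.0      # deliberately large loss
loss.backward()

max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

# Recompute the global norm over all parameter gradients after clipping
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))

print(float(norm_before), float(norm_after))
```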

LSTM Cell Implementation

From Scratch

class LSTMCell(nn.Module):
    """LSTM cell implemented from scratch."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Combined gates for efficiency
        self.W_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size) * 0.01)
        self.W_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, x, state):
        h_prev, c_prev = state

        # Compute all gates at once
        gates = (x @ self.W_ih.T + h_prev @ self.W_hh.T + self.bias)

        # Split into four gates
        i_gate, f_gate, g_gate, o_gate = gates.chunk(4, dim=-1)

        # Apply activations
        i = torch.sigmoid(i_gate)  # Input gate
        f = torch.sigmoid(f_gate)  # Forget gate
        g = torch.tanh(g_gate)     # Cell candidate
        o = torch.sigmoid(o_gate)  # Output gate

        # Update cell state and hidden state
        c = f * c_prev + i * g
        h = o * torch.tanh(c)

        return h, (h, c)

Peephole LSTM Variant

Peephole connections allow gates to observe the cell state directly:

class PeepholeLSTMCell(nn.Module):
    """LSTM with peephole connections."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.W_i = nn.Linear(input_size + hidden_size + hidden_size, hidden_size)
        self.W_f = nn.Linear(input_size + hidden_size + hidden_size, hidden_size)
        self.W_o = nn.Linear(input_size + hidden_size + hidden_size, hidden_size)
        self.W_c = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, state):
        h_prev, c_prev = state
        combined = torch.cat([x, h_prev], dim=-1)

        # Peephole: concat cell state to gate inputs
        i = torch.sigmoid(self.W_i(torch.cat([combined, c_prev], dim=-1)))
        f = torch.sigmoid(self.W_f(torch.cat([combined, c_prev], dim=-1)))
        c = f * c_prev + i * torch.tanh(self.W_c(combined))
        o = torch.sigmoid(self.W_o(torch.cat([combined, c], dim=-1)))
        h = o * torch.tanh(c)

        return h, (h, c)

GRU Cell Implementation

From Scratch

class GRUCell(nn.Module):
    """GRU cell from scratch."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_ir = nn.Linear(input_size, 2 * hidden_size)
        self.W_hr = nn.Linear(hidden_size, 2 * hidden_size)
        self.W_in = nn.Linear(input_size, hidden_size)
        self.W_hn = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, h_prev):
        # Gates
        r_z = torch.sigmoid(self.W_ir(x) + self.W_hr(h_prev))
        r, z = r_z.chunk(2, dim=-1)

        # Candidate hidden state
        n = torch.tanh(self.W_in(x) + r * self.W_hn(h_prev))

        # Update hidden state
        h = (1 - z) * h_prev + z * n
        return h

GRU vs LSTM: Detailed Comparison

Aspect	LSTM	GRU
Gates	3 (forget, input, output)	2 (reset, update)
Parameters (excluding biases)	4(hidden^2 + hidden*input)	3(hidden^2 + hidden*input)
Cell state	Separate memory cell	No separate cell
Gradient flow	Through cell state directly	Through hidden state
Computational cost	Higher	Lower
Expressive power	More flexible	Similar in practice

For small datasets, GRU often generalizes better due to fewer parameters. For very long sequences, LSTM’s explicit cell state may be more reliable.
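The 4:3 parameter ratio from the table can be confirmed by counting the parameters of PyTorch's built-in layers. PyTorch keeps two bias vectors per gate block (`bias_ih` and `bias_hh`), so the count below adds 2 * hidden to the per-block formula; the sizes are arbitrary.

```python
import torch.nn as nn

def count_params(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 10, 20
lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

# Weights for one gate block plus PyTorch's two bias vectors
per_block = hidden_size * hidden_size + hidden_size * input_size + 2 * hidden_size

print(count_params(lstm), 4 * per_block)   # 2560 2560
print(count_params(gru), 3 * per_block)    # 1920 1920
```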

Bidirectional RNNs

Bidirectional RNNs process sequences in both directions simultaneously, capturing future and past context:

class BiLSTM(nn.Module):
    """Bidirectional LSTM."""

    def __init__(self, input_size, hidden_size, num_layers=2):
        super().__init__()
        self.forward_lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.backward_lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, 1)

    def forward(self, x):
        # Forward pass
        fwd_out, _ = self.forward_lstm(x)

        # Backward pass: reverse sequence
        rev_x = torch.flip(x, dims=[1])
        bwd_out, _ = self.backward_lstm(rev_x)
        bwd_out = torch.flip(bwd_out, dims=[1])

        # Concatenate forward and backward
        combined = torch.cat([fwd_out, bwd_out], dim=-1)
        return self.fc(combined)

Bidirectional processing is standard for NER, POS tagging, and any task where complete context improves accuracy. For causal tasks (prediction, generation), only forward direction is valid.

Seq2Seq with Attention

The encoder-decoder architecture with attention enables handling variable-length input and output sequences:

import torch.nn.functional as F

class Seq2SeqAttention(nn.Module):
    """Seq2Seq model with additive attention.

    Assumes a GRU-style encoder and decoder: `encoder(src)` returns
    (outputs: (batch, src_len, hidden), hidden: (num_layers, batch, hidden)),
    `trg` is already embedded as (batch, trg_len, emb_dim), and the decoder
    accepts inputs of size emb_dim + hidden_size.
    """

    def __init__(self, encoder, decoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.attention = nn.Linear(hidden_size * 2, 1)
        self.hidden_size = hidden_size

    def forward(self, src, trg):
        # Encode source sequence
        encoder_outputs, hidden = self.encoder(src)
        src_len = encoder_outputs.shape[1]

        # Decode with attention (teacher forcing: feed the ground-truth step)
        outputs = []
        decoder_input = trg[:, 0:1]

        for t in range(1, trg.shape[1]):
            # Top-layer decoder state: (batch, 1, hidden)
            decoder_state = hidden[-1].unsqueeze(1)

            # Score each encoder state: (batch, src_len, 1)
            energy = torch.tanh(self.attention(
                torch.cat([decoder_state.expand(-1, src_len, -1),
                          encoder_outputs], dim=-1)
            ))
            attention_weights = F.softmax(energy.squeeze(-1), dim=1)

            # Context vector: (batch, 1, hidden)
            context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)

            # Decode one step
            decoder_input = torch.cat([decoder_input, context], dim=-1)
            output, hidden = self.decoder(decoder_input, hidden)
            outputs.append(output)
            decoder_input = trg[:, t:t+1]

        # (batch, trg_len - 1, output_dim)
        return torch.cat(outputs, dim=1)

Time Series Forecasting with LSTM

class TimeSeriesLSTM(nn.Module):
    """LSTM for multivariate time series forecasting."""

    def __init__(self, n_features, n_hidden=64, n_layers=2, n_output=1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, n_layers,
                           batch_first=True, dropout=0.2)
        self.regressor = nn.Sequential(
            nn.Linear(n_hidden, 32),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(32, n_output)
        )

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        last_out = lstm_out[:, -1, :]
        return self.regressor(last_out)

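# Training example (random placeholder data; substitute real windows and targets)
train_x = torch.randn(256, 30, 5)  # (samples, timesteps, features)
train_y = torch.randn(256, 1)      # (samples, n_output)

model = TimeSeriesLSTM(n_features=5, n_output=1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

model.train()
for epoch in range(100):
    optimizer.zero_grad()
    pred = model(train_x)
    loss = criterion(pred, train_y)
    loss.backward()
    # Clip the gradient norm to guard against exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()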

Speech and Audio Processing

Connectionist Temporal Classification (CTC) enables training sequence models without frame-level alignments:

import torch.nn.functional as F

class SpeechLSTM(nn.Module):
    """LSTM with CTC for speech recognition."""

    def __init__(self, n_mels=80, n_hidden=256, n_classes=29):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, n_hidden, 4,
                           batch_first=True, bidirectional=True, dropout=0.3)
        self.classifier = nn.Linear(n_hidden * 2, n_classes)

    def forward(self, spectrogram):
        out, _ = self.lstm(spectrogram)
        logits = self.classifier(out)
        return F.log_softmax(logits, dim=-1)

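# CTC loss sums over all valid alignments, so no frame-level labels are needed.
# nn.CTCLoss expects log-probabilities shaped (time, batch, classes), so apply
# .transpose(0, 1) to the batch-first model output before passing it in.
criterion = nn.CTCLoss(blank=0)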
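A minimal sketch of how the CTC loss is called, using random logits in place of a real acoustic model. The batch size, frame count, and sequence lengths are placeholders; `zero_infinity=True` is an optional safeguard for cases where a target is too long for its input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

batch, frames, n_classes = 2, 50, 29      # class 0 is reserved for the CTC blank

# Stand-in for batch-first model output: (batch, frames, classes)
logits = torch.randn(batch, frames, n_classes, requires_grad=True)
log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (frames, batch, classes)

targets = torch.randint(1, n_classes, (batch, 12))  # padded label ids, never blank
input_lengths = torch.tensor([50, 40])              # valid frames per utterance
target_lengths = torch.tensor([12, 9])              # valid labels per utterance

criterion = nn.CTCLoss(blank=0, zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()

print(loss.item())
```

The length tensors are what allow padded audio and padded transcripts to share a batch: positions beyond each length are ignored.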

Conclusion

Recurrent Neural Networks and their LSTM/GRU variants established deep learning for sequential data. Understanding sequential processing, hidden state management, and gating mechanisms provides essential foundations for modern sequence modeling. While transformers have become dominant for many NLP tasks, RNNs remain important for specific applications and continue to evolve in hybrid architectures.
