Introduction
Recurrent Neural Networks (RNNs) are a fundamental architecture for processing sequential data: text, time series, audio, and video. Unlike feedforward networks, which process each input independently, RNNs maintain a hidden state that captures information about previous inputs. This allows them to model temporal dependencies and variable-length sequences. In 2026, while transformers dominate many NLP tasks, RNNs and their variants (LSTM, GRU) remain important for applications where sequential processing, memory efficiency, or causal reasoning is critical.
The key challenge with RNNs is learning long-term dependencies. The vanishing gradient problem makes it difficult for standard RNNs to connect information separated by many time steps. Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) solve this through gating mechanisms that control information flow.
Basic Recurrent Neural Networks
The RNN Equation
At each timestep t, an RNN computes:
h_t = tanh(W_ih * x_t + b_ih + W_hh * h_{t-1} + b_hh)
y_t = W_ho * h_t + b_ho
Where h_t is the hidden state, x_t is the input, and the weight matrices connect input-to-hidden, hidden-to-hidden (recurrent), and hidden-to-output. The hidden state acts as a “memory” of the network.
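As a minimal sketch, the update above can be written directly in PyTorch; the dimensions here are arbitrary toy values, not anything prescribed by the equations:

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 4, 3  # toy sizes chosen for illustration
W_ih = torch.randn(hidden_dim, input_dim)   # input-to-hidden
W_hh = torch.randn(hidden_dim, hidden_dim)  # hidden-to-hidden (recurrent)
b_ih = torch.zeros(hidden_dim)
b_hh = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_ih * x_t + b_ih + W_hh * h_{t-1} + b_hh)
    return torch.tanh(W_ih @ x_t + b_ih + W_hh @ h_prev + b_hh)

h = torch.zeros(hidden_dim)             # initial hidden state
for x_t in torch.randn(5, input_dim):   # a sequence of 5 inputs
    h = rnn_step(x_t, h)                # memory carried across steps
```

Because tanh squashes activations into (-1, 1), the hidden state stays bounded no matter how long the sequence runs.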
Processing Sequences
For a sequence input x = (x_1, x_2, …, x_T), the RNN processes one element at a time, updating the hidden state at each step. This is conceptually similar to processing a tape: the network can only see the current input and its memory of previous steps.
Different output configurations suit different tasks: one-to-one (classification), one-to-many (image captioning), many-to-one (sentiment analysis), and many-to-many (machine translation).
The Vanishing Gradient Problem
Training RNNs is challenging because gradients propagate backward through time. Each timestep multiplies gradients by weight matrices. With many layers (timesteps), gradients exponentially shrink (vanish) or grow (explode).
When gradients vanish, earlier timesteps receive tiny updates, making it impossible to learn long-range dependencies. Information from distant past effectively disappears from the hidden state.
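This shrinkage is easy to see numerically. The toy experiment below (arbitrary sizes, deliberately small recurrent weights) backpropagates through 50 repeated applications of the same matrix and checks the gradient reaching the first timestep:

```python
import torch

torch.manual_seed(0)
T = 50                            # number of timesteps
W = torch.randn(8, 8) * 0.1       # deliberately small recurrent weights
h0 = torch.randn(8, requires_grad=True)

h = h0
for _ in range(T):
    h = torch.tanh(W @ h)         # same matrix applied at every step

h.sum().backward()
# Each step multiplies the gradient by W (and tanh' <= 1),
# so after 50 steps the gradient at h0 has all but disappeared
print(h0.grad.norm())
```

With weights scaled the other way (large spectral norm), the same loop produces exploding rather than vanishing gradients.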
Long Short-Term Memory (LSTM)
LSTM Architecture
LSTM introduces gating mechanisms to control information flow. The key components:
Forget Gate: f_t = σ(W_f * [h_{t-1}, x_t] + b_f) Decides what information to discard from the cell state.
Input Gate: i_t = σ(W_i * [h_{t-1}, x_t] + b_i) Decides what new information to store in the cell state.
Cell Update: C̃_t = tanh(W_C * [h_{t-1}, x_t] + b_C) Creates candidate values to add to the cell state.
Cell State: C_t = f_t * C_{t-1} + i_t * C̃_t The cell state is the “long-term memory,” updated by combining old state (filtered by forget gate) with new candidate information.
Output Gate: o_t = σ(W_o * [h_{t-1}, x_t] + b_o) Decides what parts of the cell state to output.
Hidden State: h_t = o_t * tanh(C_t) The output is filtered by the output gate.
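The six equations above translate almost line for line into code. A sketch of one LSTM step, with toy dimensions and zero biases (each gate's weight matrix acts on the concatenation [h_{t-1}, x_t]):

```python
import torch

torch.manual_seed(0)
n_in, n_h = 4, 3  # toy input and hidden sizes
# One weight matrix per gate, each acting on [h_{t-1}, x_t]
W_f, W_i, W_C, W_o = torch.randn(4, n_h, n_h + n_in)
b_f = b_i = b_C = b_o = torch.zeros(n_h)  # biases zeroed for brevity

def lstm_step(x_t, h_prev, C_prev):
    z = torch.cat([h_prev, x_t])
    f = torch.sigmoid(W_f @ z + b_f)      # forget gate
    i = torch.sigmoid(W_i @ z + b_i)      # input gate
    C_tilde = torch.tanh(W_C @ z + b_C)   # candidate cell values
    C = f * C_prev + i * C_tilde          # new cell state
    o = torch.sigmoid(W_o @ z + b_o)      # output gate
    h = o * torch.tanh(C)                 # new hidden state
    return h, C

h = torch.zeros(n_h)
C = torch.zeros(n_h)
for x_t in torch.randn(6, n_in):  # run a short sequence through the cell
    h, C = lstm_step(x_t, h, C)
```

Note that C_t is the only quantity carried forward without passing through a squashing nonlinearity, which is what makes it a gradient highway.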
Why LSTMs Work
The cell state acts as a highway for gradient flow, allowing gradients to propagate unchanged across many timesteps. This addresses the vanishing gradient problem, enabling LSTMs to learn dependencies spanning hundreds or thousands of steps.
The gates control information flow dynamically: some inputs update memory significantly; others leave it unchanged. This selective memory allows LSTMs to remember important information for as long as needed.
Gated Recurrent Units (GRU)
Simplified Gating
GRU combines the forget and input gates into a single update gate:
Update Gate: z_t = σ(W_z * [h_{t-1}, x_t]) Controls how much previous hidden state to carry forward.
Reset Gate: r_t = σ(W_r * [h_{t-1}, x_t]) Determines how much past information to ignore.
Candidate Hidden: h̃_t = tanh(W * [r_t * h_{t-1}, x_t]) Creates a candidate for new hidden state.
Hidden State: h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t Interpolates between previous and candidate hidden states.
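A sketch of one GRU step mirroring these four equations (toy dimensions, biases omitted as in the formulas above):

```python
import torch

torch.manual_seed(0)
n_in, n_h = 4, 3  # toy input and hidden sizes
W_z, W_r, W_h = torch.randn(3, n_h, n_h + n_in)

def gru_step(x_t, h_prev):
    z = torch.sigmoid(W_z @ torch.cat([h_prev, x_t]))         # update gate
    r = torch.sigmoid(W_r @ torch.cat([h_prev, x_t]))         # reset gate
    h_tilde = torch.tanh(W_h @ torch.cat([r * h_prev, x_t]))  # candidate
    return (1 - z) * h_prev + z * h_tilde                     # interpolate

h = torch.zeros(n_h)
for x_t in torch.randn(6, n_in):  # run a short sequence through the cell
    h = gru_step(x_t, h)
```

Unlike the LSTM, there is no separate cell state: the interpolation in the last line plays the role of the forget/input combination.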
GRU vs LSTM
GRU has fewer parameters (2 gates vs. 3 in LSTM) and is often faster to train. Performance is similar; either can work better depending on the task. For small datasets, fewer parameters may reduce overfitting.
Training Recurrent Networks
Backpropagation Through Time (BPTT)
Training RNNs uses BPTT: unroll the network through time, compute gradients at each step, then sum. This requires storing all intermediate states, consuming significant memory.
Truncated BPTT limits how far back gradients propagate, trading some long-term modeling for efficiency. This works well for very long sequences.
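A common way to implement truncation is to carry the hidden state across chunks but detach it from the computation graph at each boundary. A sketch with toy shapes and a stand-in loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)

long_seq = torch.randn(1, 100, 4)  # one long sequence
chunk = 20                         # truncation window
h = torch.zeros(1, 1, 8)           # (num_layers, batch, hidden)

for start in range(0, long_seq.size(1), chunk):
    out, h = rnn(long_seq[:, start:start + chunk], h)
    loss = out.pow(2).mean()       # stand-in loss for illustration
    opt.zero_grad()
    loss.backward()                # backprop only within this chunk
    opt.step()
    h = h.detach()                 # cut the graph at the chunk boundary
```

The state values still flow forward across chunks; only the gradient path is severed, capping both memory use and gradient depth at the chunk length.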
Handling Variable-Length Sequences
Sequence padding batches different-length sequences together. Masking tells the network to ignore padded positions in loss computation and pooling. Packed sequences (pack_padded_sequence in PyTorch) store variable-length batches efficiently and skip computation on the padding.
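A sketch of the pad-then-pack workflow with PyTorch's utilities, using three toy sequences of different lengths:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (
    pad_sequence, pack_padded_sequence, pad_packed_sequence
)

torch.manual_seed(0)
# Three sequences of lengths 5, 3, and 2 (feature dim 4), sorted descending
seqs = [torch.randn(5, 4), torch.randn(3, 4), torch.randn(2, 4)]
lengths = torch.tensor([5, 3, 2])

padded = pad_sequence(seqs, batch_first=True)  # (3, 5, 4), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True)

lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
packed_out, (h, c) = lstm(packed)  # LSTM skips the padded timesteps
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
# out: (3, 5, 8); positions beyond each true length are zero-filled
```

Passing a packed sequence also makes `h` the state at each sequence's true final step, rather than at a padded position.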
Bidirectional Processing
Bidirectional RNNs process sequences in both directions, combining forward and backward hidden states. This provides context from both past and future, improving accuracy for tasks like named entity recognition where complete context matters.
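In PyTorch this is a single flag; the output then concatenates the forward and backward hidden states along the feature dimension (toy shapes below):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
birnn = nn.GRU(input_size=4, hidden_size=8, batch_first=True,
               bidirectional=True)
x = torch.randn(2, 10, 4)  # (batch, seq_len, features)
out, h = birnn(x)
# out: (2, 10, 16) -- forward and backward states concatenated per step
# h:   (2, 2, 8)   -- final state of each direction
```

Each position in `out` thus sees context from both the left and the right of the sequence.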
Applications
Natural Language Processing
Before transformers, LSTMs powered NLP: machine translation, language modeling, text generation. While largely superseded by transformers, LSTMs remain relevant for resource-constrained applications and when sequential processing order matters.
Time Series Forecasting
LSTMs predict future values in financial data, sensor readings, and weather. Their ability to model temporal dependencies makes them suitable for forecasting tasks where patterns span multiple timesteps.
Speech Recognition
Deep speech recognition systems used LSTMs for acoustic modeling, converting audio features to text. Connectionist temporal classification (CTC) enabled training without frame-level alignments.
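PyTorch ships a CTC loss; a sketch of how it is invoked, with random stand-in log-probabilities in place of a real acoustic model's output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, B, C = 50, 2, 20  # input frames, batch size, classes (index 0 = blank)
log_probs = torch.randn(T, B, C).log_softmax(dim=2)  # (T, B, C) required
targets = torch.randint(1, C, (B, 10))   # label sequences, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```

CTC marginalizes over all alignments between the 50 input frames and the 10-label targets, which is what removes the need for frame-level annotations.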
Music Generation
Music generation uses LSTMs to predict notes in sequence. The hierarchical temporal structure of music (notes, measures, phrases) maps naturally to recurrent processing.
Modern RNN Variants
Attention Mechanisms in RNNs
Adding attention to sequence-to-sequence models was a key breakthrough before transformers. Attention lets the decoder access all encoder hidden states, weighting them by relevance. This improved translation and enabled handling longer sequences.
Layer Normalization
Layer normalization normalizes activations within each timestep, stabilizing training. It has largely replaced batch normalization in RNNs because sequence lengths vary across batches.
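As a simplified sketch, LayerNorm can be applied to a recurrent cell's output at every step (note this is a shortcut for illustration; published LayerNorm-LSTM variants normalize the pre-activations inside the cell):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(4, 8)   # input size 4, hidden size 8 (toy values)
ln = nn.LayerNorm(8)      # normalizes over the hidden dimension

h = torch.zeros(2, 8)                # (batch, hidden)
for x_t in torch.randn(5, 2, 4):     # iterate over timesteps
    h = ln(cell(x_t, h))             # per-timestep normalization
```

Because normalization statistics are computed per example and per timestep, nothing depends on batch composition or sequence length.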
Regularization
Dropout applied to recurrent connections prevents overfitting. Variational dropout drops the same units across timesteps, maintaining temporal consistency.
Implementing RNNs
LSTM Implementation in PyTorch
import torch
import torch.nn as nn
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.3,
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out: (batch, seq_len, hidden_dim * 2)
        # Use final hidden states from both directions
        forward_hidden = hidden[-2, :, :]
        backward_hidden = hidden[-1, :, :]
        combined = torch.cat((forward_hidden, backward_hidden), dim=1)
        output = self.fc(combined)
        return output
GRU Implementation
class GRUModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super().__init__()
        self.gru = nn.GRU(
            input_dim,
            hidden_dim,
            num_layers,
            batch_first=True,
            dropout=0.2,
        )
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        output, hidden = self.gru(x)
        # output: (batch, seq_len, hidden_dim)
        # hidden: (num_layers, batch, hidden_dim)
        # Use final hidden state
        final_hidden = hidden[-1, :, :]
        prediction = self.fc(final_hidden)
        return prediction
RNNs vs Transformers
Complementary Strengths
Transformers dominate many NLP tasks due to: parallel processing (all tokens attend to each other simultaneously), stronger modeling of long-range dependencies, and massive scaling benefits. RNNs process sequentially: slower, but often easier to interpret and deploy.
RNNs retain advantages: linear memory complexity (vs. quadratic for attention), inherent handling of causal structure, and better performance with limited data. For streaming data and real-time processing, RNNs remain practical.
Modern Use Cases
RNNs excel in: online/incremental prediction, resource-constrained deployment, tasks requiring strict ordering, and when interpretability matters. Many production systems use hybrid approaches: transformer encoders with RNN decoders, or fine-tuned RNNs for specific tasks.
Resources
- LSTM Original Paper
- Understanding LSTM
- The Unreasonable Effectiveness of RNNs
- PyTorch RNN Documentation
Conclusion
Recurrent Neural Networks and their LSTM/GRU variants established deep learning for sequential data. Understanding sequential processing, hidden state management, and gating mechanisms provides essential foundations for modern sequence modeling. While transformers have become dominant for many NLP tasks, RNNs remain important for specific applications and continue to evolve in hybrid architectures.