Introduction
The ability to process long contexts (sequences of 100K tokens or more) has become essential for modern language model applications. Document analysis, code repositories, book summarization, and extended conversations all require models that can effectively utilize information spread across long sequences. However, standard transformers face a fundamental challenge: the quadratic complexity of attention makes processing long sequences computationally prohibitive.
Efficient long-context techniques address this challenge through various strategies that reduce the computational cost of attention while preserving the model’s ability to capture long-range dependencies. These techniques range from simple modifications like sliding window attention to sophisticated approaches like dynamic hierarchical sparse attention. Understanding these techniques is essential for building applications that require long-context capabilities without excessive computational costs.
This article provides a comprehensive overview of efficient long-context strategies, covering the theoretical foundations, practical implementations, and trade-offs of each approach. By understanding these techniques, practitioners can select appropriate methods for their applications and combine multiple strategies for maximum efficiency.
The Long-Context Challenge
Standard self-attention computes interactions between all pairs of tokens, resulting in quadratic time and memory complexity with respect to sequence length. For a sequence of length n with hidden dimension d, the attention computation requires O(n²d) operations and O(nd) memory for the key-value cache. As n grows to 100K or 1M tokens, these costs become prohibitive for practical deployment.
The quadratic complexity creates several practical challenges. Inference latency grows quadratically with context length, making real-time generation impossible for long contexts. Memory requirements for the key-value cache can exceed GPU memory, limiting the maximum context length that can be processed. Training costs grow even faster, as attention must be computed for all positions in every training step.
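A back-of-the-envelope calculation makes the memory pressure concrete. The figures below assume a hypothetical dense-attention model with 32 layers, a 4096-dimensional hidden state, and 16-bit precision; real models vary, and techniques like grouped-query attention shrink the cache considerably.

```python
def kv_cache_bytes(seq_len, n_layers=32, d_model=4096, bytes_per_elem=2):
    """Size of the KV cache: two tensors (keys and values) per layer,
    each of shape (seq_len, d_model), at the given precision."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem

# The cache grows linearly with context length but is already large:
# ~0.5 GB at 1K tokens and ~52 GB at 100K tokens for this configuration,
# more than a single GPU's memory before counting weights or activations.
print(kv_cache_bytes(100_000) / 1e9)  # ~52.4 GB
```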
These challenges have motivated extensive research into efficient attention mechanisms that reduce the quadratic complexity while preserving the modeling capacity of full attention. The resulting techniques enable practical long-context applications while maintaining model quality.
Sliding Window Attention
Sliding window attention is the simplest and most widely used efficient attention mechanism. Rather than attending to all previous tokens, each token attends only to a fixed-size window of recent tokens. This restriction reduces complexity from O(n²) to O(nw), where w is the window size, enabling linear scaling with sequence length.
The key insight behind sliding window attention is that most tokens primarily attend to nearby context. In natural language, grammatical dependencies and semantic relationships are typically local, with long-range dependencies being the exception rather than the rule. By limiting attention to a local window, sliding window attention captures most of the useful interactions while dramatically reducing computation.
Standard sliding window attention uses a fixed window size across all layers. However, research has shown that different layers benefit from different window sizes. Lower layers may need larger windows to capture local patterns, while higher layers may focus on more abstract representations that require less local context. Adaptive window sizes that vary across layers can improve efficiency while maintaining quality.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingWindowAttention(nn.Module):
    """Sliding window attention with configurable window size."""

    def __init__(self, d_model, n_heads, window_size=512, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.window_size = window_size
        # Q, K, V projections
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attention_mask=None):
        """Forward pass with sliding window attention."""
        batch_size, seq_len, d_model = x.shape
        # Project and reshape to (batch, heads, seq, head_dim)
        q = self.q_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        # Sliding window mask: each position attends to itself and the
        # previous window_size - 1 positions, i.e. a causal band matrix
        ones = torch.ones(seq_len, seq_len, device=x.device)
        mask = torch.tril(ones) * torch.triu(ones, diagonal=-(self.window_size - 1))
        mask = mask.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, seq_len)
        if attention_mask is not None:
            mask = mask * attention_mask.unsqueeze(1).unsqueeze(1)
        # Apply mask and softmax
        scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        # Compute output and merge heads
        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, seq_len, d_model)
        return self.output_proj(output)

class HierarchicalSlidingWindowAttention(nn.Module):
    """Hierarchical attention with multiple window sizes."""

    def __init__(self, d_model, n_heads, window_sizes=(64, 512, 4096), dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.window_sizes = window_sizes
        # One sliding window attention module per window size
        self.attentions = nn.ModuleList([
            SlidingWindowAttention(d_model, n_heads, ws, dropout)
            for ws in window_sizes
        ])
        # Projection that combines the per-window outputs
        self.combine_proj = nn.Linear(d_model * len(window_sizes), d_model)

    def forward(self, x, attention_mask=None):
        """Forward pass with hierarchical attention."""
        outputs = [attn(x, attention_mask) for attn in self.attentions]
        # Concatenate outputs from different window sizes and project back
        combined = torch.cat(outputs, dim=-1)
        return self.combine_proj(combined)
```
The sliding window approach provides a good balance between efficiency and quality for many applications. The window size is a key hyperparameter that trades off between efficiency (smaller windows) and quality (larger windows). Typical window sizes range from 512 to 4096 tokens, depending on the application and model scale.
Hierarchical Attention
Hierarchical attention processes sequences at multiple levels of abstraction, reducing the effective sequence length at higher levels. This approach is particularly effective for documents with natural hierarchical structure, such as sections, paragraphs, and sentences.
The hierarchical approach typically works as follows. First, tokens are grouped into segments (sentences or paragraphs). Within each segment, standard or sliding window attention captures local relationships. Then, segment representations are computed (often through pooling or summarization), and attention operates over segments at the higher level. This two-level attention captures both local patterns within segments and global relationships between segments.
Hierarchical attention can be extended to multiple levels, creating a pyramid of representations with decreasing sequence length and increasing abstraction. The number of levels and the grouping at each level are design choices that depend on the input structure. For natural language, a three-level hierarchy (token, sentence, document) often works well.
The key advantage of hierarchical attention is its ability to capture very long-range dependencies without quadratic cost. By operating over compressed representations at higher levels, hierarchical attention can relate information from distant parts of the sequence that would be computationally prohibitive to connect directly.
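The two-level scheme described above can be sketched in a few lines. This is a minimal illustration, not a specific published architecture: the fixed segment size, mean pooling for segment summaries, and the residual combination of local and global context are all illustrative choices.

```python
import torch
import torch.nn as nn

class TwoLevelAttention(nn.Module):
    """Illustrative two-level hierarchical attention: full attention within
    fixed-size segments, then attention over mean-pooled segment summaries,
    broadcast back to the tokens of each segment."""

    def __init__(self, d_model, n_heads, segment_size=64):
        super().__init__()
        self.segment_size = segment_size
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.segment_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        b, n, d = x.shape
        s = self.segment_size
        assert n % s == 0, "pad the sequence to a multiple of segment_size"
        # Level 1: attention within each segment (segments folded into batch)
        segs = x.reshape(b * (n // s), s, d)
        local, _ = self.local_attn(segs, segs, segs)
        local = local.reshape(b, n, d)
        # Level 2: attention over pooled segment summaries (n // s of them)
        summaries = local.reshape(b, n // s, s, d).mean(dim=2)
        global_ctx, _ = self.segment_attn(summaries, summaries, summaries)
        # Broadcast each segment's global context back to its tokens
        global_ctx = global_ctx.repeat_interleave(s, dim=1)
        return local + global_ctx
```

With segment size s, the per-level costs are O(n·s·d) and O((n/s)²·d) rather than O(n²·d), which is the source of the savings.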
Sparse Attention Patterns
Sparse attention generalizes sliding window attention by allowing arbitrary attention patterns that skip many token pairs. Rather than attending to all tokens within a window, sparse attention selects a subset of tokens to attend to based on predefined patterns or learned importance.
Static sparse patterns define attention connectivity in advance, without considering the input content. Examples include strided patterns that attend to every k-th token, fixed sparse patterns that attend to specific positions, and dilated patterns that combine local and sparse global attention. Static patterns are simple to implement and have predictable performance characteristics.
Dynamic sparse attention selects attention connections based on input content, enabling the model to focus on the most relevant tokens for each query. Dynamic patterns can adapt to the specific input, attending to different positions for different queries. This adaptability can improve quality but requires additional computation for pattern selection.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticSparseAttention(nn.Module):
    """Static sparse attention with predefined patterns."""

    def __init__(self, d_model, n_heads, num_local=256, num_global=16, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.num_local = num_local
        self.num_global = num_global
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attention_mask=None):
        """Forward pass with static sparse attention."""
        batch_size, seq_len, d_model = x.shape
        # Project and reshape to (batch, heads, seq, head_dim)
        q = self.q_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        # Sparse pattern: a causal local window plus a few global positions
        positions = torch.arange(seq_len, device=x.device)
        offsets = positions.unsqueeze(1) - positions.unsqueeze(0)  # i - j
        causal = offsets >= 0
        local_mask = (causal & (offsets < self.num_local)).float()
        # Global positions: the first token and every 100th token, capped at num_global
        global_positions = torch.arange(0, seq_len, 100, device=x.device)[:self.num_global]
        global_mask = torch.zeros(seq_len, seq_len, device=x.device)
        global_mask[:, global_positions] = 1
        global_mask = global_mask * causal.float()  # global links stay causal
        # Combine local and global patterns
        sparse_mask = (local_mask + global_mask).clamp(max=1)
        sparse_mask = sparse_mask.unsqueeze(0).unsqueeze(0)
        if attention_mask is not None:
            sparse_mask = sparse_mask * attention_mask.unsqueeze(1).unsqueeze(1)
        # Compute attention under the sparse mask
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        scores = scores.masked_fill(sparse_mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, seq_len, d_model)
        return self.output_proj(output)

class DynamicHierarchicalSparseAttention(nn.Module):
    """Dynamic hierarchical sparse attention for on-device LLMs."""

    def __init__(self, d_model, n_heads, block_size=64, top_k=8, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.block_size = block_size
        self.top_k = top_k
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        # Importance predictor for a learned selection variant
        # (declared here but unused in this simplified forward pass)
        self.importance_net = nn.Linear(d_model, 1)

    def forward(self, x, attention_mask=None):
        """Forward pass with dynamic hierarchical sparse attention.

        If attention_mask is provided, it must cover the padded length.
        """
        batch_size, orig_len, d_model = x.shape
        # Pad the sequence dimension to a multiple of block_size
        pad_len = (-orig_len) % self.block_size
        if pad_len > 0:
            x = F.pad(x, (0, 0, 0, pad_len))
        seq_len = orig_len + pad_len
        num_blocks = seq_len // self.block_size
        # Project to (batch, seq, heads, head_dim)
        q = self.q_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim)
        v = self.v_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim)
        # Block-level summaries via mean pooling
        q_blocks = q.view(batch_size, num_blocks, self.block_size, self.n_heads, self.head_dim)
        k_blocks = k.view(batch_size, num_blocks, self.block_size, self.n_heads, self.head_dim)
        block_queries = q_blocks.mean(dim=2)  # (batch, blocks, heads, head_dim)
        block_keys = k_blocks.mean(dim=2)
        # Block-to-block importance, averaged over heads so all heads share
        # one sparse pattern (a simplification)
        importance = torch.einsum("bqhd,bkhd->bqk", block_queries, block_keys)
        importance = importance / (self.n_heads * self.head_dim ** 0.5)
        # Select the top-k most important key blocks for each query block
        top_k = min(self.top_k, num_blocks)
        _, topk_indices = torch.topk(importance, top_k, dim=-1)  # (batch, blocks, top_k)
        # Build the block-sparse mask: selected blocks plus the local block
        sparse_mask = torch.zeros(batch_size, seq_len, seq_len, device=x.device)
        for b in range(batch_size):
            for i in range(num_blocks):
                q_start, q_end = i * self.block_size, (i + 1) * self.block_size
                for block_idx in topk_indices[b, i].tolist():
                    k_start = block_idx * self.block_size
                    sparse_mask[b, q_start:q_end, k_start:k_start + self.block_size] = 1
                sparse_mask[b, q_start:q_end, q_start:q_end] = 1  # local block
        sparse_mask = sparse_mask.unsqueeze(1)  # (batch, 1, seq, seq)
        if attention_mask is not None:
            sparse_mask = sparse_mask * attention_mask.unsqueeze(1).unsqueeze(1)
        # Dense attention under the sparse mask (a production kernel would
        # compute only the selected blocks)
        q = q.transpose(1, 2)  # (batch, heads, seq, head_dim)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        scores = scores.masked_fill(sparse_mask == 0, -1e9)
        attn_weights = self.dropout(F.softmax(scores, dim=-1))
        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.output_proj(output[:, :orig_len])  # drop padding
```
Sparse attention patterns can achieve significant efficiency gains while maintaining quality. The key is selecting the right pattern for the application. Static patterns are simpler and more predictable, while dynamic patterns can adapt to input content but require additional computation for pattern selection.
KV Cache Optimization
The key-value cache is a major memory bottleneck for long-context inference. Standard attention stores all key and value vectors for previous tokens, requiring O(nd) memory for a sequence of length n. For long sequences, this cache can dominate memory consumption, limiting the maximum context length.
KV cache compression techniques reduce memory usage by storing compressed or selected key-value representations rather than all tokens. These techniques can dramatically reduce memory requirements while preserving the model’s ability to attend to relevant historical information.
KV cache eviction removes less important tokens from the cache as new tokens are added. Eviction strategies include removing the oldest tokens (FIFO), removing tokens with lowest attention scores, and learned eviction policies that decide which tokens to keep. The trade-off is between memory savings and the potential loss of important historical information.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVCacheEviction(nn.Module):
    """KV cache with eviction for long-context inference."""

    def __init__(self, d_model, max_cache_size=4096, eviction_strategy="attention"):
        super().__init__()
        self.d_model = d_model
        self.max_cache_size = max_cache_size
        self.eviction_strategy = eviction_strategy

    def update_cache(self, k, v, current_cache_k, current_cache_v, attention_scores=None):
        """Update the KV cache, evicting once it exceeds max_cache_size.

        attention_scores, if provided, is assumed to have shape
        (batch, heads, q_len, cache_len) covering the combined cache.
        """
        if current_cache_k is None:
            # Initialize cache
            return k, v
        # Concatenate new keys/values with the cache
        combined_k = torch.cat([current_cache_k, k], dim=1)
        combined_v = torch.cat([current_cache_v, v], dim=1)
        # No eviction needed while the cache still fits
        if combined_k.size(1) <= self.max_cache_size:
            return combined_k, combined_v
        if self.eviction_strategy == "attention" and attention_scores is not None:
            # Importance of each cached token: total attention it received,
            # summed over batch, heads, and queries
            importance = attention_scores.sum(dim=(0, 1, 2))
            _, keep_indices = torch.topk(importance, self.max_cache_size, largest=True)
            keep_indices, _ = torch.sort(keep_indices)  # preserve token order
            combined_k = combined_k[:, keep_indices]
            combined_v = combined_v[:, keep_indices]
        elif self.eviction_strategy == "learned":
            # Placeholder for a learned eviction policy; here we simply keep
            # the earliest and most recent tokens (an "attention sink" heuristic)
            total_len = combined_k.size(1)
            n_front = self.max_cache_size // 2
            n_back = self.max_cache_size - n_front
            keep_indices = torch.cat([
                torch.arange(n_front, device=combined_k.device),
                torch.arange(total_len - n_back, total_len, device=combined_k.device),
            ])
            combined_k = combined_k[:, keep_indices]
            combined_v = combined_v[:, keep_indices]
        else:
            # FIFO (also the fallback): drop the oldest tokens
            combined_k = combined_k[:, -self.max_cache_size:]
            combined_v = combined_v[:, -self.max_cache_size:]
        return combined_k, combined_v

class KVCacheAllocation(nn.Module):
    """KV cache with intelligent allocation across tokens and heads."""

    def __init__(self, d_model, n_heads, total_budget=4096):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.total_budget = total_budget
        # Learnable network that scores each token for each head
        self.allocation_net = nn.Linear(d_model, n_heads)

    def compute_allocation(self, x):
        """Per-token, per-head importance scores in (0, 1)."""
        return torch.sigmoid(self.allocation_net(x))  # (batch, seq, heads)

    def allocate_cache(self, k, v, allocation_scores):
        """Keep the top-scoring tokens for each head, splitting the total
        budget across heads in proportion to their summed scores."""
        batch_size, seq_len, d_model = k.shape
        head_dim = d_model // self.n_heads
        # Per-head budget proportional to total importance
        head_totals = allocation_scores.sum(dim=(0, 1))  # (heads,)
        budgets = (head_totals / head_totals.sum() * self.total_budget).long()
        budgets = budgets.clamp(min=1, max=seq_len)
        # Heads keep different numbers of tokens, so pad them to a common
        # length before concatenating (padded slots hold zero keys/values)
        max_budget = int(budgets.max())
        allocated_k, allocated_v = [], []
        for h in range(self.n_heads):
            head_k = k[:, :, h * head_dim:(h + 1) * head_dim]
            head_v = v[:, :, h * head_dim:(h + 1) * head_dim]
            budget = int(budgets[h])
            # Top tokens for this head, restored to sequence order
            _, top_idx = torch.topk(allocation_scores[:, :, h], budget, dim=1)
            top_idx, _ = torch.sort(top_idx, dim=1)  # (batch, budget)
            idx = top_idx.unsqueeze(-1).expand(-1, -1, head_dim)
            sel_k = torch.gather(head_k, 1, idx)
            sel_v = torch.gather(head_v, 1, idx)
            if budget < max_budget:
                pad = (0, 0, 0, max_budget - budget)
                sel_k, sel_v = F.pad(sel_k, pad), F.pad(sel_v, pad)
            allocated_k.append(sel_k)
            allocated_v.append(sel_v)
        return torch.cat(allocated_k, dim=-1), torch.cat(allocated_v, dim=-1)
```
KV cache optimization is essential for practical long-context deployment. The combination of eviction, compression, and intelligent allocation can reduce memory usage by 90% or more while maintaining model quality for most applications.
Long-Context Fine-tuning
Models pretrained with standard attention may not generalize well to sliding window or sparse attention during inference. This training-inference mismatch can cause significant quality degradation when applying efficient attention techniques to pretrained models.
Long-context fine-tuning adapts pretrained models to efficient attention mechanisms by continuing training with the target attention pattern. During fine-tuning, the model learns to work within the constraints of the efficient attention mechanism, developing representations that are effective even with limited attention scope.
The fine-tuning process typically uses long sequences (32K+ tokens) with the target efficient attention mechanism. The model learns to capture long-range dependencies through the available attention patterns, developing strategies for maintaining information across the limited attention window.
Progressive lengthening is a common fine-tuning strategy that gradually increases sequence length. Starting with shorter sequences and progressively increasing to longer lengths allows the model to adapt gradually, avoiding the instability that can result from sudden exposure to very long sequences.
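A progressive lengthening schedule can be as simple as a staged doubling of the training sequence length. The starting length, target length, and steps per stage below are illustrative values, not recommendations.

```python
def progressive_length_schedule(start_len=4096, target_len=131072,
                                steps_per_stage=1000, growth_factor=2):
    """Return a list of (training_step, sequence_length) pairs, growing the
    sequence length by growth_factor each stage until target_len is reached."""
    schedule = []
    seq_len, step = start_len, 0
    while seq_len < target_len:
        schedule.append((step, seq_len))
        step += steps_per_stage
        seq_len = min(seq_len * growth_factor, target_len)
    schedule.append((step, target_len))  # final stage at the target length
    return schedule

# Example: 4K -> 8K -> 16K -> 32K -> 64K -> 128K over six stages
for step, length in progressive_length_schedule():
    print(f"from step {step}: train at {length} tokens")
```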
Evaluation and Benchmarks
Evaluating long-context models requires specialized benchmarks that test different aspects of long-range understanding. Understanding these benchmarks helps practitioners assess model quality and compare different efficient attention techniques.
Needle-in-a-haystack tests the model’s ability to find a specific fact inserted at a random position in a long document. This benchmark tests basic retrieval capability and attention effectiveness across different context lengths.
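A minimal needle-in-a-haystack instance can be built by inserting a known fact at a chosen relative depth in filler text and then querying the model for it. The sketch below approximates token counts with whitespace word counts; a real harness would use the model's own tokenizer and sweep both context length and depth.

```python
import random

def make_needle_test(needle, filler_sentences, context_len_tokens, depth=0.5):
    """Build a needle-in-a-haystack prompt: filler text with one fact
    (the needle) inserted at a relative depth in [0, 1]."""
    haystack = []
    n_words = 0
    # Accumulate filler until the target length is reached
    while n_words < context_len_tokens:
        s = random.choice(filler_sentences)
        haystack.append(s)
        n_words += len(s.split())
    # Insert the needle at the requested depth
    insert_at = int(depth * len(haystack))
    haystack.insert(insert_at, needle)
    return " ".join(haystack)
```

Evaluation then asks the model a question whose answer is the needle (e.g. "What is the secret code?") and checks whether the answer appears in the completion, repeated across depths and context lengths.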
Summarization benchmarks evaluate the model’s ability to produce coherent summaries of long documents. Quality metrics assess both factual accuracy and linguistic quality of generated summaries.
RULER is a comprehensive benchmark that includes various long-context tasks: single-document QA, multi-document QA, few-shot learning, and synthetic tasks. The benchmark provides a unified evaluation framework for comparing different long-context techniques.
Combining Techniques
The most effective long-context solutions often combine multiple efficient attention techniques. Understanding how techniques interact enables optimized implementations for specific applications.
Sliding window attention combined with KV cache eviction provides a simple and effective solution. The sliding window limits attention computation, while eviction manages memory usage over very long contexts.
Hierarchical attention with sparse selection at each level can handle extremely long sequences. Lower levels use sliding window for local processing, while higher levels use sparse selection to identify relevant segments for detailed attention.
Dynamic sparse attention with learned importance prediction provides the most adaptive approach. The dynamic selection can focus on the most relevant tokens for each query, potentially achieving better quality than static patterns.
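The first combination, sliding window attention plus FIFO eviction, reduces to a short decode-step loop: because the window bounds how far back any query can attend, evicted entries can never be needed again, so the cache stays at a fixed size with no quality loss relative to sliding-window attention alone. The sketch below is a single-layer, single-head-per-tensor simplification.

```python
import torch

def sliding_window_step(q, cache_k, cache_v, new_k, new_v, window_size):
    """One decode step combining sliding-window attention with FIFO KV-cache
    eviction: append the new token's K/V, evict beyond the window, then
    attend over the bounded cache. Shapes: q (batch, 1, d), caches (batch, t, d)."""
    # Append and evict: the cache never exceeds window_size entries
    cache_k = torch.cat([cache_k, new_k], dim=1)[:, -window_size:]
    cache_v = torch.cat([cache_v, new_v], dim=1)[:, -window_size:]
    d = q.size(-1)
    scores = torch.matmul(q, cache_k.transpose(-2, -1)) / d ** 0.5
    out = torch.matmul(torch.softmax(scores, dim=-1), cache_v)
    return out, cache_k, cache_v
```

However long generation runs, memory stays O(window_size · d) per layer instead of growing with the full context.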
Challenges and Limitations
Efficient long-context techniques face several challenges that limit their applicability in some scenarios.
Training-inference mismatch remains a significant challenge. Models pretrained with full attention may not generalize well to efficient attention patterns, requiring fine-tuning that may not fully recover original quality.
Very long-range dependencies may be lost with limited attention scope. While techniques like hierarchical attention can connect distant tokens, the information flow is constrained compared to full attention.
Hardware utilization patterns may differ from standard attention, potentially limiting efficiency gains on some platforms. Kernel-level optimization is often required to achieve theoretical efficiency benefits.
Future Directions
Research on efficient long-context continues to advance, with several promising directions.
Training-free adaptation methods could enable efficient attention without fine-tuning. These methods would adapt pretrained models to efficient attention patterns through inference-time techniques.
Hardware-software co-design for efficient attention could unlock additional efficiency. Custom kernels and hardware support for sparse and linear attention patterns could significantly improve practical performance.
Unified frameworks that combine multiple efficient attention techniques could provide optimal efficiency across different context lengths and hardware platforms.
Resources
- Long-Context Modeling with Dynamic Hierarchical Sparse Attention
- Sliding Window Attention Adaptation for Efficient Long-Context LLMs
- Near-Lossless Acceleration of Long Context LLM Inference
- Efficient Transformer Cache Techniques
Conclusion
Efficient long-context techniques have transformed what’s possible with language models, enabling practical processing of sequences with 100K or more tokens. From simple sliding window attention to sophisticated dynamic sparse attention, these techniques provide a toolkit for building long-context applications without prohibitive computational costs.
The key to effective long-context deployment is selecting the right techniques for the application. Sliding window attention provides a good starting point for many applications. Hierarchical approaches excel for documents with natural structure. Dynamic sparse attention offers the most adaptability for complex long-range dependencies.
As research continues, efficient long-context techniques will become even more effective, enabling longer contexts and better quality. Understanding these techniques provides a foundation for building the next generation of long-context language model applications.