
PagedAttention: Memory Optimization Revolution for LLM Inference

Introduction

Deploying large language models at scale presents a fundamental challenge: memory management. Traditional approaches to KV cache storage suffer from severe fragmentation, causing GPU memory waste and limiting batch sizes. This directly impacts throughput and increases operational costs.

PagedAttention, introduced by researchers from UC Berkeley, Stanford, and UC San Diego, solves this problem by applying operating system concepts, specifically virtual memory and paging, to transformer inference. The result is up to 24x better throughput compared to Hugging Face's implementation, enabling more efficient LLM serving.

The Memory Challenge in LLM Inference

KV Cache Problem

During autoregressive generation, transformers must store attention keys and values for all previously processed tokens:

Sequence Length     KV Cache Size (7B model, FP16, 8 KV heads)
---------------     ------------------------------------------
1K tokens           ~128 MB
4K tokens           ~512 MB
16K tokens          ~2 GB
64K tokens          ~8 GB

With multiple concurrent requests, memory consumption multiplies rapidly.
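The arithmetic behind the table is straightforward: the cache stores one key and one value vector per layer, per KV head, per token. A rough sketch, assuming a 7B-class model with 32 layers, 8 KV heads (grouped-query attention), and a head dimension of 128:

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # 2 tensors (K and V), per layer, per KV head, per head dimension
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

for n in (1024, 4096, 16384, 65536):
    print(f"{n:>6} tokens: {kv_cache_bytes(n) / 2**20:,.0f} MiB")
# 128, 512, 2048, 8192 MiB
```

At 128 KiB per token, a batch of 64 such 4K-token requests already consumes 32 GB, which is why batch size, not compute, is often the binding constraint.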

Traditional Approach Limitations

Conventional KV cache management suffers from:

  1. Contiguous Allocation: KV cache requires contiguous memory blocks
  2. Internal Fragmentation: Unused space within allocated blocks
  3. External Fragmentation: Difficulty finding large contiguous regions
  4. Poor Sharing: No efficient mechanism for sharing cache across requests
import torch

# Traditional approach - allocate the full sequence upfront
class TraditionalKVCache:
    def __init__(self, max_length=4096, num_heads=32, head_dim=128):
        # Pre-allocate the maximum possible memory
        self.cache = torch.zeros(
            (2,  # K and V
             max_length,
             num_heads,
             head_dim),
            dtype=torch.float16,
            device='cuda'
        )
        # Problem: always allocates max_length, even for short sequences

This approach wastes significant memory when sequences are shorter than max_length.

PagedAttention: OS-Inspired Solution

Core Inspiration

Operating systems solved similar memory fragmentation problems decades ago using paging:

  • Pages: Fixed-size memory blocks (typically 4KB)
  • Page Tables: Mapping from virtual to physical pages
  • Demand Paging: Allocate pages only when needed

PagedAttention applies identical concepts to KV cache management.
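The core mechanism is the same address translation an OS performs for virtual memory. A toy sketch (page size and table entries are hypothetical; vLLM's default block holds 16 tokens):

```python
PAGE_SIZE = 4  # tokens per page (toy value)

# Hypothetical page table: logical page number -> physical block id
page_table = {0: 7, 1: 2, 2: 9}

def translate(token_pos):
    # Same arithmetic an OS MMU performs for a virtual address:
    # split the position into a page number and an offset within it
    logical_page, offset = divmod(token_pos, PAGE_SIZE)
    return page_table[logical_page], offset

print(translate(6))   # (2, 2): physical block 2, offset 2
print(translate(10))  # (9, 2): physical block 9, offset 2
```

Because physical blocks can live anywhere in GPU memory, sequences grow one block at a time and fragmentation is confined to the last, partially filled block.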

How PagedAttention Works

class PagedAttentionKVCache:
    """
    PagedAttention-inspired KV cache management
    Inspired by the vLLM implementation
    """

    def __init__(self, block_size=16, num_blocks=1024):
        self.block_size = block_size  # Tokens per block
        self.num_blocks = num_blocks

        # Block tables: seq_id -> list of physical block indices
        self.block_tables = {}
        # Actual token count per sequence (capacity may be larger)
        self.seq_lengths = {}

        # Pre-allocated physical blocks (not tied to any sequence)
        self.free_blocks = list(range(num_blocks))

    def allocate(self, seq_id, num_tokens):
        """Allocate blocks for a new sequence"""
        num_blocks_needed = (num_tokens + self.block_size - 1) // self.block_size

        if len(self.free_blocks) < num_blocks_needed:
            raise MemoryError("No free blocks available")

        # Allocate blocks on demand
        allocated_blocks = [
            self.free_blocks.pop() for _ in range(num_blocks_needed)
        ]

        self.block_tables[seq_id] = allocated_blocks
        self.seq_lengths[seq_id] = num_tokens
        return allocated_blocks

    def append_tokens(self, seq_id, new_tokens):
        """Append new tokens, allocating new blocks only when needed"""
        new_length = self.seq_lengths[seq_id] + len(new_tokens)
        total_blocks_needed = (
            new_length + self.block_size - 1
        ) // self.block_size
        num_new_blocks = total_blocks_needed - len(self.block_tables[seq_id])

        if num_new_blocks > len(self.free_blocks):
            raise MemoryError("No free blocks available")

        for _ in range(num_new_blocks):
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.seq_lengths[seq_id] = new_length

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool"""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lengths[seq_id]

    def get_blocks_for_attention(self, seq_id):
        """Retrieve physical block indices for attention computation"""
        return self.block_tables[seq_id]

Attention Computation with Pages

import torch
import torch.nn.functional as F

def paged_attention(
    query,           # [batch, num_heads, query_length, head_dim]
    key_cache,       # [num_blocks, num_heads, block_size, head_dim]
    value_cache,     # [num_blocks, num_heads, block_size, head_dim]
    block_tables,    # seq -> list of physical block indices
    query_length,
    max_seq_length
):
    """
    Compute attention using a paged KV cache (simplified, no causal mask)
    """
    head_dim = query.size(-1)
    block_size = key_cache.size(2)

    # Initialize scores to -inf so unfilled positions (beyond a
    # sequence's length) vanish under softmax
    attn_weights = torch.full(
        (query.size(0), query.size(1), query_length, max_seq_length),
        float('-inf'),
        device=query.device
    )

    for seq_id in range(query.size(0)):
        q = query[seq_id]  # [num_heads, query_length, head_dim]

        # Iterate over logical blocks; each maps to a physical block
        for logical_idx, block_idx in enumerate(block_tables[seq_id]):
            # Load K from this physical block
            k_block = key_cache[block_idx]  # [num_heads, block_size, head_dim]

            # Scaled dot-product scores for this block
            attn_scores = torch.matmul(q, k_block.transpose(-2, -1))
            attn_scores = attn_scores / (head_dim ** 0.5)

            # The *logical* index, not the physical block id,
            # determines the position within the sequence
            block_start = logical_idx * block_size
            attn_weights[
                seq_id, :, :, block_start:block_start + block_size
            ] = attn_scores

    # Softmax over the full (padded) context
    attn_weights = F.softmax(attn_weights, dim=-1)

    # Weighted sum over values, again one block at a time
    output = torch.zeros_like(query)
    for seq_id in range(query.size(0)):
        for logical_idx, block_idx in enumerate(block_tables[seq_id]):
            v_block = value_cache[block_idx]
            block_start = logical_idx * block_size
            w = attn_weights[seq_id, :, :, block_start:block_start + block_size]
            output[seq_id] += torch.matmul(w, v_block)

    return output

vLLM Implementation Details

Architecture Overview

vLLM implements PagedAttention with several optimizations:

# vLLM-style attention kernel structure
class PagedAttentionKernel:
    
    BLOCK_SIZE = 16  # Tokens per KV cache block
    
    @staticmethod
    def forward(
        query,           # [num_seqs, num_heads, head_dim]
        key,             # Pre-computed and stored in pages
        value,           # Pre-computed and stored in pages
        block_tables,    # [num_seqs, max_blocks_per_seq]
        context_lengths # [num_seqs] - actual sequence lengths
    ):
        """
        Fused attention kernel for PagedAttention
        - Combines block lookup, attention computation
        - Minimizes memory access
        - Utilizes tensor cores for acceleration
        """
        # Kernel launches thousands of threads
        # Each thread block processes one sequence
        # Shared memory stores current query block
        # Global memory access patterns optimized for cache

Memory Management

import time

import torch

class KVCacheManager:
    """
    Complete KV cache memory manager
    """

    def __init__(self, gpu_memory: int, block_size: int = 16,
                 num_kv_heads: int = 8, head_dim: int = 128):
        self.block_size = block_size
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
        self.kv_cache_dtype = torch.float16

        # Bytes for one block: K and V, FP16 (2 bytes per element)
        block_bytes = 2 * num_kv_heads * head_dim * block_size * 2

        # Reserve memory for model weights, etc.;
        # allocate the remainder for the KV cache
        self.num_blocks = gpu_memory // block_bytes
        self.kv_cache_shape = (
            2,  # K and V
            self.num_blocks,
            self.num_kv_heads,
            self.head_dim,
            self.block_size,
        )

        # Allocation state
        self.free_blocks = list(range(self.num_blocks))
        self.sequence_tables = {}       # seq_id -> block table
        self.sequence_access_time = {}  # seq_id -> last access time

        # Pinned CPU memory for swapping evicted blocks
        self.cpu_fallback = torch.empty(
            self.kv_cache_shape,
            dtype=self.kv_cache_dtype,
            pin_memory=True,
        )

        # GPU KV cache
        self.gpu_kv_cache = torch.empty(
            self.kv_cache_shape,
            dtype=self.kv_cache_dtype,
            device='cuda',
        )

    def _allocate_block(self) -> int:
        """Take one block from the free pool"""
        if not self.free_blocks:
            raise MemoryError("No free blocks; eviction required")
        return self.free_blocks.pop()

    def allocate_sequence(self, seq_id: int, max_length: int):
        """Allocate KV cache blocks for a new sequence"""
        num_blocks = (max_length + self.block_size - 1) // self.block_size

        block_table = [self._allocate_block() for _ in range(num_blocks)]

        self.sequence_tables[seq_id] = block_table
        self.sequence_access_time[seq_id] = time.monotonic()
        return block_table

    def evict_blocks(self, num_blocks_needed: int):
        """Evict blocks from the least recently used sequences"""
        # Sort sequences by last access time, oldest first
        lru_order = sorted(
            self.sequence_access_time.items(),
            key=lambda x: x[1]
        )

        evicted_blocks = []
        for seq_id, _ in lru_order:
            if len(evicted_blocks) >= num_blocks_needed:
                break

            # Copy to CPU so the sequence can be resumed later
            # (_move_to_cpu, not shown, copies into self.cpu_fallback)
            self._move_to_cpu(seq_id)
            evicted_blocks.extend(self.sequence_tables.pop(seq_id))
            del self.sequence_access_time[seq_id]

        self.free_blocks.extend(evicted_blocks)
        return evicted_blocks

Performance Benefits

Throughput Improvements

System                  Batch Size    Throughput (tok/s)    Improvement
HuggingFace             16            50                    1x
vLLM (PagedAttention)   16            450                   9x
vLLM                    256           3,200                 64x
vLLM (Optimized)        512           6,400                 128x

Memory Efficiency

PagedAttention dramatically reduces memory waste:

Scenario                          Traditional          PagedAttention       Savings
1000 sequences, avg 256 tokens    128 GB allocated     48 GB used           62%
Single 8K sequence                8 GB fixed           4 GB actual          50%
Mixed-length requests             80% fragmentation    <5% fragmentation    75%
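The savings come almost entirely from eliminating over-allocation: traditional caches reserve `max_length` slots per request, while paging wastes at most one partially filled block per sequence. A sketch of that comparison, with a hypothetical `max_length` of 2048 and an average request length of 250 tokens:

```python
def waste_traditional(num_seqs, avg_len, max_len):
    # Each sequence pre-allocates max_len slots regardless of actual use
    allocated = num_seqs * max_len
    return 1 - (num_seqs * avg_len) / allocated

def waste_paged(num_seqs, avg_len, block_size=16):
    # Only the last, partially filled block per sequence is wasted
    blocks = -(-avg_len // block_size)  # ceil division
    return 1 - avg_len / (blocks * block_size)

print(f"traditional: {waste_traditional(1000, 250, 2048):.1%}")  # 87.8%
print(f"paged:       {waste_paged(1000, 250):.1%}")              # 2.3%
```

The exact percentages depend on the length distribution and the chosen `max_length`, but the qualitative gap holds whenever requests are much shorter than the worst case.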

Advanced Features

Block-Sized KV Cache

# Physical layout of the KV cache
# [num_blocks, num_kv_heads, head_size, block_size]
# Example: FP16, 8 KV heads, 128 head_dim, 16 block_size
# Per K (or V) block, per layer: 8 * 128 * 16 * 2 bytes = 32 KB
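Spelling the per-block arithmetic out, using the example parameters above:

```python
num_kv_heads, head_dim, block_size, fp16_bytes = 8, 128, 16, 2

# One K (or V) block for a single layer
block_bytes = num_kv_heads * head_dim * block_size * fp16_bytes
print(block_bytes)          # 32768 bytes
print(block_bytes // 1024)  # 32 KB
```

Small, uniform blocks like this are what make the free-list allocation cheap: every block is interchangeable, so allocation and freeing are O(1) list operations.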

Flexible Context Handling

def handle_variable_length(
    queries: List[torch.Tensor],
    block_tables: Dict[int, List[int]],
    context_lengths: List[int]
):
    """Handle different sequence lengths efficiently"""
    
    # Sort by length for optimal GPU utilization
    sorted_indices = sorted(
        range(len(queries)),
        key=lambda i: context_lengths[i],
        reverse=True
    )
    
    # Process in sorted order for better memory coalescing
    for idx in sorted_indices:
        query = queries[idx]
        blocks = block_tables[idx]
        length = context_lengths[idx]
        
        # Compute attention for this sequence
        # (compute_paged_attention is a placeholder for the
        #  paged attention routine sketched earlier)
        yield compute_paged_attention(
            query, blocks, length
        )

Practical Implementation with vLLM

Basic Usage

from vllm import LLM, SamplingParams

# Initialize vLLM engine
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # Use 4 GPUs
    max_num_seqs=256,        # Maximum concurrent sequences
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=512,
)

# Generate from multiple prompts
prompts = [
    "Write a story about artificial intelligence",
    "Explain quantum computing in simple terms",
    "What are the benefits of meditation?",
]

# vLLM automatically uses PagedAttention
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Advanced Configuration

from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine

# Fine-tune PagedAttention behavior
engine_args = EngineArgs(
    model="meta-llama/Llama-2-70b-hf",
    kv_cache_dtype="auto",      # match model dtype (or "fp8" to quantize)
    block_size=16,              # tokens per KV cache block
    gpu_memory_utilization=0.9,
    swap_space=32,              # GiB of CPU swap for overflow
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
engine = LLMEngine.from_engine_args(engine_args)

Comparison with Other Techniques

vs. Traditional KV Cache

Aspect           Traditional           PagedAttention
Memory Layout    Contiguous            Paged / non-contiguous
Fragmentation    High                  Minimal
Batch Size       Limited               Much larger
Flexibility      Fixed allocation      Dynamic allocation

Complementary Techniques

PagedAttention works synergistically with:

  • FlashAttention: Faster attention computation within blocks
  • KV Cache Quantization: Further memory reduction
  • Speculative Decoding: Better batch utilization
  • Continuous Batching: Improved request scheduling

Conclusion

PagedAttention represents a fundamental breakthrough in LLM serving infrastructure. By applying operating system concepts to transformer inference, it achieves:

  • Up to 24x throughput improvement over naive implementations
  • Dramatic reduction in memory fragmentation
  • Support for much larger batch sizes
  • Flexible memory management for variable-length sequences

The technique has become a cornerstone of modern LLM serving infrastructure, enabling more cost-effective and scalable deployments. As models grow larger and serving demands increase, PagedAttention’s importance will only continue to grow.
