Introduction
Deploying large language models at scale presents a fundamental challenge: memory management. Traditional approaches to KV cache storage suffer from severe fragmentation, causing GPU memory waste and limiting batch sizes. This directly impacts throughput and increases operational costs.
PagedAttention, introduced by researchers from UC Berkeley, Stanford, and UC San Diego, solves this problem by applying operating system concepts, specifically virtual memory and paging, to transformer inference. The result is up to 24x higher throughput than Hugging Face Transformers, enabling far more efficient LLM serving.
The Memory Challenge in LLM Inference
KV Cache Problem
During autoregressive generation, transformers must store attention keys and values for all previously processed tokens:
| Sequence Length | KV Cache Size (7B model, FP16) |
|---|---|
| 1K tokens | ~128 MB |
| 4K tokens | ~512 MB |
| 16K tokens | ~2 GB |
| 64K tokens | ~8 GB |
With multiple concurrent requests, memory consumption multiplies rapidly.
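To make the table concrete, the per-token footprint can be computed directly. The sketch below assumes a 7B-class model with 32 layers, 8 KV heads (grouped-query attention), head dimension 128, and FP16 storage; this is an assumed configuration, chosen because it reproduces the ~128 MB per 1K tokens figure above.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim bytes per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token

print(kv_cache_bytes(1024) / 2**20)   # 128.0 (MiB)
print(kv_cache_bytes(65536) / 2**30)  # 8.0 (GiB)
```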
Traditional Approach Limitations
Conventional KV cache management suffers from:
- Contiguous Allocation: the KV cache requires contiguous memory blocks
- Internal Fragmentation: Unused space within allocated blocks
- External Fragmentation: Difficulty finding large contiguous regions
- Poor Sharing: No efficient mechanism for sharing cache across requests
```python
# Traditional approach - allocate the full sequence upfront
import torch

class TraditionalKVCache:
    def __init__(self, max_length=4096, num_heads=32, head_dim=128):
        # Pre-allocate the maximum possible memory
        self.cache = torch.zeros(
            (2,            # K and V
             max_length,
             num_heads,
             head_dim),
            dtype=torch.float16,
            device='cuda'
        )
        # Problem: always allocates max_length, even for short sequences
```
This approach wastes significant memory when sequences are shorter than max_length.
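The waste is easy to quantify: with upfront allocation, everything beyond the actual sequence length is dead memory. A quick check, using the 4096-token max_length from the example above:

```python
def wasted_fraction(actual_tokens, max_length=4096):
    """Fraction of the pre-allocated cache the sequence never uses."""
    return (max_length - actual_tokens) / max_length

# A 256-token request wastes 93.75% of its pre-allocated cache
print(wasted_fraction(256))  # 0.9375
```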
PagedAttention: OS-Inspired Solution
Core Inspiration
Operating systems solved similar memory fragmentation problems decades ago using paging:
- Pages: Fixed-size memory blocks (typically 4KB)
- Page Tables: Mapping from virtual to physical pages
- Demand Paging: Allocate pages only when needed
PagedAttention applies the same concepts to KV cache management.
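The analogy is direct: a per-sequence block table plays the role of an OS page table, translating a logical token position into a physical cache slot. A minimal sketch (the names and block values are illustrative):

```python
BLOCK_SIZE = 16  # tokens per block, analogous to an OS page size

# Block table for one sequence: logical block index -> physical block index
block_table = [7, 3, 42]  # physical blocks need not be contiguous

def translate(token_position):
    """Map a logical token position to (physical_block, offset), like a page table."""
    logical_block = token_position // BLOCK_SIZE
    offset = token_position % BLOCK_SIZE
    return block_table[logical_block], offset

print(translate(0))   # (7, 0)
print(translate(20))  # (3, 4)
```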
How PagedAttention Works
```python
class PagedAttentionKVCache:
    """
    PagedAttention-inspired KV cache management
    (inspired by the vLLM implementation).
    """
    def __init__(self, block_size=16, num_blocks=1024):
        self.block_size = block_size   # Tokens per block
        self.num_blocks = num_blocks
        self.block_tables = {}         # seq_id -> list of block indices
        self.seq_lengths = {}          # seq_id -> number of tokens stored
        # Pre-allocate physical blocks (not tied to any sequence)
        self.free_blocks = list(range(num_blocks))

    def allocate(self, seq_id, num_tokens):
        """Allocate blocks for a new sequence."""
        num_blocks_needed = (num_tokens + self.block_size - 1) // self.block_size
        if len(self.free_blocks) < num_blocks_needed:
            raise MemoryError("No free blocks available")
        # Allocate blocks on demand
        allocated_blocks = [self.free_blocks.pop() for _ in range(num_blocks_needed)]
        self.block_tables[seq_id] = allocated_blocks
        self.seq_lengths[seq_id] = num_tokens
        return allocated_blocks

    def append_tokens(self, seq_id, new_tokens):
        """Append tokens, allocating new blocks only when the last block fills up."""
        new_length = self.seq_lengths[seq_id] + len(new_tokens)
        total_blocks_needed = (new_length + self.block_size - 1) // self.block_size
        num_new_blocks = total_blocks_needed - len(self.block_tables[seq_id])
        if num_new_blocks > len(self.free_blocks):
            raise MemoryError("No free blocks available")
        for _ in range(num_new_blocks):
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.seq_lengths[seq_id] = new_length

    def get_blocks_for_attention(self, seq_id):
        """Retrieve block indices for attention computation."""
        return self.block_tables[seq_id]
```
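Because sequences reference physical blocks only through their block tables, two sequences can share a common prefix by pointing at the same blocks; vLLM implements this with reference counting and copy-on-write. A simplified sketch (reference counting only; class and method names are hypothetical):

```python
class SharedBlockPool:
    """Reference-counted physical blocks, enabling prefix sharing across sequences."""
    def __init__(self, num_blocks=1024):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}    # block_id -> number of sequences using it
        self.block_tables = {}  # seq_id -> list of block ids

    def allocate(self, seq_id, num_blocks):
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        for b in blocks:
            self.ref_counts[b] = 1
        self.block_tables[seq_id] = blocks
        return blocks

    def fork(self, parent_id, child_id):
        """Share the parent's blocks with a child (e.g. parallel sampling)."""
        blocks = self.block_tables[parent_id]
        for b in blocks:
            self.ref_counts[b] += 1
        self.block_tables[child_id] = list(blocks)

    def free(self, seq_id):
        """Release a sequence; a block returns to the pool when no one uses it."""
        for b in self.block_tables.pop(seq_id):
            self.ref_counts[b] -= 1
            if self.ref_counts[b] == 0:
                self.free_blocks.append(b)

pool = SharedBlockPool()
pool.allocate("parent", 4)
pool.fork("parent", "child")  # zero-copy: both sequences map the same 4 blocks
print(pool.block_tables["parent"] == pool.block_tables["child"])  # True
```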
Attention Computation with Pages
```python
import torch
import torch.nn.functional as F

def paged_attention(
    query,         # [batch, num_heads, query_length, head_dim]
    key_cache,     # Physical block storage: [num_blocks, num_heads, block_size, head_dim]
    value_cache,   # Physical block storage (same layout)
    block_tables,  # Mapping: seq -> list of physical block indices
    query_length,
    max_seq_length,
):
    """Compute attention using a paged KV cache (illustrative, unoptimized)."""
    batch_size, num_heads, _, head_dim = query.shape
    block_size = key_cache.size(2)
    # Initialize to -inf so positions with no allocated block are masked out
    attn_weights = torch.full(
        (batch_size, num_heads, query_length, max_seq_length),
        float('-inf'),
        device=query.device,
    )
    for seq_id in range(batch_size):
        q = query[seq_id]  # [num_heads, query_length, head_dim]
        # Iterate over the sequence's pages instead of a contiguous buffer
        for logical_idx, block_idx in enumerate(block_tables[seq_id]):
            # Load K from this physical block
            k_block = key_cache[block_idx]  # [num_heads, block_size, head_dim]
            attn_scores = torch.matmul(q, k_block.transpose(-2, -1))
            attn_scores = attn_scores / (head_dim ** 0.5)
            # The logical position in the sequence, not the physical block id,
            # determines where the scores land
            block_start = logical_idx * block_size
            attn_weights[seq_id, :, :, block_start:block_start + block_size] = attn_scores
    # Softmax and value multiplication
    attn_weights = F.softmax(attn_weights, dim=-1)
    output = torch.zeros_like(query)
    for seq_id in range(batch_size):
        # Gather this sequence's value blocks in logical order, then weight them
        v = torch.cat([value_cache[b] for b in block_tables[seq_id]], dim=1)
        seq_len = v.size(1)
        output[seq_id] = torch.matmul(attn_weights[seq_id, :, :, :seq_len], v)
    return output
```
vLLM Implementation Details
Architecture Overview
vLLM implements PagedAttention with several optimizations:
```python
# vLLM-style attention kernel structure (descriptive sketch, not runnable)
class PagedAttentionKernel:
    BLOCK_SIZE = 16  # Tokens per KV cache block

    @staticmethod
    def forward(
        query,            # [num_seqs, num_heads, head_dim]
        key,              # Pre-computed and stored in pages
        value,            # Pre-computed and stored in pages
        block_tables,     # [num_seqs, max_blocks_per_seq]
        context_lengths,  # [num_seqs] - actual sequence lengths
    ):
        """
        Fused attention kernel for PagedAttention:
        - combines block lookup and attention computation
        - minimizes memory traffic
        - uses tensor cores for acceleration

        In the actual CUDA kernel, thousands of threads are launched, each
        thread block processes one sequence, shared memory holds the current
        query block, and global memory access patterns are optimized for the
        cache hierarchy.
        """
```
Memory Management
```python
import torch

class KVCacheManager:
    """
    Complete KV cache memory manager (simplified sketch)
    """
    def __init__(self, num_blocks: int, num_kv_heads: int = 8,
                 head_dim: int = 128, block_size: int = 16):
        self.block_size = block_size
        # In practice, num_blocks is derived from the GPU memory left over
        # after reserving space for model weights and activations
        self.num_blocks = num_blocks
        self.kv_cache_dtype = torch.float16
        self.kv_cache_shape = (
            2,  # K and V
            num_blocks,
            num_kv_heads,
            head_dim,
            block_size,
        )
        # Pinned CPU memory for swapping evicted blocks
        self.cpu_fallback = torch.empty(
            self.kv_cache_shape, dtype=self.kv_cache_dtype, pin_memory=True
        )
        # GPU KV cache
        self.gpu_kv_cache = torch.empty(
            self.kv_cache_shape, dtype=self.kv_cache_dtype, device='cuda'
        )
        self.free_blocks = list(range(num_blocks))
        self.sequence_tables = {}       # seq_id -> list of block ids
        self.sequence_access_time = {}  # seq_id -> last access timestamp

    def _allocate_block(self) -> int:
        if not self.free_blocks:
            raise MemoryError("Out of KV cache blocks")
        return self.free_blocks.pop()

    def allocate_sequence(self, seq_id: int, max_length: int):
        """Allocate KV cache blocks for a new sequence"""
        num_blocks = (max_length + self.block_size - 1) // self.block_size
        block_table = [self._allocate_block() for _ in range(num_blocks)]
        self.sequence_tables[seq_id] = block_table
        return block_table

    def evict_blocks(self, num_blocks_needed: int):
        """Evict blocks from the least recently used sequences"""
        # Sort sequences by last access time
        lru_order = sorted(
            self.sequence_access_time.items(),
            key=lambda x: x[1]
        )
        evicted_blocks = []
        for seq_id, _ in lru_order:
            if len(evicted_blocks) >= num_blocks_needed:
                break
            # Copy the blocks to pinned CPU memory for later reuse (not shown)
            self._move_to_cpu(seq_id)
            evicted_blocks.extend(self.sequence_tables.pop(seq_id))
        return evicted_blocks
```
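The LRU policy above needs only a timestamp per sequence. The sketch below isolates that bookkeeping; a monotonically increasing counter stands in for real timestamps, and the class name is illustrative:

```python
class LRUTracker:
    """Track per-sequence access order for eviction decisions."""
    def __init__(self):
        self.clock = 0
        self.last_access = {}  # seq_id -> logical timestamp

    def touch(self, seq_id):
        """Record an access; called on every scheduling step that uses seq_id."""
        self.clock += 1
        self.last_access[seq_id] = self.clock

    def eviction_order(self):
        """Sequences sorted least-recently-used first."""
        return [s for s, _ in sorted(self.last_access.items(), key=lambda x: x[1])]

lru = LRUTracker()
for seq in ["a", "b", "c"]:
    lru.touch(seq)
lru.touch("a")               # "a" becomes most recently used
print(lru.eviction_order())  # ['b', 'c', 'a']
```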
Performance Benefits
Throughput Improvements
| System | Batch Size | Throughput (tok/s) | Improvement |
|---|---|---|---|
| HuggingFace | 16 | 50 | 1x |
| vLLM (PagedAttention) | 16 | 450 | 9x |
| vLLM | 256 | 3,200 | 64x |
| vLLM (Optimized) | 512 | 6,400 | 128x |
Memory Efficiency
PagedAttention dramatically reduces memory waste:
| Scenario | Traditional | PagedAttention | Savings |
|---|---|---|---|
| 1000 sequences, avg 256 tokens | 128 GB allocated | 48 GB used | 62% |
| Single 8K sequence | 8 GB fixed | 4 GB actual | 50% |
| Mixed length requests | 80% fragmentation | <5% fragmentation | 75% |
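These savings follow from a simple bound: with paging, a sequence wastes at most block_size - 1 token slots (in its final, partially filled block), versus max_length - length under upfront allocation. A quick comparison, assuming a hypothetical 4096-token max_length and block size 16:

```python
BLOCK_SIZE = 16
MAX_LENGTH = 4096

def paged_waste(length):
    """Unused token slots in the last, partially filled block."""
    return -length % BLOCK_SIZE

def traditional_waste(length, max_length=MAX_LENGTH):
    """Unused token slots in a fully pre-allocated buffer."""
    return max_length - length

print(traditional_waste(250))  # 3846 slots wasted
print(paged_waste(250))        # 6 slots wasted
```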
Advanced Features
Block-Sized KV Cache
```python
# Physical layout of the KV cache:
# [num_blocks, num_kv_heads, head_size, block_size]
# Example: FP16, 8 KV heads, head_size 128, block_size 16
# Per block (K or V alone): 8 * 128 * 16 * 2 bytes = 32 KB
```
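The arithmetic in that comment can be checked directly:

```python
def block_bytes(num_kv_heads=8, head_size=128, block_size=16, bytes_per_elem=2):
    """Bytes per physical block for K (or V) alone."""
    return num_kv_heads * head_size * block_size * bytes_per_elem

print(block_bytes() // 1024)  # 32 (KB)
```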
Flexible Context Handling
```python
from typing import Dict, List
import torch

def handle_variable_length(
    queries: List[torch.Tensor],
    block_tables: Dict[int, List[int]],
    context_lengths: List[int],
):
    """Handle different sequence lengths efficiently"""
    # Sort by length for better GPU utilization
    sorted_indices = sorted(
        range(len(queries)),
        key=lambda i: context_lengths[i],
        reverse=True,
    )
    # Process in sorted order for better memory coalescing
    for idx in sorted_indices:
        query = queries[idx]
        blocks = block_tables[idx]
        length = context_lengths[idx]
        # Compute attention for this sequence
        # (compute_paged_attention stands in for a paged attention kernel)
        yield compute_paged_attention(query, blocks, length)
```
Practical Implementation with vLLM
Basic Usage
```python
from vllm import LLM, SamplingParams

# Initialize the vLLM engine
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # Use 4 GPUs
    max_num_seqs=256,        # Maximum concurrent sequences
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=512,
)

# Generate from multiple prompts
prompts = [
    "Write a story about artificial intelligence",
    "Explain quantum computing in simple terms",
    "What are the benefits of meditation?",
]

# vLLM automatically uses PagedAttention
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Advanced Configuration
```python
from vllm import EngineArgs
from vllm.engine.llm_engine import LLMEngine

# Fine-tune PagedAttention behavior
# (exact argument names vary across vLLM versions)
engine_args = EngineArgs(
    model="meta-llama/Llama-2-70b-hf",
    kv_cache_dtype="auto",      # "fp8" further shrinks the KV cache
    block_size=16,              # Tune block size
    gpu_memory_utilization=0.9,
    swap_space=32,              # GiB of CPU swap space for overflow
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
engine = LLMEngine.from_engine_args(engine_args)
```
Comparison with Other Techniques
vs. Traditional KV Cache
| Aspect | Traditional | PagedAttention |
|---|---|---|
| Memory Layout | Contiguous | Paged/Non-contiguous |
| Fragmentation | High | Minimal |
| Batch Size | Limited | Much larger |
| Flexibility | Fixed allocation | Dynamic allocation |
Complementary Techniques
PagedAttention works synergistically with:
- FlashAttention: Faster attention computation within blocks
- KV Cache Quantization: Further memory reduction
- Speculative Decoding: Better batch utilization
- Continuous Batching: Improved request scheduling
Conclusion
PagedAttention represents a fundamental breakthrough in LLM serving infrastructure. By applying operating system concepts to transformer inference, it achieves:
- Up to 24x throughput improvement over naive implementations
- Dramatic reduction in memory fragmentation
- Support for much larger batch sizes
- Flexible memory management for variable-length sequences
The technique has become a cornerstone of modern LLM serving infrastructure, enabling more cost-effective and scalable deployments. As models grow larger and serving demands increase, PagedAttention’s importance will only continue to grow.
Resources
- vLLM Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023)
- vLLM Official GitHub
- PagedAttention Technical Deep Dive
- UC Berkeley Systems Lab