⚡ Calmops

SQLite Internals: Architecture, B-Tree, and Query Processing

Introduction

Understanding SQLite’s internal architecture is essential for developers who want to optimize performance, debug issues, and make informed architectural decisions. SQLite’s design prioritizes simplicity, reliability, and performance within a constrained environment, resulting in a unique architecture that differs significantly from traditional client-server databases.

This article explores SQLite’s internal mechanisms, including the B-Tree storage engine, WAL mode implementation, query processing pipeline, and transaction management.


Architecture Overview

Core Components

SQLite consists of several interconnected layers:

┌─────────────────────────────────────┐
│     Application / Interface         │
├─────────────────────────────────────┤
│     SQL Command Processor           │
│     (Parser, Analyzer, Code Gen)    │
├─────────────────────────────────────┤
│     Query Executor                  │
├─────────────────────────────────────┤
│     B-Tree Module                   │
│     (Table B-Tree, Index B-Tree)    │
├─────────────────────────────────────┤
│     Pager Module                    │
│     (Cache, I/O, Transactions)      │
├─────────────────────────────────────┤
│     OS Interface (VFS)              │
└─────────────────────────────────────┘

Key Modules

  • Interface Layer: SQLite3 API, CLI, language bindings
  • Tokenizer/Parser: SQL statement parsing
  • Code Generator: Converts parse tree to bytecode
  • Virtual Machine: Executes bytecode operations
  • B-Tree Module: Manages table and index storage
  • Pager Module: Handles caching, transactions, WAL
  • VFS (Virtual File System): Platform-specific I/O

B-Tree Storage Engine

Understanding B-Trees in SQLite

SQLite uses B-Trees for all data storage. There are two types:

  1. B-Tree (Index B-Tree): Used for indexes, stores only keys and pointers
  2. B+Tree (Table B-Tree): Used for tables, stores all data in leaf nodes

Page Structure

// Simplified b-tree page header (layout as stored on disk)
struct PageHeader {
    uint8_t  page_type;       // 0x0D = leaf table, 0x05 = interior table
    uint16_t first_freeblock; // Offset to first freeblock (0 if none)
    uint16_t cell_count;      // Number of cells on this page
    uint16_t cell_content;    // Start of cell content area
    uint8_t  fragmented_bytes;// Fragmented free bytes in content area
};                            // Interior pages add a 4-byte right-most pointer

Page Types

-- Check page type
-- Each b-tree page starts with a 1-byte type indicator:
-- 0x02: Interior index b-tree page
-- 0x05: Interior table b-tree page
-- 0x0A: Leaf index b-tree page
-- 0x0D: Leaf table b-tree page
-- Pointer map pages (auto-vacuum) and the lock-byte page have no
-- type byte; they are identified by their position in the file.
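These type bytes are easy to observe directly: page 1 begins with the 100-byte database file header, so its b-tree page header starts at offset 100. A small sketch using Python's stdlib `sqlite3` and a throwaway temp file:

```python
import os
import sqlite3
import tempfile

# Create a minimal database; page 1 holds the 100-byte file header
# followed by the root (leaf) page of the sqlite_master table.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (x)")
con.commit()
con.close()

with open(path, "rb") as f:
    header = f.read(16)        # magic: b"SQLite format 3\x00"
    f.seek(100)                # page 1's b-tree header follows the file header
    page_type = f.read(1)[0]

print(hex(page_type))          # 0xd -> leaf table b-tree page
```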

B+Tree for Tables

-- Database file structure
-- Header page (page 1):
--   - 100-byte database file header
--   - Schema table (sqlite_master)

-- Table structure:
-- CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);

-- B+Tree layout for users table:
-- Interior nodes: [pointer|key|pointer|key|...]
-- Leaf nodes: [payload|payload|payload...]

B-Tree for Indexes

-- Index structure:
-- CREATE INDEX idx_users_email ON users(email);

-- B-Tree layout:
-- Interior nodes: [left_pointer|key1|right_pointer|key2|...]
-- Leaf nodes: [index_entry|index_entry|...]
-- Each index entry contains: key value + rowid pointer

Cell Structure

// Cell format for a leaf table page (lengths are varints on disk)
struct Cell {
    varint payload_length;    // Bytes in payload
    varint rowid;             // Integer key (tables only)
    uint8_t payload[];        // Record data (may spill to overflow pages)
};

// Record format (all header fields are varints)
struct Record {
    varint header_size;      // Header size in bytes, including itself
    varint type1;            // Serial type of column 1
    varint type2;            // Serial type of column 2
    // ...
    uint8_t value1[];        // Column 1 value
    uint8_t value2[];        // Column 2 value
    // ...
};
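The varints used throughout the record format are big-endian, 7 bits per byte, with the high bit as a continuation flag; a 9th byte, if present, contributes all 8 of its bits. A small decoder sketch:

```python
def decode_varint(buf, offset=0):
    """Decode a SQLite big-endian varint (1-9 bytes).
    Returns (value, bytes_consumed)."""
    value = 0
    for i in range(8):
        byte = buf[offset + i]
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:       # high bit clear -> last byte
            return value, i + 1
    # A 9th byte contributes all 8 bits (values up to 2^64 - 1)
    value = (value << 8) | buf[offset + 8]
    return value, 9

print(decode_varint(bytes([0x7F])))        # (127, 1)
print(decode_varint(bytes([0x81, 0x00]))) # (128, 2)
```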

Pager Module

Responsibilities

The Pager module is the bridge between B-Trees and the filesystem:

  1. Page Cache: In-memory cache of database pages
  2. Transaction Management: BEGIN, COMMIT, ROLLBACK
  3. Journaling: Write-ahead logging or rollback journals
  4. Locking: Database-level file locking
  5. Recovery: Crash recovery and hot journals

Page Cache

# Pager operation conceptual flow

# Read page:
# 1. Check if page in cache
# 2. If not, read from disk
# 3. Add to cache, evict if full

# Write page:
# 1. Mark page as dirty
# 2. Write to journal (if in transaction)
# 3. Write to database file
# 4. Mark clean after sync
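The read path above can be sketched as a toy LRU cache. This is only a conceptual model; SQLite's real pcache is more elaborate, with dirty-page tracking and spill logic:

```python
from collections import OrderedDict

class PageCache:
    """Toy LRU page cache mirroring the conceptual read path:
    check cache -> on miss read from disk -> insert, evicting LRU."""
    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page        # callable: page number -> bytes
        self.pages = OrderedDict()        # page number -> data, in LRU order

    def get(self, pgno):
        if pgno in self.pages:            # 1. page already cached
            self.pages.move_to_end(pgno)  #    refresh LRU position
            return self.pages[pgno]
        data = self.read_page(pgno)       # 2. miss: read from "disk"
        self.pages[pgno] = data           # 3. add to cache
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
        return data

# Fake "disk" that records which pages were actually read
reads = []
def fake_disk(pgno):
    reads.append(pgno)
    return b"page-%d" % pgno

cache = PageCache(2, fake_disk)
cache.get(1); cache.get(2); cache.get(1)  # second get(1) is a cache hit
cache.get(3)                              # evicts page 2 (least recent)
print(reads)                              # [1, 2, 3]
```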

Journal Modes

-- DELETE mode (default, legacy)
-- On commit: delete rollback journal

-- TRUNCATE mode
-- On commit: truncate journal to zero bytes

-- PERSIST mode
-- On commit: leave journal header, mark invalid

-- MEMORY mode
-- Keep rollback journal in RAM; faster, but a crash or power loss
-- mid-transaction will likely corrupt the database

-- WAL mode (recommended)
-- Write-ahead log for better concurrency

Write-Ahead Logging (WAL)

How WAL Works

WAL mode changes how transactions are recorded:

  1. Write to WAL: Modifications go to WAL instead of database
  2. Concurrent Reads: Readers access database without locking
  3. Checkpoint: WAL is periodically merged back to database
  4. Recovery: On crash, replay WAL to recover changes

-- Enable WAL mode
PRAGMA journal_mode = WAL;

-- WAL file structure:
-- 1. WAL header (32 bytes)
-- 2. Frame header + page for each commit
--    - Frame header: salt, checksum, page number, size
--    - Page: 4KB database page

-- Checkpoint operation:
-- - Reader: Access database, ignore uncommitted WAL frames
-- - Checkpointer: Copy committed frames to database
-- - Cleanup: Truncate WAL after checkpoint
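Both the mode switch and the companion `-wal` file can be observed from Python's stdlib `sqlite3`. A file-backed database is required, since in-memory databases cannot use WAL:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal_demo.db")
con = sqlite3.connect(path)

# The pragma returns the journal mode actually in effect
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)                                # wal

con.execute("CREATE TABLE t (x)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

# Committed frames sit in the -wal file until a checkpoint merges them
wal_exists = os.path.exists(path + "-wal")
print(wal_exists)                          # True
con.close()    # the last close checkpoints and removes the WAL file
```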

WAL Frame Format

// WAL frame header (24 bytes, followed by one page image)
struct WalFrame {
    uint32_t page_number;    // Database page number
    uint32_t db_size;        // DB size in pages after commit
                             // (0 = not a commit frame)
    uint32_t salt0;          // Copied from WAL header; same for all frames
    uint32_t salt1;          // Copied from WAL header; changes on WAL reset
    uint32_t checksum0;      // Cumulative checksum (first half)
    uint32_t checksum1;      // Cumulative checksum (second half)
    uint8_t page[4096];      // Page image (assuming a 4KB page size)
};

WAL Checkpoint

-- Checkpoint types:
PRAGMA wal_checkpoint(PASSIVE);  -- Non-blocking, best effort
PRAGMA wal_checkpoint(FULL);     -- Blocking checkpoint
PRAGMA wal_checkpoint(RESTART);  -- Like FULL, then restart the WAL
PRAGMA wal_checkpoint(TRUNCATE); -- Truncate WAL file after

-- Auto-checkpoint (default: 1000 pages)
PRAGMA wal_autocheckpoint = 1000;

Read Consistency in WAL

# Reader behavior in WAL mode

# Each reader sees:
# 1. Last committed frame before their start
# 2. No uncommitted or subsequent frames
# 3. Consistent snapshot of database

# Implementation:
# - Read begin: get end-of-WAL frame number (read marker)
# - During read: use database pages + committed WAL frames
# - Read end: no cleanup needed

Query Processing Pipeline

Execution Flow

SQL Query
    │
    ▼
┌─────────────┐
│  Tokenizer  │  Split SQL into tokens
└─────────────┘
    │
    ▼
┌─────────────┐
│   Parser    │  Build parse tree (syntax checking)
└─────────────┘
    │
    ▼
┌─────────────┐
│   Analyzer  │  Resolve names, determine types
└─────────────┘
    │
    ▼
┌─────────────┐
│  Code Gen   │  Generate bytecode (VM opcodes)
└─────────────┘
    │
    ▼
┌─────────────┐
│ VM/Executor │  Execute bytecode on B-Trees
└─────────────┘
    │
    ▼
Result Set
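The code generator's output can be inspected directly: plain EXPLAIN (as opposed to EXPLAIN QUERY PLAN) lists the VDBE bytecode for a statement. Opcode names are stable enough to read, though the exact program varies by SQLite version:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# EXPLAIN rows: addr, opcode, p1, p2, p3, p4, p5, comment
rows = con.execute("EXPLAIN SELECT name FROM users WHERE id = 1").fetchall()
opcodes = [r[1] for r in rows]
print(opcodes)   # e.g. ['Init', 'OpenRead', 'SeekRowid', ..., 'Halt', ...]
```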

Query Plan

-- Simple query plan
EXPLAIN QUERY PLAN 
SELECT * FROM users WHERE id = 1;

-- Output:
-- `--SEARCH users USING INTEGER PRIMARY KEY (rowid=?)

-- Join query plan
EXPLAIN QUERY PLAN
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.status = 'active';

-- Output (assuming indexes on users.status and orders.user_id exist):
-- |--SEARCH u USING INDEX idx_status (status=?)
-- `--SEARCH o USING INDEX idx_orders_user (user_id=?)
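The SCAN-to-SEARCH transition is easy to reproduce (stdlib `sqlite3`; `idx_users_email` is an illustrative name):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows: id, parent, notused, detail
    return " | ".join(r[3] for r in
                      con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'a@b.c'"
p1 = plan(query)
print(p1)        # SCAN users  (no usable index -> full table scan)

con.execute("CREATE INDEX idx_users_email ON users(email)")
p2 = plan(query)
print(p2)        # SEARCH users USING INDEX idx_users_email (email=?)
```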

Virtual Machine Opcodes

// SQLite VM opcodes (simplified; actual numbers vary by version)
enum {
    OP_Goto        = 1,  // Jump to instruction
    OP_OpenRead    = 2,  // Open table/index for reading
    OP_OpenWrite   = 3,  // Open table for writing
    OP_String8     = 4,  // Load string constant
    OP_Integer     = 5,  // Load integer constant
    OP_IsNull      = 6,  // Check if value is NULL
    OP_Column      = 7,  // Get column from cursor
    OP_ResultRow   = 8,  // Return result row
    OP_MakeRecord  = 9,  // Build record from values
    OP_Insert      = 10, // Insert row into table
    OP_Transaction = 11, // Start transaction
    OP_Halt        = 12, // End program
    // ... hundreds more
};

Transaction Management and MVCC

ACID Properties

SQLite implements full ACID transactions:

  • Atomicity: All operations in transaction succeed or all fail
  • Consistency: Database moves from one valid state to another
  • Isolation: Concurrent transactions don’t interfere
  • Durability: Committed data survives crashes

Locking Levels

-- SQLite lock states (progressive):
-- UNLOCKED -> SHARED -> RESERVED -> PENDING -> EXCLUSIVE

-- SHARED: Read access, multiple allowed
-- RESERVED: About to write, one allowed, readers OK
-- PENDING: Waiting for readers to release, new readers blocked
-- EXCLUSIVE: Write access, exclusive

-- Deadlock prevention:
-- Only one writer at a time
-- Readers don't block writers (in WAL mode)
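The single-writer rule is observable in rollback-journal mode: once one connection reaches RESERVED via BEGIN IMMEDIATE, a second writer gets SQLITE_BUSY. A sketch using stdlib `sqlite3` with a zero busy timeout so the failure is immediate:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lock.db")
a = sqlite3.connect(path, isolation_level=None)
b = sqlite3.connect(path, isolation_level=None, timeout=0)
a.execute("CREATE TABLE t (x)")

a.execute("BEGIN IMMEDIATE")       # connection a takes the RESERVED lock

try:
    b.execute("BEGIN IMMEDIATE")   # second writer cannot also reach RESERVED
    busy_error = None
except sqlite3.OperationalError as e:
    busy_error = str(e)
print(busy_error)                  # database is locked

a.execute("ROLLBACK")              # release the lock...
b.execute("BEGIN IMMEDIATE")       # ...and the second writer proceeds
b.execute("ROLLBACK")
```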

Write Transaction Flow

# Simplified write transaction

def begin_write():
    acquire_shared_lock()        # Read access first
    acquire_reserved_lock()      # Signal intent to write; readers still OK
    create_journal_file()        # Create/clear rollback journal

def write_page(page_data):
    write_to_journal(page_data)  # Save original page image first
    mark_page_dirty(page_data)

def commit():
    sync_journal_to_disk()       # Ensure journal is durable
    acquire_exclusive_lock()     # PENDING -> EXCLUSIVE once readers drain
    write_dirty_pages()          # Update main database file
    delete_journal()             # Commit point (or checkpoint in WAL mode)
    release_locks()

Rollback Mechanism

-- Rollback journal structure:
-- 1. Journal header (first sector)
--    - Database page count
--    - Random nonce
--    - Initial database size
-- 2. Original page images (for each modified page)

-- On crash before commit:
-- 1. Read journal header
-- 2. For each page in journal:
--    - Write original page back to database
-- 3. Delete journal

-- On crash after commit but before checkpoint:
-- 1. Database contains committed changes
-- 2. WAL contains changes (for recovery)
-- 3. On next open: replay WAL

Memory Management

Page Cache

-- Configure page cache size
PRAGMA cache_size = -64000;  -- 64MB (negative = size in KiB)

-- Default: -2000 (about 2MB)
-- Positive values are counted in pages instead:
-- e.g. 2000 pages ~ 7.8MB at a 4KB page size

-- Each connection has its own cache unless shared-cache mode is enabled

Memory-Mapped I/O

-- Enable memory-mapped I/O
PRAGMA mmap_size = 268435456;  -- 256MB

-- Default: 0 (disabled)
-- Maximum: 2GB on 64-bit systems

-- Works with WAL mode
-- Not supported on all platforms

Temporary Storage

-- Temp store configuration: DEFAULT (0), FILE (1), or MEMORY (2)
PRAGMA temp_store = FILE;

-- Keep temporary tables and indexes in memory:
PRAGMA temp_store = MEMORY;

-- Default temp store directory:
-- Linux: /tmp
-- Windows: TEMP environment variable

Vacuum and Auto-Vacuum

How Vacuum Works

-- Manual vacuum (full)
VACUUM;

-- Auto-vacuum (incremental)
PRAGMA auto_vacuum = INCREMENTAL;  -- Must be set before tables exist
                                   -- (or be followed by VACUUM)
PRAGMA incremental_vacuum(1000);   -- Free up to 1000 pages

-- Auto-vacuum modes:
-- NONE (default): freed pages stay on the freelist
-- FULL: free pages moved to the end and truncated on every commit
-- INCREMENTAL: freed pages reclaimed on demand

Vacuum Process

-- VACUUM operation:
-- 1. Begin exclusive transaction
-- 2. Create temporary database file
-- 3. Copy all content into it (rebuilding b-trees, dropping free pages)
-- 4. Copy the rebuilt content back over the original
-- 5. Commit

-- Requires up to 2x the database size in temporary disk space
-- Can be slow for large databases
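The space reclamation is measurable: deleting rows leaves the file size unchanged (pages just move to the freelist), while VACUUM shrinks the file. A sketch with stdlib `sqlite3`:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "vac.db")
con = sqlite3.connect(path, isolation_level=None)
con.execute("CREATE TABLE t (x)")

con.execute("BEGIN")
for _ in range(5000):
    con.execute("INSERT INTO t VALUES (?)", ("padding-" * 8,))
con.execute("COMMIT")

con.execute("DELETE FROM t")   # pages go to the freelist; file keeps its size
before = os.path.getsize(path)

con.execute("VACUUM")          # rebuild the file without free pages
after = os.path.getsize(path)
print(before, after)           # exact sizes vary; after is much smaller
con.close()
```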

Index Implementation

B-Tree Index Structure

-- Index in SQLite is a B-Tree (not B+Tree)
-- Interior: [pointer|key|pointer|key|...]
-- Key = column values + rowid

-- Index leaf entry:
-- 1. Payload length
-- 2. Rowid
-- 3. Column values (key)

-- Example:
-- CREATE INDEX idx_name ON users(name, email);

-- Index structure:
-- [name|email|rowid] -> points to table row

Covering Index

-- Covering index includes all queried columns
CREATE INDEX idx_covering ON users(status, name, email);

-- Query uses only index, no table lookup
EXPLAIN QUERY PLAN 
SELECT name, email FROM users WHERE status = 'active';

-- Output should show "USING COVERING INDEX"
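This is straightforward to verify from stdlib `sqlite3` (table and index names match the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT, "
            "name TEXT, email TEXT)")
con.execute("CREATE INDEX idx_covering ON users(status, name, email)")

# EXPLAIN QUERY PLAN rows: id, parent, notused, detail
detail = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name, email FROM users WHERE status = 'active'"
).fetchone()[3]
print(detail)   # SEARCH users USING COVERING INDEX idx_covering (status=?)
```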

Partial Index

-- Index only subset of rows
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';

-- Index automatically maintained only for matching rows
-- Smaller, faster index
-- But can't be used for queries without WHERE status = 'active'
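The planner's use (and non-use) of a partial index can be checked the same way; the index is only considered when the query's WHERE clause implies the index predicate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, "
            "email TEXT, status TEXT)")
con.execute("CREATE INDEX idx_active_users ON users(email) "
            "WHERE status = 'active'")

def detail(sql):
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# WHERE clause implies status = 'active' -> partial index is usable
d1 = detail("SELECT * FROM users "
            "WHERE email = 'a@b.c' AND status = 'active'")
# Without the predicate the index cannot answer the query -> full scan
d2 = detail("SELECT * FROM users WHERE email = 'a@b.c'")
print(d1)
print(d2)
```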

Cost-Based Query Optimizer

Query Planner

SQLite uses a cost-based optimizer:

-- Planner considers:
-- 1. Number of index entries
-- 2. Estimated rows returned
-- 3. Index selectivity
-- 4. Available indexes

-- Statistics gathering:
ANALYZE;  -- Scan tables, build statistics

-- Statistics stored in sqlite_stat1, sqlite_stat4
SELECT * FROM sqlite_stat1;
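A quick look at what ANALYZE records (stdlib `sqlite3`; `idx_status` is an illustrative index): the stat column's first number is the total number of index entries, followed by the average number of rows per distinct key prefix:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT)")
con.execute("CREATE INDEX idx_status ON users(status)")
con.executemany("INSERT INTO users (status) VALUES (?)",
                [("active",)] * 90 + [("banned",)] * 10)
con.commit()

con.execute("ANALYZE")
# sqlite_stat1 columns: tbl, idx, stat ("total avg_per_key ...")
stats = con.execute("SELECT * FROM sqlite_stat1").fetchall()
for tbl, idx, stat in stats:
    print(tbl, idx, stat)    # e.g. users idx_status 100 50
```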

Optimizer Flags

-- Run planner maintenance (may run ANALYZE if statistics look stale)
PRAGMA optimize;

-- Query planning hints
-- Use INDEXED BY clause (rarely used):
SELECT * FROM users INDEXED BY idx_name WHERE name = 'John';

-- Use NOT INDEXED (force table scan):
SELECT * FROM users NOT INDEXED WHERE name LIKE '%John%';

Best Practices for Performance

Understanding Query Plans

-- Always EXPLAIN QUERY PLAN
EXPLAIN QUERY PLAN 
SELECT * FROM orders 
WHERE customer_id = 123 
ORDER BY created_at DESC;

-- Look for:
-- SCAN (full table scan - bad)
-- SEARCH (indexed access - good)
-- USING COVERING INDEX (optimal)
-- USING TEMP B-TREE (sorting - consider index)

Index Design

-- Create indexes for WHERE clauses
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Composite indexes: order matters
CREATE INDEX idx_orders_cust_date ON orders(customer_id, created_at DESC);

-- First column: equality searches
-- Second column: range/sort searches

-- SQLite has no INCLUDE clause; widen the index key instead
-- to make it covering:
CREATE INDEX idx_lookup
ON orders(customer_id, order_id, total, status);

Common Pitfalls

Pitfall 1: Unoptimized Queries

-- Bad: Function on indexed column
SELECT * FROM users WHERE LOWER(name) = 'john';

-- Good: Use column directly
SELECT * FROM users WHERE name = 'JOHN' COLLATE NOCASE;

-- Or create an expression index
CREATE INDEX idx_lower_name ON users(LOWER(name));

Pitfall 2: Inefficient LIKE Patterns

-- Bad: Leading wildcard
SELECT * FROM users WHERE name LIKE '%son';

-- Good: Prefix match
SELECT * FROM users WHERE name LIKE 'son%';

-- Use FTS for full-text search
CREATE VIRTUAL TABLE docs USING fts5(content);

Pitfall 3: Missing Statistics

-- Always run ANALYZE after:
-- - Major data changes
-- - Index creation
-- - Database upgrade
-- - Before query optimization

ANALYZE;

Conclusion

SQLite’s internal architecture represents a carefully balanced design that prioritizes simplicity, reliability, and performance. Understanding these internals helps developers make better decisions about indexing, transaction management, and query optimization.

In the next article, we’ll explore recent SQLite developments and trends in 2025-2026, including new features and emerging use cases.
