Introduction
Understanding SQLite’s internal architecture is essential for developers who want to optimize performance, debug issues, and make informed architectural decisions. SQLite’s design prioritizes simplicity, reliability, and performance within a constrained environment, resulting in a unique architecture that differs significantly from traditional client-server databases.
This article explores SQLite’s internal mechanisms, including the B-Tree storage engine, WAL mode implementation, query processing pipeline, and transaction management.
Architecture Overview
Core Components
SQLite consists of several interconnected layers:
┌──────────────────────────────────────┐
│        Application / Interface       │
├──────────────────────────────────────┤
│         SQL Command Processor        │
│      (Parser, Analyzer, Code Gen)    │
├──────────────────────────────────────┤
│            Query Executor            │
├──────────────────────────────────────┤
│             B-Tree Module            │
│      (Table B-Tree, Index B-Tree)    │
├──────────────────────────────────────┤
│             Pager Module             │
│       (Cache, I/O, Transactions)     │
├──────────────────────────────────────┤
│           OS Interface (VFS)         │
└──────────────────────────────────────┘
Key Modules
- Interface Layer: SQLite3 API, CLI, language bindings
- Tokenizer/Parser: SQL statement parsing
- Code Generator: Converts parse tree to bytecode
- Virtual Machine: Executes bytecode operations
- B-Tree Module: Manages table and index storage
- Pager Module: Handles caching, transactions, WAL
- VFS (Virtual File System): Platform-specific I/O
B-Tree Storage Engine
Understanding B-Trees in SQLite
SQLite uses B-Trees for all data storage. There are two types:
- B-Tree (Index B-Tree): Used for indexes, stores only keys and pointers
- B+Tree (Table B-Tree): Used for tables, stores all data in leaf nodes
Page Structure
// Simplified b-tree page header (8 bytes on disk; interior pages
// add a 4-byte right-most child pointer)
struct PageHeader {
    uint8_t  page_type;        // 0x0D = leaf table, 0x05 = interior table
    uint16_t first_freeblock;  // Offset to first freeblock (0 if none)
    uint16_t cell_count;       // Number of cells on this page
    uint16_t cell_content;     // Start of cell content area
    uint8_t  fragmented_bytes; // Fragmented free bytes in cell area
};
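These header fields can be observed directly on disk. The sketch below creates a small database and reads two documented facts from the file: the big-endian page size at offset 16 of the 100-byte file header, and the type byte that opens page 1's b-tree header (page 1 holds the schema table, which fits on a single leaf table page here). The file name is arbitrary.

```python
import os
import sqlite3
import struct
import tempfile

# Create a small database so page 1 contains the schema table's b-tree.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (x INTEGER)")
con.commit()
con.close()

with open(path, "rb") as f:
    header = f.read(100)                           # 100-byte database file header
    page_size = struct.unpack(">H", header[16:18])[0]
    page_type = f.read(1)[0]                       # page 1's b-tree header starts at offset 100

print(page_size)        # typically 4096 on current builds
print(hex(page_type))   # 0xd: page 1 is a leaf table b-tree page
```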
Page Types
-- Check page type
-- Each b-tree page starts with a 1-byte type indicator:
-- 0x0D: Leaf table b-tree page (row data)
-- 0x05: Interior table b-tree page
-- 0x0A: Leaf index b-tree page
-- 0x02: Interior index b-tree page
-- Pointer-map pages (auto-vacuum) and the lock-byte page are
-- not b-tree pages and carry no type byte
B+Tree for Tables
-- Database file structure
-- Header page (page 1):
-- - 100-byte database file header
-- - Schema table (sqlite_master)
-- Table structure:
-- CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
-- B+Tree layout for users table:
-- Interior nodes: [pointer|key|pointer|key|...]
-- Leaf nodes: [payload|payload|payload...]
B-Tree for Indexes
-- Index structure:
-- CREATE INDEX idx_users_email ON users(email);
-- B-Tree layout:
-- Interior nodes: [left_pointer|key1|right_pointer|key2|...]
-- Leaf nodes: [index_entry|index_entry|...]
-- Each index entry contains: key value + rowid pointer
Cell Structure
// Cell format for a leaf table page (lengths and rowid are varints on disk)
struct Cell {
    varint  payload_length; // Bytes of payload
    varint  rowid;          // 64-bit integer key (tables only)
    uint8_t payload[];      // Record data; may spill to overflow pages
};
// Record format
struct Record {
varint header_size; // Header size in bytes
varint type1; // Column 1 type
varint type2; // Column 2 type
// ...
uint8_t value1[]; // Column 1 value
uint8_t value2[]; // Column 2 value
// ...
};
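The varints used throughout cells and record headers are big-endian, 7 bits per byte with the high bit as a continuation flag; a 9th byte, if present, contributes all 8 bits. A minimal decoder, following that published format:

```python
def decode_varint(buf, offset=0):
    """Return (value, bytes_consumed) for a SQLite varint."""
    value = 0
    for i in range(8):
        byte = buf[offset + i]
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:          # high bit clear: last byte
            return value, i + 1
    # 9th byte contributes all 8 bits
    value = (value << 8) | buf[offset + 8]
    return value, 9

print(decode_varint(b"\x7f"))      # (127, 1)
print(decode_varint(b"\x81\x00"))  # (128, 2)
```

Single-byte values (0-127) encode as themselves, which is why short records stay compact.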
Pager Module
Responsibilities
The Pager module is the bridge between B-Trees and the filesystem:
- Page Cache: In-memory cache of database pages
- Transaction Management: BEGIN, COMMIT, ROLLBACK
- Journaling: Write-ahead logging or rollback journals
- Locking: Database file locking through the VFS
- Recovery: Crash recovery and hot journals
Page Cache
# Pager operation conceptual flow
# Read page:
# 1. Check if page in cache
# 2. If not, read from disk
# 3. Add to cache, evict if full
# Write page:
# 1. Mark page as dirty
# 2. Write to journal (if in transaction)
# 3. Write to database file
# 4. Mark clean after sync
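The read path above is essentially an LRU cache in front of the file. Here is a toy sketch of that flow (not SQLite's actual pcache API; names are illustrative):

```python
from collections import OrderedDict

class PageCache:
    """Toy LRU page cache mirroring the read-path flow above."""
    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page   # fallback: fetch the page from "disk"
        self.pages = OrderedDict()   # page_number -> bytes, in LRU order

    def get(self, page_number):
        if page_number in self.pages:            # 1. check cache
            self.pages.move_to_end(page_number)  # mark most recently used
            return self.pages[page_number]
        data = self.read_page(page_number)       # 2. read from disk
        self.pages[page_number] = data           # 3. add to cache
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        return data

disk_reads = []
cache = PageCache(2, lambda n: disk_reads.append(n) or b"page%d" % n)
cache.get(1); cache.get(2); cache.get(1); cache.get(3)  # get(3) evicts page 2
print(disk_reads)  # [1, 2, 3]
```

SQLite's real cache additionally pins dirty pages and coordinates with the journal, but the hit/miss/evict skeleton is the same.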
Journal Modes
-- DELETE mode (default, legacy)
-- On commit: delete rollback journal
-- TRUNCATE mode
-- On commit: truncate journal to zero bytes
-- PERSIST mode
-- On commit: leave journal header, mark invalid
-- MEMORY mode
-- Keep journal in memory (fast, but no rollback if the app crashes mid-transaction)
-- WAL mode (recommended)
-- Write-ahead log for better concurrency
Write-Ahead Logging (WAL)
How WAL Works
WAL mode changes how transactions are recorded:
- Write to WAL: Modifications go to WAL instead of database
- Concurrent Reads: Readers access database without locking
- Checkpoint: WAL is periodically merged back to database
- Recovery: On crash, replay WAL to recover changes
-- Enable WAL mode
PRAGMA journal_mode = WAL;
-- WAL file structure:
-- 1. WAL header (32 bytes)
-- 2. Frame header + page for each commit
-- - Frame header: salt, checksum, page number, size
-- - Page: one database page (4096 bytes by default)
-- Checkpoint operation:
-- - Reader: Access database, ignore uncommitted WAL frames
-- - Checkpointer: Copy committed frames to database
-- - Cleanup: Truncate WAL after checkpoint
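A quick way to see the WAL machinery from Python's built-in sqlite3 module: switching the journal mode creates a `-wal` companion file that holds committed frames until they are checkpointed. The path is arbitrary.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal_demo.db")
con = sqlite3.connect(path)
mode = con.execute("PRAGMA journal_mode = WAL").fetchone()[0]
con.execute("CREATE TABLE t (x)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

wal_exists = os.path.exists(path + "-wal")  # committed frames live here
print(mode, wal_exists)                     # wal True
con.close()  # closing the last connection checkpoints and removes the WAL
```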
WAL Frame Format
// WAL frame header (24 bytes), followed by one database page
struct WalFrame {
    uint32_t page_number; // Database page number
    uint32_t db_size;     // DB size in pages after commit; 0 for non-commit frames
    uint32_t salt0;       // Copied from the WAL header
    uint32_t salt1;       // Copied from the WAL header
    uint32_t checksum0;   // Cumulative checksum, part 1
    uint32_t checksum1;   // Cumulative checksum, part 2
    uint8_t  page[];      // One page-size database page (4096 bytes by default)
};
WAL Checkpoint
-- Checkpoint types:
PRAGMA wal_checkpoint(PASSIVE); -- Non-blocking, best effort
PRAGMA wal_checkpoint(FULL); -- Blocking checkpoint
PRAGMA wal_checkpoint(RESTART); -- Like FULL; next writer restarts the WAL
PRAGMA wal_checkpoint(TRUNCATE); -- Truncate WAL file after
-- Auto-checkpoint (default: 1000 pages)
PRAGMA wal_autocheckpoint = 1000;
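`PRAGMA wal_checkpoint` also reports what it did: it returns a single row of `(busy, log_frames, checkpointed_frames)`. A small sketch (file name arbitrary; with no concurrent readers the checkpoint should not be blocked):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "ckpt.db")
con = sqlite3.connect(path, isolation_level=None)
con.execute("PRAGMA journal_mode = WAL")
con.execute("CREATE TABLE t (x)")
con.execute("INSERT INTO t VALUES (1)")

busy, log_frames, ckpt_frames = con.execute(
    "PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
print(busy)                     # 0: the checkpoint was not blocked
print(log_frames, ckpt_frames)  # frame counts after the checkpoint
con.close()
```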
Read Consistency in WAL
# Reader behavior in WAL mode
# Each reader sees:
# 1. Last committed frame before their start
# 2. No uncommitted or subsequent frames
# 3. Consistent snapshot of database
# Implementation:
# - Read begin: get end-of-WAL frame number (read marker)
# - During read: use database pages + committed WAL frames
# - Read end: no cleanup needed
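This snapshot behavior is observable from two connections: a reader that has started a read transaction keeps seeing the end-of-WAL mark captured at its first read, even after a writer commits. A minimal sketch (autocommit connections so the transaction boundaries are explicit):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "snap.db")
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode = WAL")
writer.execute("CREATE TABLE t (x)")
writer.execute("INSERT INTO t VALUES (1)")

reader = sqlite3.connect(path, isolation_level=None)
reader.execute("BEGIN")
before = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]  # snapshot taken

writer.execute("INSERT INTO t VALUES (2)")  # commits immediately (autocommit)

during = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]  # same snapshot
reader.execute("COMMIT")
after = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]   # new snapshot

print(before, during, after)  # 1 1 2
```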
Query Processing Pipeline
Execution Flow
SQL Query
     │
     ▼
┌─────────────┐
│  Tokenizer  │  Split SQL into tokens
└─────────────┘
     │
     ▼
┌─────────────┐
│   Parser    │  Build parse tree (syntax checking)
└─────────────┘
     │
     ▼
┌─────────────┐
│  Analyzer   │  Resolve names, determine types
└─────────────┘
     │
     ▼
┌─────────────┐
│  Code Gen   │  Generate bytecode (VM opcodes)
└─────────────┘
     │
     ▼
┌─────────────┐
│ VM/Executor │  Execute bytecode on B-Trees
└─────────────┘
     │
     ▼
Result Set
Query Plan
-- Simple query plan
EXPLAIN QUERY PLAN
SELECT * FROM users WHERE id = 1;
-- Output:
-- |--SEARCH users USING INTEGER PRIMARY KEY (id=1)
-- Join query plan
EXPLAIN QUERY PLAN
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.status = 'active';
-- Output (assuming indexes on users.status and orders.user_id):
-- |--SEARCH u USING INDEX idx_status (status=?)
-- `--SEARCH o USING INDEX idx_orders_user (user_id=?)
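Plans can also be inspected programmatically: `EXPLAIN QUERY PLAN` returns rows whose last column is the human-readable detail. The exact wording varies across SQLite versions, but the SEARCH-vs-SCAN distinction is stable enough to check in tests:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE id = 1").fetchall()
indexed = plan[0][-1]   # last column is the detail string
print(indexed)          # e.g. SEARCH users USING INTEGER PRIMARY KEY (rowid=?)

scan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE name = 'a'").fetchall()
print(scan[0][-1])      # e.g. SCAN users -- no usable index
```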
Virtual Machine Opcodes
// SQLite VM opcodes (simplified)
enum {
OP_Goto = 1, // Jump to instruction
OP_OpenRead = 2, // Open table/index for reading
OP_OpenWrite = 3, // Open table for writing
OP_String8 = 4, // Load string constant
OP_Integer = 5, // Load integer constant
OP_IsNull = 6, // Check if value is NULL
OP_Column = 7, // Get column from cursor
OP_ResultRow = 8, // Return result row
OP_MakeRecord = 9,// Build record from values
OP_Insert = 10, // Insert row into table
OP_Transaction = 11, // Start transaction
OP_Halt = 12, // End program
// ... hundreds more
};
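Note that the numeric values above are illustrative; opcode numbering is an internal detail that shifts between builds. The opcode names, however, are visible to any application: `EXPLAIN` (without `QUERY PLAN`) dumps the generated bytecode program.

```python
import sqlite3

con = sqlite3.connect(":memory:")
rows = con.execute("EXPLAIN SELECT 1 + 1").fetchall()
opcodes = [r[1] for r in rows]   # columns: addr, opcode, p1..p5, comment
print(opcodes)                   # includes ResultRow and Halt, among others
```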
Transaction Management and MVCC
ACID Properties
SQLite implements full ACID transactions:
- Atomicity: All operations in transaction succeed or all fail
- Consistency: Database moves from one valid state to another
- Isolation: Concurrent transactions don’t interfere
- Durability: Committed data survives crashes
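Atomicity in particular is easy to demonstrate: if any statement in a transaction fails and the transaction is rolled back, earlier statements in it leave no trace. A minimal sketch with an illustrative `accounts` table:

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")

con.execute("BEGIN")
con.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
try:
    con.execute("INSERT INTO accounts VALUES (1, 0)")  # violates PRIMARY KEY
except sqlite3.IntegrityError:
    con.execute("ROLLBACK")                            # undoes the UPDATE too

balances = con.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(100,), (0,)] -- original state restored
```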
Locking Levels
-- SQLite lock states (progressive):
-- UNLOCKED -> SHARED -> RESERVED -> PENDING -> EXCLUSIVE
-- SHARED: Read access, multiple allowed
-- RESERVED: About to write, one allowed, readers OK
-- PENDING: Waiting for readers to release, new readers blocked
-- EXCLUSIVE: Write access, exclusive
-- Deadlock prevention:
-- Only one writer at a time
-- Readers don't block writers (in WAL mode)
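The single-writer rule surfaces in application code as SQLITE_BUSY. In the sketch below, the first connection takes the RESERVED lock with `BEGIN IMMEDIATE`; a second connection with a zero busy timeout is refused rather than queued:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "locks.db")
a = sqlite3.connect(path, isolation_level=None)
a.execute("CREATE TABLE t (x)")
a.execute("BEGIN IMMEDIATE")     # take the RESERVED lock now

b = sqlite3.connect(path, isolation_level=None, timeout=0)
try:
    b.execute("BEGIN IMMEDIATE")  # second writer: refused immediately
    outcome = "acquired"
except sqlite3.OperationalError as e:
    outcome = str(e)

print(outcome)  # database is locked
a.execute("COMMIT")
```

In practice a nonzero `timeout` (or `PRAGMA busy_timeout`) makes the second writer wait instead of failing.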
Write Transaction Flow
# Simplified write transaction (rollback-journal mode)
def begin_write():
    acquire_shared_lock()        # Read access; other readers continue
    acquire_reserved_lock()      # Declare intent to write; one writer only
    create_journal_file()        # Create/clear rollback journal

def write_page(page_data):
    write_to_journal(page_data)  # Save the original page image first
    mark_page_dirty(page_data)

def commit():
    sync_journal_to_disk()       # Ensure journal is durable
    acquire_exclusive_lock()     # Block out readers for the file update
    write_dirty_pages()          # Update the main database file
    sync_database_to_disk()
    delete_journal()             # Commit point (or checkpoint in WAL mode)
    release_locks()
Rollback Mechanism
-- Rollback journal structure:
-- 1. Journal header (first sector)
-- - Database page count
-- - Random nonce
-- - Initial database size
-- 2. Original page images (for each modified page)
-- On crash before commit:
-- 1. Read journal header
-- 2. For each page in journal:
-- - Write original page back to database
-- 3. Delete journal
-- On crash after commit but before checkpoint (WAL mode):
-- 1. Committed changes are safe in the WAL file
-- 2. On next open: WAL frames are replayed automatically
-- 3. A later checkpoint merges them into the database file
Memory Management
Page Cache
-- Configure page cache size
PRAGMA cache_size = -64000; -- 64MB (negative = KiB)
-- Default: -2000 (2MB) in current SQLite versions
-- Maximum depends on available memory
-- Each connection has its own cache unless shared-cache mode is enabled
Memory-Mapped I/O
-- Enable memory-mapped I/O
PRAGMA mmap_size = 268435456; -- 256MB
-- Default: 0 (disabled)
-- Upper bound set at compile time (about 2GB by default)
-- Works with WAL mode
-- Not supported on all platforms
Temporary Storage
-- Temp store configuration
PRAGMA temp_store = FILE; -- DEFAULT, FILE, or MEMORY
-- Keep temp tables and indices in memory:
PRAGMA temp_store = MEMORY;
-- Default temp store directory:
-- Linux: /tmp
-- Windows: TEMP environment variable
Vacuum and Auto-Vacuum
How Vacuum Works
-- Manual vacuum (full)
VACUUM;
-- Auto-vacuum (incremental)
-- Must be set before tables are created, or followed by VACUUM
PRAGMA auto_vacuum = INCREMENTAL;
PRAGMA incremental_vacuum(1000); -- Reclaim up to 1000 free pages
-- Auto-vacuum modes:
-- NONE (default): freed pages stay on the freelist
-- FULL: free pages moved to the end and truncated at every commit
-- INCREMENTAL: pages reclaimed on demand via incremental_vacuum
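The incremental mode can be watched in action: after a large DELETE, freed pages sit on the freelist (`PRAGMA freelist_count`) until `PRAGMA incremental_vacuum` returns them to the filesystem. A sketch on a throwaway file:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "vac.db")
con = sqlite3.connect(path, isolation_level=None)
con.execute("PRAGMA auto_vacuum = INCREMENTAL")  # set before creating tables
con.execute("VACUUM")                            # rebuild so the setting sticks
con.execute("CREATE TABLE t (x)")
con.execute("BEGIN")
con.executemany("INSERT INTO t VALUES (?)", [("x" * 500,)] * 1000)
con.execute("COMMIT")
con.execute("DELETE FROM t")                     # pages go to the freelist

freed = con.execute("PRAGMA freelist_count").fetchone()[0]
con.execute("PRAGMA incremental_vacuum").fetchall()  # reclaim all free pages
remaining = con.execute("PRAGMA freelist_count").fetchone()[0]

print(freed > 0, remaining)  # True 0
```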
Vacuum Process
-- VACUUM operation:
-- 1. Begin exclusive transaction
-- 2. Rebuild the entire database into a temporary file
-- 3. Copy the rebuilt content back over the original file
-- 4. Commit
-- Requires up to 2x disk space temporarily
-- Can be slow for large databases
Index Implementation
B-Tree Index Structure
-- Index in SQLite is a B-Tree (not B+Tree)
-- Interior: [pointer|key|pointer|key|...]
-- Key = column values + rowid
-- Index leaf entry:
-- 1. Payload length (varint)
-- 2. Payload: indexed column values followed by the rowid
-- Example:
-- CREATE INDEX idx_name ON users(name, email);
-- Index structure:
-- [name|email|rowid] -> points to table row
Covering Index
-- Covering index includes all queried columns
CREATE INDEX idx_covering ON users(status, name, email);
-- Query uses only index, no table lookup
EXPLAIN QUERY PLAN
SELECT name, email FROM users WHERE status = 'active';
-- Output should show "USING COVERING INDEX"
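This is straightforward to verify from code: when every referenced column is part of the index key, the plan detail mentions a covering index. A minimal in-memory check:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, "
    "status TEXT, name TEXT, email TEXT)")
con.execute("CREATE INDEX idx_covering ON users(status, name, email)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name, email FROM users WHERE status = 'active'").fetchall()
print(plan[0][-1])  # e.g. SEARCH users USING COVERING INDEX idx_covering (status=?)
```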
Partial Index
-- Index only subset of rows
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
-- Index automatically maintained only for matching rows
-- Smaller, faster index
-- But can't be used for queries without WHERE status = 'active'
Cost-Based Query Optimizer
Query Planner
SQLite uses a cost-based optimizer:
-- Planner considers:
-- 1. Number of index entries
-- 2. Estimated rows returned
-- 3. Index selectivity
-- 4. Available indexes
-- Statistics gathering:
ANALYZE; -- Scan tables, build statistics
-- Statistics stored in sqlite_stat1, sqlite_stat4
SELECT * FROM sqlite_stat1;
Optimizer Flags
-- Note: PRAGMA optimize does not disable anything; it runs
-- opportunistic optimizations (such as ANALYZE) and is best
-- issued shortly before closing a connection:
PRAGMA optimize;
-- Query planning hints
-- Use INDEXED BY clause (rarely used):
SELECT * FROM users INDEXED BY idx_name WHERE name = 'John';
-- Use NOT INDEXED (force table scan):
SELECT * FROM users NOT INDEXED WHERE name LIKE '%John%';
Best Practices for Performance
Understanding Query Plans
-- Always EXPLAIN QUERY PLAN
EXPLAIN QUERY PLAN
SELECT * FROM orders
WHERE customer_id = 123
ORDER BY created_at DESC;
-- Look for:
-- SCAN (full table scan - bad)
-- SEARCH (indexed access - good)
-- USING COVERING INDEX (optimal)
-- USING TEMP B-TREE (sorting - consider index)
Index Design
-- Create indexes for WHERE clauses
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Composite indexes: order matters
CREATE INDEX idx_orders_cust_date ON orders(customer_id, created_at DESC);
-- First column: equality searches
-- Second column: range/sort searches
-- SQLite has no INCLUDE clause; to make an index covering,
-- append the extra columns to the index key
CREATE INDEX idx_lookup
ON orders(customer_id, order_id, total, status);
Common Pitfalls
Pitfall 1: Unoptimized Queries
-- Bad: Function on an indexed column defeats a plain index
SELECT * FROM users WHERE LOWER(name) = 'john';
-- Good: Use the column directly (indexed if the index uses NOCASE)
SELECT * FROM users WHERE name = 'John' COLLATE NOCASE;
-- Or create an expression index
CREATE INDEX idx_lower_name ON users(LOWER(name));
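A quick check that an expression index actually rescues the `LOWER(name)` query (expression indexes require SQLite 3.9 or later; the table and index names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE INDEX idx_lower_name ON users(LOWER(name))")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE LOWER(name) = 'john'"
).fetchall()
print(plan[0][-1])  # e.g. SEARCH users USING INDEX idx_lower_name (<expr>=?)
```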
Pitfall 2: Inefficient LIKE Patterns
-- Bad: Leading wildcard
SELECT * FROM users WHERE name LIKE '%son';
-- Good: Prefix match (index-eligible, subject to collation rules)
SELECT * FROM users WHERE name LIKE 'son%';
-- Use FTS for full-text search
CREATE VIRTUAL TABLE docs USING fts5(content);
Pitfall 3: Missing Statistics
-- Always run ANALYZE after:
-- - Major data changes
-- - Index creation
-- - Database upgrade
-- - Before query optimization
ANALYZE;
Conclusion
SQLite’s internal architecture represents a carefully balanced design that prioritizes simplicity, reliability, and performance. Understanding these internals helps developers make better decisions about indexing, transaction management, and query optimization.
In the next article, we’ll explore recent SQLite developments and trends in 2025-2026, including new features and emerging use cases.