
B-Trees and B+Trees: The Backbone of Databases

Created: March 8, 2026 · CalmOps · 15 min read

Introduction

Every time you run a SELECT query on a table with millions of rows, the response comes back in milliseconds. That speed is not magic—it is the result of a carefully designed data structure called the B+Tree working behind the scenes.

B-Trees and their close variant B+Trees are the foundation of nearly every relational database, document store, and file system in production today. PostgreSQL, MySQL, MongoDB, SQLite, and even operating systems like Linux (ext4) and Windows (NTFS) all rely on some form of B-Tree for indexing.

This article explains what B-Trees are, how they work, why B+Trees became the standard for databases, and how they compare with alternative index structures like LSM-Trees. You will also see concrete Python implementations, SQL index examples, and a detailed comparison table to help you choose the right index structure for your workload.

The Problem with Binary Trees on Disk

Binary search trees are elegant in memory. Insert, delete, and search all run in O(log n) time. A balanced BST with one million entries has a height of about 20. In RAM, traversing 20 pointers costs negligible time.

On disk, the same traversal is catastrophic. Each node access requires a random disk I/O, which takes roughly 10 milliseconds on a spinning hard drive. Twenty disk seeks means 200 milliseconds for a single lookup—unacceptable for any interactive application. Even on SSDs, where random reads cost ~100 microseconds, twenty reads add up to 2 milliseconds per query, limiting throughput to a few hundred queries per second.

The core problem is tree height. Binary trees are tall and skinny—each node holds one key and two child pointers. Databases need the opposite: short and fat trees where a single node access retrieves hundreds of keys at once.

Visual:

Binary tree (height=4, 15 keys):
          [k]
         /   \
      [k]     [k]
     /   \   /   \
   [k]  [k] [k]  [k]
   / \  / \ / \  / \
  [k][k][k][k][k][k][k][k]
One key per node → deep tree → many I/Os

B-Trees solve this by packing many keys into each node, reducing tree height to 3 or 4 levels even for billions of keys.
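The effect of fanout on height is easy to verify with a quick calculation (a sketch; `height` is a hypothetical helper, not from any database codebase):

```python
import math

def height(n, branching):
    """Approximate height of a balanced tree holding n keys."""
    return math.ceil(math.log(n + 1, branching))

print(height(10**9, 2))    # binary tree over 1 billion keys: 30 levels
print(height(10**9, 200))  # branching factor 200: 4 levels
```

Thirty random I/Os per lookup versus four is the entire argument for B-Trees on disk.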

B-Tree Structure

A B-Tree of minimum degree t obeys these invariants:

  • Every node except the root contains at least t-1 keys and at most 2t-1 keys.
  • A non-leaf node with k keys has k+1 children.
  • All leaves appear at the same depth—the tree is perfectly balanced.
  • Keys within a node are stored in sorted order.

The branching factor (the maximum number of children per node) is 2t. With t=100, a node can hold up to 199 keys and 200 children. A three-level B-Tree with t=100 holds approximately 200 × 200 × 200 = 8 million keys.
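The 8-million figure can be checked by summing node capacities level by level (a sketch; `max_keys` is a hypothetical helper):

```python
def max_keys(t, levels):
    """Maximum keys in a B-Tree of the given height when every
    node is full: 2t children and 2t - 1 keys per node."""
    b = 2 * t                                  # branching factor
    nodes = sum(b**i for i in range(levels))   # nodes across all levels
    return nodes * (b - 1)

print(max_keys(100, 3))  # → 7999999, i.e. ≈ 8 million
```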

Node structure:

Internal node (t=3, max keys=5):
┌──────────────────────────────────┐
│ [10] [25] [40] [55] [70]         │ keys
│  ▲    ▲    ▲    ▲    ▲     ▲      │ pointers to children
└──────────────────────────────────┘
Child pointers: 6 total (k+1)

Leaf node (t=3):
┌──────────────────────────────────┐
│ [5] [8] [12] [15] [22]           │ keys (and data pointers)
│ next → ○                         │ sibling pointer
└──────────────────────────────────┘

Python: B-Tree Node

class BTreeNode:
    def __init__(self, leaf=False):
        self.leaf = leaf
        self.keys = []        # sorted list of keys
        self.children = []    # child pointers (length = len(keys) + 1)
        self.parent = None    # useful for splits and merges

    def is_full(self, t):
        return len(self.keys) == 2 * t - 1

B-Tree Operations

Search starts at the root and descends. At each node, binary search locates the correct child pointer. The process requires at most h node accesses, where h is the tree height.

Python: B-Tree Search

def btree_search(node, key):
    """Return True if key exists in the subtree rooted at node."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and key == node.keys[i]:
        return True
    if node.leaf:
        return False
    return btree_search(node.children[i], key)

Performance: O(log n) comparisons, but only O(log_{t} n) node accesses. With t=100 and 1 billion keys, height ≈ log_{200}(1e9) ≈ 4.
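To make the search concrete, here is a tiny hand-built tree queried with the function above (definitions repeated so the snippet runs standalone):

```python
class BTreeNode:
    def __init__(self, leaf=False):
        self.leaf = leaf
        self.keys = []
        self.children = []

def btree_search(node, key):
    """Return True if key exists in the subtree rooted at node."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and key == node.keys[i]:
        return True
    if node.leaf:
        return False
    return btree_search(node.children[i], key)

# Hand-built two-level tree:     [40]
#                               /    \
#                       [10,20,30]  [50,60,70]
root = BTreeNode()
root.keys = [40]
left, right = BTreeNode(leaf=True), BTreeNode(leaf=True)
left.keys, right.keys = [10, 20, 30], [50, 60, 70]
root.children = [left, right]

print(btree_search(root, 60))  # → True  (one descent into the right leaf)
print(btree_search(root, 45))  # → False
```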

Insertion

Insertion always targets a leaf node. If the leaf has space (fewer than 2t-1 keys), the new key is inserted in sorted position. If the leaf is full, a split occurs:

  1. The full node splits into two nodes, each with t-1 keys.
  2. The median key moves up to the parent.
  3. If the parent is also full, the split propagates upward.
  4. If the root splits, a new root is created and height increases by 1.

Visual: Insert 30 into a node [20, 40, 60] (t=2):

Before:
      [50]
     /    \
[20,40,60] [80,100]

Insert 30 → leaf [20,40,60] is full → split:

Step 1: Split leaf into [20] and [60]; median=40 moves up
Step 2: Parent [50] accepts 40

After:
      [40, 50]
     /    |    \
  [20]  [60]  [80,100]
(30 was inserted into [20] → [20,30])

Python: B-Tree Insertion

def btree_insert(root, key, t):
    """Insert key into B-Tree rooted at root. Returns new root."""
    if root is None:
        root = BTreeNode(leaf=True)
        root.keys.append(key)
        return root

    if root.is_full(t):
        # Create a new root, split the old root
        new_root = BTreeNode(leaf=False)
        new_root.children.append(root)
        root.parent = new_root
        split_child(new_root, 0, t)
        insert_non_full(new_root, key, t)
        return new_root

    insert_non_full(root, key, t)
    return root


def split_child(parent, i, t):
    """Split the i-th child of parent, which is full."""
    child = parent.children[i]
    mid = t - 1
    median = child.keys[mid]   # save the median before truncating

    # New node gets the second half of keys
    new_node = BTreeNode(leaf=child.leaf)
    new_node.keys = child.keys[mid + 1:]
    if not child.leaf:
        new_node.children = child.children[mid + 1:]
        for moved in new_node.children:
            moved.parent = new_node

    # Keep first half in child; the median moves up, not sideways
    child.keys = child.keys[:mid]
    if not child.leaf:
        child.children = child.children[:mid + 1]

    # Insert median key into parent
    parent.keys.insert(i, median)
    parent.children.insert(i + 1, new_node)
    new_node.parent = parent
    child.parent = parent


def insert_non_full(node, key, t):
    """Insert key into a non-full node."""
    i = len(node.keys) - 1

    if node.leaf:
        while i >= 0 and key < node.keys[i]:
            i -= 1
        node.keys.insert(i + 1, key)
        return

    while i >= 0 and key < node.keys[i]:
        i -= 1
    i += 1

    if node.children[i].is_full(t):
        split_child(node, i, t)
        if key > node.keys[i]:
            i += 1
    insert_non_full(node.children[i], key, t)

Deletion

Deletion is the most intricate B-Tree operation. The algorithm must ensure every node stays at least half-full:

  1. Key is in a leaf: Remove it directly.
  2. Key is in an internal node: Replace it with the predecessor (or successor) from a leaf, then delete that leaf key.
  3. Node would underflow: Borrow a key from a sibling (rotation), or merge with a sibling.

def btree_delete(node, key, t):
    """Delete key from B-Tree. Pseudocode for the recursive structure."""
    if node is None:
        return False

    i = find_key_index(node, key)

    if i < len(node.keys) and node.keys[i] == key:
        if node.leaf:
            # Case 1: key in leaf
            node.keys.pop(i)
            return True
        # Case 2: key in internal node
        # Find predecessor in left child subtree
        pred = get_predecessor(node.children[i])
        node.keys[i] = pred
        btree_delete(node.children[i], pred, t)
        return True

    if node.leaf:
        return False  # key is not in the tree

    # Key, if present, lives in the i-th child subtree
    child = node.children[i]

    if len(child.keys) == t - 1:
        # Case 3: child is at minimum occupancy and would underflow
        # Try borrow from left sibling
        if i > 0 and len(node.children[i - 1].keys) > t - 1:
            borrow_from_left(node, i)
        # Try borrow from right sibling
        elif i < len(node.children) - 1 and \
             len(node.children[i + 1].keys) > t - 1:
            borrow_from_right(node, i)
        else:
            # Must merge with a sibling; note that merging with the
            # left sibling shifts the target child index down by one
            merge_node(node, i, t)

    return btree_delete(node.children[i], key, t)

Deletion in a B-Tree maintains the structural invariants so that performance stays O(log n) even under heavy write workloads.
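A quick way to sanity-check these invariants after a sequence of inserts and deletes is an in-order traversal, which must always yield the keys in sorted order (a sketch, repeating the node shape from above so it runs standalone):

```python
class BTreeNode:
    def __init__(self, leaf=False):
        self.leaf = leaf
        self.keys = []
        self.children = []

def btree_inorder(node):
    """Return every key in the subtree, in sorted order."""
    if node.leaf:
        return list(node.keys)
    out = []
    for i, key in enumerate(node.keys):
        out.extend(btree_inorder(node.children[i]))  # keys left of separator
        out.append(key)                              # the separator itself
    out.extend(btree_inorder(node.children[-1]))     # rightmost subtree
    return out

# Hand-built tree: root [40] over leaves [10, 20] and [50, 60]
root = BTreeNode()
root.keys = [40]
left, right = BTreeNode(leaf=True), BTreeNode(leaf=True)
left.keys, right.keys = [10, 20], [50, 60]
root.children = [left, right]

print(btree_inorder(root))  # → [10, 20, 40, 50, 60]
```

If any insert or delete leaves this list unsorted, an invariant has been violated.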

B+Trees: The Database Standard

B+Trees are a variant where only leaf nodes contain data pointers. Internal nodes store only keys used for routing. This small change produces major improvements for database workloads.

B+Tree Structure

B+Tree (t=2):
         [50, 100]
        /    |     \
   [25,40] [75]  [150,200]
      ↓      ↓      ↓
   (data)  (data)  (data)
  • Internal nodes contain separator keys only. No data, no record pointers.
  • Leaf nodes contain all keys plus pointers to actual data rows.
  • Leaf nodes are linked in a sorted linked list for sequential traversal.

Python: B+Tree Leaf Node

class BPlusLeafNode:
    def __init__(self):
        self.keys = []
        self.records = []      # pointers to data rows
        self.next_leaf = None  # linked-list pointer
        self.prev_leaf = None

class BPlusInternalNode:
    def __init__(self):
        self.keys = []
        self.children = []

class BPlusTree:
    def __init__(self, order):
        self.order = order
        self.root = BPlusLeafNode()

B+Tree Range Query

Range queries are the killer feature of B+Trees. Because leaf nodes form a linked list, scanning from key_start to key_end is purely sequential:

def bplus_range_query(tree, key_start, key_end):
    """Return all records with keys in [key_start, key_end]."""
    leaf = find_leaf(tree.root, key_start)
    results = []

    while leaf is not None:
        for i, key in enumerate(leaf.keys):
            if key_start <= key <= key_end:
                results.append(leaf.records[i])
            if key > key_end:
                return results
        leaf = leaf.next_leaf

    return results

Without the linked leaves, a B-Tree would need parent back-tracking for range scans—much slower.
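The find_leaf helper used above is not shown in this article; one possible sketch, repeating the node classes from earlier so it runs standalone:

```python
import bisect

class BPlusLeafNode:
    def __init__(self):
        self.keys, self.records, self.next_leaf = [], [], None

class BPlusInternalNode:
    def __init__(self):
        self.keys, self.children = [], []

def find_leaf(node, key):
    """Descend from node to the leaf that would contain key."""
    while isinstance(node, BPlusInternalNode):
        # First separator strictly greater than key selects the child
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    return node

# Two leaves under one internal node with separator key 50
left, right = BPlusLeafNode(), BPlusLeafNode()
left.keys, right.keys = [10, 25], [50, 75]
left.next_leaf = right
root = BPlusInternalNode()
root.keys, root.children = [50], [left, right]

print(find_leaf(root, 25) is left)   # → True
print(find_leaf(root, 60) is right)  # → True
```

The convention here is that keys equal to a separator live in the right subtree; real implementations pick one convention and apply it consistently.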

Why Databases Use B+Trees

Databases choose B+Trees over classic B-Trees for five concrete reasons:

1. Sequential Access for Range Queries

In a B-Tree, to scan from key A to key B, you must traverse internal nodes. In a B+Tree, you follow the leaf linked list. Sequential disk access is ~100× faster than random access.

2. Higher Fanout in Internal Nodes

Internal nodes store only keys, not data pointers. A 4KB page can hold roughly 500 integers instead of 250 integers + 250 record pointers. More keys per node means lower tree height.

3. Cache Efficiency

The working set of internal nodes is smaller, so more of them fit in the database buffer pool. Searches frequently hit the cache after the first I/O.

4. Predictable I/O Patterns

Searches touch exactly one path from root to leaf (h nodes). Range scans touch consecutive leaf pages. The I/O pattern is simple and predictable, which helps query planners estimate costs.

5. Concurrency

Locking in B+Trees is simpler because data lives only in leaves. Internal nodes can be read-locked while leaf nodes are write-locked independently.

Index Types: Clustered vs Non-Clustered

Databases organize B+Trees into two index flavors.

Clustered Index (Primary Index)

  • The table data is stored in the leaf nodes of the B+Tree.
  • There is exactly one clustered index per table.
  • Rows are physically ordered on disk according to the key.
  • Range scans are extremely fast because data is stored sequentially.
  • MySQL InnoDB always has a clustered index (typically on the primary key).
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100)
);
-- InnoDB creates a clustered B+Tree index on id automatically

Non-Clustered Index (Secondary Index)

  • Leaf nodes store the key plus a pointer to the corresponding row (or to the clustered index key).
  • A table can have many non-clustered indexes.
  • Querying by a non-clustered index requires two B+Tree lookups: first the index, then the clustered index for the row data.
CREATE INDEX idx_users_email ON users(email);
-- B+Tree stores (email → id), query needs two lookups

Visual: Clustered vs Non-Clustered Lookup

Clustered index on id:
  B+Tree leaf: [id=42 | row data (name, email, ...)]
  One lookup → data.

Non-clustered index on email:
  B+Tree leaf: ["[email protected]" | id=42]
  Second lookup: B+Tree on id → [id=42 | row data]
  Two lookups → data.

B-Trees in Production Databases

PostgreSQL

PostgreSQL uses B+Trees (implemented as nbtree) as the default index type. The implementation includes several optimizations:

  • Deduplication: When the same key appears many times (common with low-cardinality columns), PostgreSQL stores duplicate keys efficiently using posting lists.
  • Suffix truncation: Internal nodes store only the minimum key prefix needed for routing, saving space and increasing fanout.
  • Bottom-up deletion: PostgreSQL can remove dead index entries in bulk during reads, reducing bloat without VACUUM.
-- Create a B-Tree index in PostgreSQL
CREATE INDEX idx_orders_created_at ON orders(created_at);

-- See the query plan using the index
EXPLAIN ANALYZE
SELECT * FROM orders WHERE created_at >= '2025-01-01';
-- Output:
-- Index Scan using idx_orders_created_at on orders
--   Index Cond: (created_at >= '2025-01-01'::date)
--   Execution Time: 0.042 ms

Unused indexes hurt write performance. Use pg_stat_user_indexes to identify unused indexes:

SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

MySQL (InnoDB)

InnoDB uses B+Trees exclusively:

  • The primary key index is always a clustered B+Tree.
  • Secondary indexes store the primary key as the row pointer (not a physical address).
  • This means a secondary index lookup in InnoDB requires two B+Tree traversals.
-- InnoDB automatically clusters on the primary key
CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id INT,
    amount DECIMAL(10, 2),
    INDEX idx_user_id (user_id)
) ENGINE=InnoDB;

Choosing a good primary key matters because all secondary indexes reference it. UUIDs as primary keys cause B+Tree fragmentation because inserts are random rather than sequential.

MongoDB

MongoDB uses B+Trees for all its indexes. The _id field gets a unique B+Tree index by default. Additional indexes are B+Trees that map indexed field values to document locations.

// Default _id index (B+Tree)
db.collection.getIndexes();

// Create a compound B+Tree index
db.orders.createIndex({ user_id: 1, created_at: -1 });

// MongoDB uses the index for range queries
db.orders.find({
    user_id: "alice",
    created_at: { $gte: ISODate("2025-01-01") }
}).explain("executionStats");
// → "IXSCAN" stage indicates B+Tree index usage

MongoDB’s WiredTiger storage engine uses B+Trees and adds prefix compression and page-level encryption.

B-Tree vs LSM-Tree

Log-Structured Merge-Trees (LSM-Trees) are the primary alternative to B+Trees. Used by LevelDB, RocksDB, Cassandra, and ScyllaDB, LSM-Trees take a fundamentally different approach.

LSM-Tree Write Path

Writes go to an in-memory sorted structure (memtable). When the memtable is full, it is flushed to disk as a sorted immutable file (SSTable). Over time, background compaction merges SSTables.

def lsm_write(memtable, wal_log, sstables, key, value):
    """Write path for an LSM-Tree."""
    # 1. Append to write-ahead log for durability
    wal_log.append((key, value))

    # 2. Insert into in-memory memtable (sorted)
    memtable[key] = value

    # 3. Flush to disk when memtable is full
    if len(memtable) > MEMTABLE_SIZE:
        sstable_id = flush_to_sstable(memtable)
        sstables.append(sstable_id)
        memtable.clear()

LSM-Tree Read Path

Reads check the memtable first, then search SSTables from newest to oldest. To avoid reading many files, LSM-Trees maintain bloom filters and a sparse index per SSTable.

def lsm_read(memtable, sstables, key):
    """Read path for an LSM-Tree."""
    if key in memtable:
        return memtable[key]

    for sstable in reversed(sstables):  # newest first
        if sstable.bloom_filter.might_contain(key):
            value = sstable.get(key)
            if value is not None:
                return value

    return None
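The bloom_filter used above can be sketched as a bit array probed by several hash functions (a minimal illustration using hashlib; the sizes are arbitrary, and real engines use parameters tuned to the expected key count):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
for k in ("user:1", "user:2", "user:3"):
    bf.add(k)

print(all(bf.might_contain(k) for k in ("user:1", "user:2", "user:3")))  # → True
```

The guarantee is one-sided: a bloom filter never reports a false negative, so an SSTable whose filter says "absent" can be skipped without any disk I/O.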

Key Tradeoffs

Aspect                B+Tree                   LSM-Tree
──────────────────────────────────────────────────────────────────────────────
Write amplification   Low (~1-3×)              High (~10-50×) due to compaction
Read amplification    Low (tree height)        Medium-high (multiple SSTables)
Space amplification   Low                      Medium (tombstones, duplicates)
Random write speed    Moderate                 Very fast (sequential flushes)
Range scan speed      Fast (linked leaves)     Good (sorted SSTables)
Write stall risk      During page splits       Under compaction pressure

Performance Characteristics

Operation           B-Tree           B+Tree           LSM-Tree             Hash Index
─────────────────────────────────────────────────────────────────────────────────────
Point lookup        O(log n)         O(log n)         O(log n) amortized   O(1) avg
Range scan          O(k + log n)     O(k + log n)     O(k + log n)         O(n)
Insert              O(log n)         O(log n)         O(log n) amortized   O(1) avg
Delete              O(log n)         O(log n)         O(log n) amortized   O(1) avg
Disk writes per op  Low (in-place)   Low (in-place)   High (compaction)    Low
Space overhead      Low              Low              Medium               High (load factor)
Concurrency         Moderate         Good             Good                 Simple

Hash indexes provide the fastest point lookups but cannot do range scans at all. B+Trees offer the best all-around performance for general-purpose database workloads.

Practical Considerations

Choosing the Order

The B+Tree order (equivalent to 2× minimum degree) is chosen to fit a disk page:

  • 4KB page → order ~200 (at ~20 bytes per key+pointer entry ≈ 200 entries)
  • 8KB page → order ~400
  • 16KB page → order ~800

Larger pages improve range scan performance but waste space on partial pages and amplify write costs.
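The order figures above follow from dividing usable page space by per-entry size (a sketch; the entry and header sizes are illustrative assumptions, not any engine's actual page layout):

```python
def estimate_order(page_bytes, entry_bytes=20, header_bytes=96):
    """Rough entries per page: (page - header) / per-entry bytes.
    20 bytes models an 8-byte key, an 8-byte child pointer, and
    ~4 bytes of slot overhead — assumed values, for illustration."""
    return (page_bytes - header_bytes) // entry_bytes

print(estimate_order(4 * 1024))   # → 200
print(estimate_order(8 * 1024))   # → 404
print(estimate_order(16 * 1024))  # → 814
```

Wider keys (UUIDs, long strings) shrink these numbers proportionally, which is one more reason compact keys make better indexes.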

Page Splits and Merges

When a page splits, half the keys move to a new page and the median goes to the parent. This leaves pages at ~50% occupancy initially. Over time, occupancy fluctuates. PostgreSQL targets ~90% average occupancy through careful split heuristics.

Bulk Loading

Building a B+Tree from scratch by inserting rows one at a time is inefficient (each insert traverses the tree and may cause splits). Bulk loading builds the tree bottom-up:

  1. Sort all input keys.
  2. Pack keys into leaf pages.
  3. Build internal pages bottom-up.
def bplus_bulk_load(sorted_keys, records, order):
    """Bottom-up B+Tree construction."""
    # Step 1: Build leaf pages
    leaves = []
    for i in range(0, len(sorted_keys), order):
        leaf_page = BPlusLeafNode()
        leaf_page.keys = sorted_keys[i:i + order]
        leaf_page.records = records[i:i + order]
        leaves.append(leaf_page)

    # Link leaf pages
    for i in range(len(leaves) - 1):
        leaves[i].next_leaf = leaves[i + 1]

    # Step 2: Build internal pages from leaf pages
    root = build_internal_level(leaves, order)
    return root

PostgreSQL’s CREATE INDEX and MySQL’s ALTER TABLE ADD INDEX both use bulk loading.
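The build_internal_level helper referenced in the bulk-load sketch is not defined in this article; one possible recursive version, repeating the node classes so it runs standalone:

```python
class BPlusLeafNode:
    def __init__(self):
        self.keys, self.records, self.next_leaf = [], [], None

class BPlusInternalNode:
    def __init__(self):
        self.keys, self.children = [], []

def smallest_key(node):
    """Leftmost key in a subtree, used as a separator."""
    while isinstance(node, BPlusInternalNode):
        node = node.children[0]
    return node.keys[0]

def build_internal_level(children, order):
    """Group children under parent nodes until one root remains."""
    if len(children) == 1:
        return children[0]
    parents = []
    for i in range(0, len(children), order):
        group = children[i:i + order]
        node = BPlusInternalNode()
        node.children = group
        # Separator keys: smallest key in each child after the first
        node.keys = [smallest_key(c) for c in group[1:]]
        parents.append(node)
    return build_internal_level(parents, order)

# Four packed leaves, grouped two per parent
leaves = []
for ks in ([1, 2], [3, 4], [5, 6], [7, 8]):
    leaf = BPlusLeafNode()
    leaf.keys = list(ks)
    leaves.append(leaf)

root = build_internal_level(leaves, order=2)
print(root.keys)  # → [5]
```

Because every page except possibly the last in each level is packed full, the resulting tree starts at near-100% occupancy rather than the ~50% that repeated splits produce.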

Prefix Compression

Internal nodes do not need full keys—only enough bytes to distinguish between adjacent child subtrees. PostgreSQL’s nbtree index stores only the minimal prefix needed for comparison, effectively doubling the fanout for text columns.
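Suffix truncation can be illustrated with a tiny function that finds the shortest string still separating two adjacent keys (a hypothetical sketch of the idea, not PostgreSQL's actual algorithm):

```python
def truncated_separator(left_max, right_min):
    """Shortest prefix of right_min that still sorts strictly above
    left_max; it routes searches exactly as well as the full key.
    Assumes left_max < right_min, as for adjacent leaf pages."""
    for i in range(1, len(right_min) + 1):
        prefix = right_min[:i]
        if prefix > left_max:
            return prefix
    return right_min

print(truncated_separator("apple", "application"))  # → 'appli'
print(truncated_separator("car", "dog"))            # → 'd'
```

Storing 'd' instead of 'dog' (or 'appli' instead of 'application') is how internal pages squeeze in far more separators per 4KB page.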

When to Use Which Structure

Choose B+Tree when:

  • Workload is read-heavy or mixed (OLTP, web apps)
  • You need range queries, ORDER BY, or GROUP BY
  • Data must be durable with minimal write amplification
  • You want predictable query performance regardless of write patterns

Choose LSM-Tree when:

  • Workload is write-heavy with bursts (time-series, logging)
  • You can tolerate read/write amplification tradeoffs
  • You need high ingest throughput on spinning disks
  • You use systems like Cassandra, RocksDB, or Bigtable

Choose Hash Index when:

  • Queries are exclusively point lookups (key = value)
  • No range queries or sorting needed
  • The index fits in memory

Conclusion

B-Trees and B+Trees are not just academic data structures—they are the engines that power the storage layer of nearly every database you use. The design choices that distinguish them from binary trees (high branching factor, leaf-linked lists, separation of routing from data) were driven by the physics of disk hardware and remain relevant even in the age of NVMe SSDs.

Understanding how B+Trees work helps you write better queries, choose the right indexes, and diagnose performance problems. When you create a composite index in PostgreSQL or see an IXSCAN in MongoDB, you now know exactly what the database is doing: walking a shallow, wide tree from root to leaf, following pointers through a few disk pages, and returning your data in milliseconds.
