Skip to main content
โšก Calmops

PostgreSQL Internals: MVCC, Storage, and Query Processing

Introduction

Understanding PostgreSQL’s internal architecture helps you make better design decisions, optimize queries, and troubleshoot performance issues. This article explores the core components that make PostgreSQL work: from how it handles concurrent operations to how queries are executed.


Process Architecture

PostgreSQL Process Model

PostgreSQL uses a multi-process architecture with a parent process managing multiple child processes:

PostgreSQL Process Hierarchy
โ”œโ”€โ”€ postmaster (parent)
โ”‚   โ”œโ”€โ”€ autovacuum launcher
โ”‚   โ”œโ”€โ”€ background writer
โ”‚   โ”œโ”€โ”€ checkpointer
โ”‚   โ”œโ”€โ”€ logger
โ”‚   โ”œโ”€โ”€ wal writer
โ”‚   โ””โ”€โ”€ backend processes (one per connection)
โ”‚       โ”œโ”€โ”€ postgres: user query execution
โ”‚       โ””โ”€โ”€ postgres: user query execution

Key Processes

-- View all PostgreSQL processes
SELECT pid, usename, application_name, backend_type
FROM pg_stat_activity;

-- Key processes:
-- postmaster: Parent process, listens for connections
-- postgres: Backend process for each client connection
-- autovacuum: Automatic VACUUM and ANALYZE
-- bgwriter: Writes dirty buffers to disk
-- walwriter: Writes WAL to disk
-- checkpointer: Performs checkpoints
-- logger: Writes to log files

Memory Architecture

PostgreSQL uses two types of memory:

-- Shared Memory (shared by all processes)
-- postgresql.conf settings:
shared_buffers = 256MB        -- Cache for table/index data
wal_buffers = 16MB            -- WAL buffers
wal_writer_flush_after = 1MB

-- Local Memory (per-backend)
work_mem = 4MB                -- Per sort/hash operation
maintenance_work_mem = 64MB   -- VACUUM, CREATE INDEX
temp_buffers = 8MB            -- Temporary tables

-- View memory usage
SELECT * FROM pg_backend_memory_contexts;

Storage and Page Layout

Page Structure

PostgreSQL stores data in 8KB pages. Each page has a fixed header followed by item pointers and actual data:

Page Layout (8KB)
+------------------+
|  PageHeaderData  |  (24 bytes)
|  - pd_lsn       |
|  - pd_tli       |
|  - pd_lower     |
|  - pd_upper     |
|  - pd_special  |
+------------------+
|   ItemIdData    |  (pointers to items)
|   - lp_off      |
|   - lp_flags    |
|   - lp_len      |
+------------------+
|                 |
|   Tuple Data    |
|   (actual rows) |
|                 |
+------------------+
|   Special Space |  (index-specific data)
+------------------+

Tuple (Row) Structure

// Tuple header (minimal)
typedef struct HeapTupleHeaderData {
    uint32 t_xmin;           // Inserting transaction ID
    uint32 t_xmax;           // Deleting/updating transaction ID
    uint32 t_cmin;           // Command ID within transaction
    uint32 t_cmax;           // Command ID that deleted/updated
    uint32 t_ctid;           // Current tuple ID (self or new version)
    uint16 t_hoff;           // Offset to user data
    bits8 t_bits[1];         // Null bitmap
    char t_oid;              // Object ID (if has OID)
} HeapTupleHeaderData;

// System columns available in every table:
// - ctid: Physical location of tuple
// - xmin: Transaction that created this row
// - xmax: Transaction that deleted/updated this row
// - cmin: Command ID within transaction
// - cmax: Command ID that changed this row
// - tableoid: Table's OID

TOAST (The Oversized-Attribute Storage Technique)

-- TOAST compresses/breaks large values
-- Automatic for values > 2KB

-- Check TOAST usage
SELECT relname, 
       reltoastrelid::regclass AS toast_table,
       pg_size_pretty(pg_total_relation_size(reltoastrelid)) AS toast_size
FROM pg_class 
WHERE relkind = 'r' 
AND reltoastrelid != 0;

-- TOAST strategies (per column)
CREATE TABLE documents (
    id SERIAL,
    content text,
    large_data bytea
);
ALTER TABLE documents 
ALTER COLUMN content SET STORAGE EXTENDED;  -- Allows compression
ALTER TABLE documents 
ALTER COLUMN large_data SET STORAGE EXTERNAL;  -- No compression

MVCC (Multi-Version Concurrency Control)

How MVCC Works

PostgreSQL uses MVCC to allow concurrent transactions without locking:

-- Every row has hidden system columns:
-- xmin: Transaction that created the row
-- xmax: Transaction that deleted/updated the row

-- Example: Transaction visibility
-- Transaction 1: INSERT
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 1000);
COMMIT;

-- Transaction 2: READ (sees the inserted row)
BEGIN;
SELECT * FROM accounts;  -- Sees id=1, balance=1000
COMMIT;

-- Transaction 3: UPDATE
BEGIN;
UPDATE accounts SET balance = 2000 WHERE id = 1;
-- Now two versions exist:
-- Version 1: xmin=1, xmax=3 (old version, marked for deletion)
-- Version 2: xmin=3, xmax=0 (new version, visible)
COMMIT;

-- Transaction 4: READ (sees updated balance)
SELECT * FROM accounts;  -- Sees id=1, balance=2000

Transaction Visibility Rules

-- Check transaction visibility
-- A row is visible if:
-- 1. xmin is committed AND
-- 2. Transaction is not in progress AND
-- 3. Either xmax = 0 OR xmax is in progress OR xmax was rolled back

-- View row versions
SELECT ctid, xmin, xmax, * FROM accounts;

-- ctid: (page, offset) - physical location
-- xmin: Transaction that created this version
-- xmax: Transaction that replaced this version (0 if current)

Isolation Levels

-- READ COMMITTED (default)
-- Sees all committed changes from other transactions
-- Each statement sees snapshot at start

BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- Reads committed value
-- Another transaction updates balance
SELECT balance FROM accounts WHERE id = 1;  -- May see new value
COMMIT;

-- REPEATABLE READ
-- Sees consistent snapshot from first query
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- Other transactions' updates are NOT visible
SELECT balance FROM accounts WHERE id = 1;  -- Same value
COMMIT;

-- SERIALIZABLE
-- Equivalent to serial execution
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Transactions are serialized; may get serialization failure

Write-Ahead Logging (WAL)

WAL Purpose

WAL ensures durability: changes are written to disk as log records before data pages are modified:

-- WAL structure
-- Each WAL record contains:
-- - Transaction ID
-- - LSN (Log Sequence Number)
-- - Previous LSN
-- - Page modifications

-- View WAL stats
SELECT * FROM pg_stat_wal;

-- View WAL files
SELECT * FROM pg_ls_waldir() ORDER BY name;

-- Checkpoint info
SELECT * FROM pg_stat_bgwriter;

Checkpoints

-- Force checkpoint
CHECKPOINT;

-- Configure checkpoints
-- postgresql.conf
checkpoint_timeout = 5min        # Time between checkpoints
checkpoint_completion_target = 0.9  # I/O spread
max_wal_size = 1GB              # Max WAL before checkpoint
min_wal_size = 80MB             # Minimum WAL to keep

-- View checkpoint history
SELECT * FROM pg_stat_checkpointer;

WAL Recovery Process

-- Recovery steps:
-- 1. Find last valid checkpoint
-- 2. Read WAL from checkpoint
-- 3. Redo all transactions
-- 4. Roll back uncommitted transactions

-- Point-in-time recovery
-- postgresql.auto.conf
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2026-03-05 10:00:00 UTC'
recovery_target_action = 'promote'

Query Processing

Query Flow

SQL Query
    โ†“
Parser (syntax check)
    โ†“
Rewriter (rules, views)
    โ†“
Planner/Optimizer (create execution plan)
    โ†“
Executor (run the plan)
    โ†“
Results

Parser

-- Parser checks SQL syntax
-- Creates parse tree
-- Validates table/column names

-- Example parse tree for:
-- SELECT name FROM users WHERE id = 1;

-- ParseTree:
-- SelectStmt
--   - targetList: [name]
--   - fromClause: [users]
--   - whereClause: [id = 1]

Rewriter

-- Rewriter applies:
-- - Rules (CREATE RULE)
-- - Views
-- - Materialized views

-- Example: View definition
CREATE VIEW user_orders AS
SELECT u.name, o.total_amount
FROM users u
JOIN orders o ON u.id = o.user_id;

-- Rewriter expands view into underlying query

Planner/Optimizer

-- Planner creates execution plan
EXPLAIN SELECT * FROM users u 
JOIN orders o ON u.id = o.user_id 
WHERE u.created_at > '2026-01-01';

-- Execution Plan:
-- Nested Loop (cost=...)
--   -> Index Scan on users (created_at)
--   -> Index Lookup on orders (user_id)

Join Methods

-- Nested Loop Join
EXPLAIN (FORMAT TEXT)
SELECT * FROM orders o 
JOIN users u ON o.user_id = u.id;

-- Hash Join (for large tables)
EXPLAIN (FORMAT TEXT)
SELECT * FROM orders o 
JOIN users u ON o.user_id = u.id;

-- Merge Join (for pre-sorted data)
EXPLAIN (FORMAT TEXT)
SELECT * FROM orders o 
JOIN users u ON o.user_id = u.id
ORDER BY user_id;

Executor

-- Executor runs plan nodes:
-- - Sequential Scan
-- - Index Scan / Index Only Scan
-- - Bitmap Scan
-- - Nested Loop
-- - Hash Join / Hash Aggregate
-- - Sort / Limit

-- Run with timing
EXPLAIN (ANALYZE, TIMING, BUFFERS)
SELECT * FROM orders WHERE total_amount > 1000;

Index Internals

B-Tree Index

-- B-Tree is default and most common
CREATE INDEX idx_users_email ON users(email);

-- B-Tree structure:
-- Root (page)
--   โ”œโ”€โ”€ Branch pages
--   โ”‚    โ””โ”€โ”€ ...
--   โ””โ”€โ”€ Leaf pages (actual index entries)

-- Index entries point to heap tuples via TID
-- (page_number, item_pointer)

Index Types Comparison

-- B-Tree: Equality and range queries
CREATE INDEX idx ON table (column);
CREATE INDEX idx ON table (column DESC);

-- Hash: Equality only
CREATE INDEX idx ON table USING HASH (column);

-- GiST: Multiple indexing strategies
-- Full-text search, geometric, etc.
CREATE INDEX idx ON articles USING GIST (to_tsvector('english', content));

-- GIN: Inverted index for composite types
-- JSON, arrays, full-text
CREATE INDEX idx ON documents USING GIN (data);

-- BRIN: Block Range Index
-- For sequentially stored data
CREATE INDEX idx ON logs USING BRIN (created_at);

Index-Only Scans

-- Index-only scan avoids heap access
-- Only works if all columns are in index

CREATE INDEX idx_users_name_email ON users(name, email);

EXPLAIN SELECT name, email FROM users WHERE name = 'John';
-- Index Only Scan using idx_users_name_email

EXPLAIN SELECT * FROM users WHERE name = 'John';
-- Index Scan (needs to access heap for other columns)

VACUUM and Autovacuum

Why VACUUM is Needed

-- MVCC creates dead tuples
-- VACUUM reclaims space

-- Example:
UPDATE users SET email = '[email protected]' WHERE id = 1;
-- Creates new row version
-- Old version becomes "dead tuple"

-- VACUUM reclaims the space
VACUUM users;

VACUUM Types

-- VACUUM: Reclaims space, marks as reusable
VACUUM users;

-- VACUUM FULL: Rewrites entire table, removes all bloat
-- Requires exclusive lock
VACUUM FULL users;

-- VACUUM ANALYZE: VACUUM + update statistics
VACUUM ANALYZE users;

-- VACUUM FREEZE: Freeze old transaction IDs
VACUUM FREEZE users;

Autovacuum

-- postgresql.conf
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_insert_threshold = 1000
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1

-- Per-table autovacuum settings
ALTER TABLE users SET (autovacuum_vacuum_threshold = 1000);

Transaction ID Wraparound

-- PostgreSQL uses 32-bit transaction IDs
-- ~4 billion transactions possible
-- VACUUM FREEZE prevents wraparound

-- Check transaction age
SELECT relname, age(relfrozenxid) as xid_age
FROM pg_class 
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC;

-- Emergency vacuum if approaching wraparound
VACUUM FREEZE users;

Buffer Cache

Shared Buffers

-- View buffer statistics
SELECT 
    c.relname,
    count(*) as buffers,
    round(100.0 * count(*) / (SELECT count(*) FROM pg_buffercache), 2) as pct
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;

-- Cache hit ratio (should be > 95%)
SELECT 
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    round(100.0 * sum(heap_blks_hit) / nullif(sum(heap_blks_read) + sum(heap_blks_hit), 0), 2) as ratio
FROM pg_statio_user_tables;

Locking

Lock Types

-- Table-level locks
ACCESS SHARE      -- SELECT
ROW SHARE         -- SELECT FOR UPDATE
ROW EXCLUSIVE     -- INSERT, UPDATE, DELETE
SHARE UPDATE EXCLUSIVE -- VACUUM, ANALYZE
SHARE             -- CREATE INDEX
SHARE ROW EXCLUSIVE -- ALTER TABLE
EXCLUSIVE         -- DROP TABLE
ACCESS EXCLUSIVE  -- Most DDL

-- View current locks
SELECT 
    locktype, database::regclass, relation::regclass, 
    mode, granted, pid
FROM pg_locks
WHERE database = (SELECT oid FROM pg_database WHERE datname = current_database());

-- Kill blocking query
SELECT pg_terminate_backend(pid) 
FROM pg_locks 
WHERE mode = 'ExclusiveLock' 
AND granted = true;

Extensions

-- Enable built-in extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm";
CREATE EXTENSION IF NOT EXISTS "hstore";
CREATE EXTENSION IF NOT EXISTS "citext";

-- View installed extensions
SELECT * FROM pg_extension;

-- pg_stat_statements (query performance)
CREATE EXTENSION pg_stat_statements;

-- View slow queries
SELECT query, calls, mean_time, total_time 
FROM pg_stat_statements 
ORDER BY mean_time DESC 
LIMIT 10;

Conclusion

Understanding PostgreSQL’s internalsโ€”MVCC, WAL, the query planner, and storageโ€”helps you design better schemas, write faster queries, and diagnose performance issues. These fundamentals apply whether you’re tuning a simple application or managing a complex distributed system.

In the next article, we’ll explore PostgreSQL’s recent developments: new features in versions 17 and 18, the evolving extension ecosystem, and emerging capabilities.

Comments