PostgreSQL Internals: MVCC, Storage, and Query Processing

Created: March 5, 2026 · CalmOps · 10 min read

Introduction

Understanding PostgreSQL’s internal architecture helps you make better design decisions, optimize queries, and troubleshoot performance issues. This article explores the core components that make PostgreSQL work: from how it handles concurrent operations to how queries are executed.


Process Architecture

PostgreSQL Process Model

PostgreSQL uses a multi-process architecture with a parent process managing multiple child processes:

PostgreSQL Process Hierarchy
├── postmaster (parent)
│   ├── autovacuum launcher
│   ├── background writer
│   ├── checkpointer
│   ├── logger
│   ├── wal writer
│   └── backend processes (one per connection)
│       ├── postgres: user query execution
│       └── postgres: user query execution

Key Processes

-- View all PostgreSQL processes
SELECT pid, usename, application_name, backend_type
FROM pg_stat_activity;

-- Key processes:
-- postmaster: Parent process, listens for connections
-- postgres: Backend process for each client connection
-- autovacuum: Automatic VACUUM and ANALYZE
-- bgwriter: Writes dirty buffers to disk
-- walwriter: Writes WAL to disk
-- checkpointer: Performs checkpoints
-- logger: Writes to log files

Memory Architecture

PostgreSQL uses two types of memory:

-- Shared Memory (shared by all processes)
-- postgresql.conf settings:
shared_buffers = 256MB        -- Cache for table/index data
wal_buffers = 16MB            -- WAL buffers
wal_writer_flush_after = 1MB

-- Local Memory (per-backend)
work_mem = 4MB                -- Per sort/hash operation
maintenance_work_mem = 64MB   -- VACUUM, CREATE INDEX
temp_buffers = 8MB            -- Temporary tables

-- View memory usage
SELECT * FROM pg_backend_memory_contexts;
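The effect of work_mem is easy to observe. The following sketch (meant for a scratch session; the exact disk/memory figures will vary) shows the same sort spilling to disk when the limit is too small and staying in memory when it is not:

-- Sketch: watch a sort spill to disk when work_mem is too small
SET work_mem = '64kB';
EXPLAIN (ANALYZE) SELECT g FROM generate_series(1, 100000) g ORDER BY g;
-- Sort Method: external merge  Disk: ...kB

SET work_mem = '64MB';
EXPLAIN (ANALYZE) SELECT g FROM generate_series(1, 100000) g ORDER BY g;
-- Sort Method: quicksort  Memory: ...kB
RESET work_mem;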

Storage and Page Layout

Page Structure

PostgreSQL stores data in 8KB pages. Each page has a fixed header followed by item pointers and actual data:

Page Layout (8KB)
+------------------+
|  PageHeaderData  |  (24 bytes)
|  - pd_lsn        |
|  - pd_tli        |
|  - pd_lower      |
|  - pd_upper      |
|  - pd_special    |
+------------------+
|  ItemIdData      |  (pointers to items)
|  - lp_off        |
|  - lp_flags      |
|  - lp_len        |
+------------------+
|                  |
|  Tuple Data      |
|  (actual rows)   |
|                  |
+------------------+
|  Special Space   |  (index-specific data)
+------------------+
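This layout can be inspected directly with the pageinspect extension (superuser required). A sketch, assuming an accounts table exists with at least one page of data:

-- Requires: CREATE EXTENSION pageinspect;
-- Page header of the table's first page
SELECT lsn, lower, upper, special
FROM page_header(get_raw_page('accounts', 0));

-- Line pointers and tuple headers on that page
SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('accounts', 0));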

Tuple (Row) Structure

// Simplified tuple header (see src/include/access/htup_details.h
// for the full definition)
typedef struct HeapTupleHeaderData {
    TransactionId t_xmin;      // Inserting transaction ID
    TransactionId t_xmax;      // Deleting/updating transaction ID
    CommandId     t_cid;       // cmin/cmax (one overlaid field)
    ItemPointerData t_ctid;    // TID of this or a newer row version
    uint16 t_infomask2;        // Attribute count and flag bits
    uint16 t_infomask;         // Visibility hint flag bits
    uint8  t_hoff;             // Offset to user data
    bits8  t_bits[FLEXIBLE_ARRAY_MEMBER];  // Null bitmap
} HeapTupleHeaderData;

// System columns available in every table:
// - ctid: Physical location of tuple
// - xmin: Transaction that created this row
// - xmax: Transaction that deleted/updated this row
// - cmin: Command ID within transaction
// - cmax: Command ID that changed this row
// - tableoid: Table's OID

TOAST (The Oversized-Attribute Storage Technique)

-- TOAST compresses and/or moves large values out-of-line
-- Kicks in when a row exceeds ~2KB (TOAST_TUPLE_THRESHOLD)

-- Check TOAST usage
SELECT relname, 
       reltoastrelid::regclass AS toast_table,
       pg_size_pretty(pg_total_relation_size(reltoastrelid)) AS toast_size
FROM pg_class 
WHERE relkind = 'r' 
AND reltoastrelid != 0;

-- TOAST strategies (per column)
CREATE TABLE documents (
    id SERIAL,
    content text,
    large_data bytea
);
ALTER TABLE documents 
ALTER COLUMN content SET STORAGE EXTENDED;  -- Allows compression
ALTER TABLE documents 
ALTER COLUMN large_data SET STORAGE EXTERNAL;  -- No compression

MVCC (Multi-Version Concurrency Control)

How MVCC Works

PostgreSQL uses MVCC to allow concurrent transactions without locking:

-- Every row has hidden system columns:
-- xmin: Transaction that created the row
-- xmax: Transaction that deleted/updated the row

-- Example: Transaction visibility
-- Transaction 1: INSERT
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 1000);
COMMIT;

-- Transaction 2: READ (sees the inserted row)
BEGIN;
SELECT * FROM accounts;  -- Sees id=1, balance=1000
COMMIT;

-- Transaction 3: UPDATE
BEGIN;
UPDATE accounts SET balance = 2000 WHERE id = 1;
-- Now two versions exist:
-- Version 1: xmin=1, xmax=3 (old version, marked for deletion)
-- Version 2: xmin=3, xmax=0 (new version, visible)
COMMIT;

-- Transaction 4: READ (sees updated balance)
SELECT * FROM accounts;  -- Sees id=1, balance=2000

Transaction Visibility Rules

-- Check transaction visibility
-- A row version is visible to a snapshot if:
-- 1. xmin committed before the snapshot was taken, AND
-- 2. xmax = 0, OR xmax's transaction aborted, OR
--    xmax's transaction had not committed when the snapshot was taken

-- View row versions
SELECT ctid, xmin, xmax, * FROM accounts;

-- ctid: (page, offset) - physical location
-- xmin: Transaction that created this version
-- xmax: Transaction that replaced this version (0 if current)
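An in-flight xmax can be observed directly with two concurrent sessions. A sketch, assuming the accounts table from the earlier example:

-- Session A:
BEGIN;
DELETE FROM accounts WHERE id = 1;  -- sets xmax, not yet committed

-- Session B (concurrently): the row is still visible,
-- but xmax now carries session A's transaction ID
SELECT ctid, xmin, xmax, id FROM accounts WHERE id = 1;

-- Session A:
ROLLBACK;  -- xmax is now ignored; the row stays visible to everyone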

Isolation Levels

-- READ COMMITTED (default)
-- Each statement sees a snapshot taken at the start of that
-- statement, so it sees everything committed before it began

BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- Reads committed value
-- Another transaction updates balance
SELECT balance FROM accounts WHERE id = 1;  -- May see new value
COMMIT;

-- REPEATABLE READ
-- Sees consistent snapshot from first query
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- Other transactions' updates are NOT visible
SELECT balance FROM accounts WHERE id = 1;  -- Same value
COMMIT;

-- SERIALIZABLE
-- Equivalent to serial execution
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Transactions are serialized; may get serialization failure
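A classic way to trigger a serialization failure is write skew: two transactions each read an aggregate and then insert a row that would change the other's result. A sketch of the interleaving (session labels are illustrative):

-- Session A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(balance) FROM accounts;
INSERT INTO accounts (id, balance) VALUES (2, 500);

-- Session B (concurrently):
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(balance) FROM accounts;
INSERT INTO accounts (id, balance) VALUES (3, 700);

-- Session A:
COMMIT;  -- succeeds

-- Session B:
COMMIT;
-- ERROR:  could not serialize access due to
-- read/write dependencies among transactions

The failing transaction must be retried by the application; this is expected behavior under SERIALIZABLE, not a bug.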

Write-Ahead Logging (WAL)

WAL Purpose

WAL ensures durability: changes are written to disk as log records before data pages are modified:

-- WAL structure
-- Each WAL record contains:
-- - Transaction ID
-- - LSN (Log Sequence Number)
-- - Previous LSN
-- - Page modifications

-- View WAL stats
SELECT * FROM pg_stat_wal;

-- View WAL files
SELECT * FROM pg_ls_waldir() ORDER BY name;

-- Background writer stats (checkpoint counters moved to
-- pg_stat_checkpointer in PostgreSQL 17)
SELECT * FROM pg_stat_bgwriter;
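The current WAL position is exposed as an LSN, and a few built-in functions let you measure WAL generation. A sketch (the diff against LSN '0/0' is just an illustration of the arithmetic):

-- Current write position in the WAL
SELECT pg_current_wal_lsn();

-- Bytes of WAL between two LSNs
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') AS wal_bytes;

-- Map an LSN to the WAL segment file that contains it
SELECT pg_walfile_name(pg_current_wal_lsn());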

Checkpoints

-- Force checkpoint
CHECKPOINT;

-- Configure checkpoints
-- postgresql.conf
checkpoint_timeout = 5min        # Time between checkpoints
checkpoint_completion_target = 0.9  # I/O spread
max_wal_size = 1GB              # Max WAL before checkpoint
min_wal_size = 80MB             # Minimum WAL to keep

-- Checkpoint statistics (PostgreSQL 17+; earlier versions
-- expose these counters in pg_stat_bgwriter)
SELECT * FROM pg_stat_checkpointer;

WAL Recovery Process

-- Recovery steps:
-- 1. Find the last valid checkpoint
-- 2. Replay WAL forward from the checkpoint's redo point
-- 3. Changes from uncommitted transactions simply stay invisible:
--    PostgreSQL has no undo phase; MVCC visibility handles it

-- Point-in-time recovery
-- postgresql.auto.conf
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2026-03-05 10:00:00 UTC'
recovery_target_action = 'promote'

Query Processing

Query Flow

SQL Query
    ↓
Parser (syntax check)
    ↓
Rewriter (rules, views)
    ↓
Planner/Optimizer (create execution plan)
    ↓
Executor (run the plan)
    ↓
Results

Parser

-- Parser checks SQL syntax
-- Creates parse tree
-- Validates table/column names

-- Example parse tree for:
-- SELECT name FROM users WHERE id = 1;

-- ParseTree:
-- SelectStmt
--   - targetList: [name]
--   - fromClause: [users]
--   - whereClause: [id = 1]

Rewriter

-- Rewriter applies:
-- - Rules (CREATE RULE)
-- - Views
-- - Materialized views

-- Example: View definition
CREATE VIEW user_orders AS
SELECT u.name, o.total_amount
FROM users u
JOIN orders o ON u.id = o.user_id;

-- Rewriter expands view into underlying query
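The expansion is visible in the query plan: the view name disappears and only the underlying tables remain. A sketch (the exact plan shape depends on table sizes and statistics):

EXPLAIN SELECT * FROM user_orders;
-- Hash Join
--   Hash Cond: (o.user_id = u.id)
--   -> Seq Scan on orders o
--   -> Hash
--        -> Seq Scan on users u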

Planner/Optimizer

-- Planner creates execution plan
EXPLAIN SELECT * FROM users u 
JOIN orders o ON u.id = o.user_id 
WHERE u.created_at > '2026-01-01';

-- Possible execution plan (shape depends on statistics):
-- Nested Loop  (cost=...)
--   -> Index Scan using users_created_at_idx on users u
--   -> Index Scan using orders_user_id_idx on orders o
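The planner's cost estimates come from statistics that ANALYZE gathers into the system catalogs. A sketch of inspecting them (assuming a users table):

-- Refresh statistics, then inspect them per column
ANALYZE users;
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'users';

Stale statistics are a common cause of bad plans; if a plan's row estimates are far from the actual counts in EXPLAIN ANALYZE, re-running ANALYZE is the first thing to try.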

Join Methods

-- The planner chooses the join method from table sizes and
-- statistics; the same query can use any of the three. To compare
-- them, selectively disable methods in a test session:

-- Nested Loop Join (small outer side, indexed inner side)
EXPLAIN SELECT * FROM orders o
JOIN users u ON o.user_id = u.id
WHERE u.id = 42;

-- Hash Join (large, unsorted inputs)
SET enable_nestloop = off;
EXPLAIN SELECT * FROM orders o
JOIN users u ON o.user_id = u.id;

-- Merge Join (both inputs sorted, e.g. via an index)
SET enable_hashjoin = off;
EXPLAIN SELECT * FROM orders o
JOIN users u ON o.user_id = u.id;

RESET enable_nestloop;
RESET enable_hashjoin;

Executor

-- Executor runs plan nodes:
-- - Sequential Scan
-- - Index Scan / Index Only Scan
-- - Bitmap Scan
-- - Nested Loop
-- - Hash Join / Hash Aggregate
-- - Sort / Limit

-- Run with timing
EXPLAIN (ANALYZE, TIMING, BUFFERS)
SELECT * FROM orders WHERE total_amount > 1000;

Index Internals

B-Tree Index

-- B-Tree is default and most common
CREATE INDEX idx_users_email ON users(email);

-- B-Tree structure:
-- Root (page)
--   ├── Branch pages
--   │    └── ...
--   └── Leaf pages (actual index entries)

-- Index entries point to heap tuples via a TID:
-- (block number, item offset within the page)
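The B-Tree structure itself can be examined with pageinspect. A sketch, assuming the extension is installed and idx_users_email exists (page 0 is the metapage; page 1 is typically the first leaf in a small index):

-- Requires: CREATE EXTENSION pageinspect;
SELECT root, level FROM bt_metap('idx_users_email');

-- Entries on one index page: offset, heap TID, and key bytes
SELECT itemoffset, ctid, data
FROM bt_page_items('idx_users_email', 1)
LIMIT 5;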

Index Types Comparison

-- B-Tree: Equality and range queries
CREATE INDEX idx ON table (column);
CREATE INDEX idx ON table (column DESC);

-- Hash: Equality only
CREATE INDEX idx ON table USING HASH (column);

-- GiST: Multiple indexing strategies
-- Full-text search, geometric, etc.
CREATE INDEX idx ON articles USING GIST (to_tsvector('english', content));

-- GIN: Inverted index for composite types
-- JSON, arrays, full-text
CREATE INDEX idx ON documents USING GIN (data);

-- BRIN: Block Range Index
-- For sequentially stored data
CREATE INDEX idx ON logs USING BRIN (created_at);

Index-Only Scans

-- Index-only scan avoids heap access
-- Requires all referenced columns in the index AND heap pages
-- marked all-visible in the visibility map

CREATE INDEX idx_users_name_email ON users(name, email);

EXPLAIN SELECT name, email FROM users WHERE name = 'John';
-- Index Only Scan using idx_users_name_email

EXPLAIN SELECT * FROM users WHERE name = 'John';
-- Index Scan (needs to access heap for other columns)
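The visibility-map dependency is why a freshly loaded table may still show heap fetches: VACUUM is what sets the all-visible bits. A sketch using the index above:

-- VACUUM updates the visibility map
VACUUM users;

EXPLAIN (ANALYZE) SELECT name, email FROM users WHERE name = 'John';
-- Look for "Heap Fetches: 0" under the Index Only Scan node;
-- a nonzero count means some pages were not yet all-visible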

VACUUM and Autovacuum

Why VACUUM is Needed

-- MVCC creates dead tuples
-- VACUUM reclaims space

-- Example:
UPDATE users SET email = '[email protected]' WHERE id = 1;
-- Creates new row version
-- Old version becomes "dead tuple"

-- VACUUM reclaims the space
VACUUM users;
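Dead-tuple accumulation can be monitored per table before deciding whether a manual VACUUM is needed:

-- Dead tuples and last vacuum times per table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;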

VACUUM Types

-- VACUUM: Reclaims space, marks as reusable
VACUUM users;

-- VACUUM FULL: Rewrites entire table, removes all bloat
-- Requires exclusive lock
VACUUM FULL users;

-- VACUUM ANALYZE: VACUUM + update statistics
VACUUM ANALYZE users;

-- VACUUM FREEZE: Freeze old transaction IDs
VACUUM FREEZE users;

Autovacuum

-- postgresql.conf
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_insert_threshold = 1000
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1

-- Per-table autovacuum settings
ALTER TABLE users SET (autovacuum_vacuum_threshold = 1000);
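With the settings above, autovacuum fires when dead tuples exceed threshold + scale_factor * reltuples. A sketch of estimating each table's trigger point (reltuples is an estimate and is -1 for never-analyzed tables):

-- Estimated dead tuples needed to trigger autovacuum,
-- using the default threshold (50) and scale factor (0.2)
SELECT relname,
       reltuples::bigint AS estimated_rows,
       (50 + 0.2 * reltuples)::bigint AS vacuum_trigger_point
FROM pg_class
WHERE relkind = 'r'
ORDER BY reltuples DESC;

The consequence: large tables accumulate enormous amounts of bloat before the default scale factor triggers a vacuum, which is why lowering it per table (as shown above) is common.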

Transaction ID Wraparound

-- PostgreSQL uses 32-bit transaction IDs with circular comparison,
-- so only ~2 billion XIDs of headroom exist at any moment
-- VACUUM FREEZE prevents wraparound by marking old rows as frozen

-- Check transaction age
SELECT relname, age(relfrozenxid) as xid_age
FROM pg_class 
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC;

-- Emergency vacuum if approaching wraparound
VACUUM FREEZE users;
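Wraparound headroom should also be watched at the database level. A sketch (the headroom figure is an approximation; autovacuum forces anti-wraparound vacuums well before the hard limit):

SELECT datname,
       age(datfrozenxid) AS xid_age,
       2147483647 - age(datfrozenxid) AS approx_headroom
FROM pg_database
ORDER BY xid_age DESC;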

Buffer Cache

Shared Buffers

-- Requires: CREATE EXTENSION pg_buffercache;
-- View buffer statistics
SELECT 
    c.relname,
    count(*) as buffers,
    round(100.0 * count(*) / (SELECT count(*) FROM pg_buffercache), 2) as pct
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;

-- Cache hit ratio (should be > 95%)
SELECT 
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    round(100.0 * sum(heap_blks_hit) / nullif(sum(heap_blks_read) + sum(heap_blks_hit), 0), 2) as ratio
FROM pg_statio_user_tables;
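After a restart the cache is cold; the pg_prewarm extension can load a hot table back into shared buffers explicitly. A sketch, assuming a users table:

-- Requires: CREATE EXTENSION pg_prewarm;
SELECT pg_prewarm('users');  -- returns the number of blocks loaded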

Locking

Lock Types

-- Table-level lock modes (typical acquirers)
ACCESS SHARE           -- SELECT
ROW SHARE              -- SELECT FOR UPDATE/SHARE
ROW EXCLUSIVE          -- INSERT, UPDATE, DELETE
SHARE UPDATE EXCLUSIVE -- VACUUM, ANALYZE, CREATE INDEX CONCURRENTLY
SHARE                  -- CREATE INDEX (without CONCURRENTLY)
SHARE ROW EXCLUSIVE    -- CREATE TRIGGER, some ALTER TABLE forms
EXCLUSIVE              -- REFRESH MATERIALIZED VIEW CONCURRENTLY
ACCESS EXCLUSIVE       -- DROP TABLE, TRUNCATE, most ALTER TABLE

-- View current locks
SELECT
    locktype, relation::regclass,
    mode, granted, pid
FROM pg_locks
WHERE database = (SELECT oid FROM pg_database WHERE datname = current_database());

-- Terminate the session blocking a given backend
-- (12345 is a placeholder for the blocked backend's pid)
SELECT pg_terminate_backend(blocker)
FROM unnest(pg_blocking_pids(12345)) AS blocker;
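To see the whole picture before terminating anything, pg_blocking_pids (9.6+) can join blocked sessions to their blockers:

-- Show blocked/blocking session pairs with their queries
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity blocking ON blocking.pid = b.pid;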

Extensions

-- Enable built-in extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm";
CREATE EXTENSION IF NOT EXISTS "hstore";
CREATE EXTENSION IF NOT EXISTS "citext";

-- View installed extensions
SELECT * FROM pg_extension;

-- pg_stat_statements (query performance)
CREATE EXTENSION pg_stat_statements;

-- View slow queries (total_time/mean_time were renamed to
-- total_exec_time/mean_exec_time in PostgreSQL 13)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Conclusion

Understanding PostgreSQL’s internals—MVCC, WAL, the query planner, and storage—helps you design better schemas, write faster queries, and diagnose performance issues. These fundamentals apply whether you’re tuning a simple application or managing a complex distributed system.

In the next article, we’ll explore PostgreSQL’s recent developments: new features in versions 17 and 18, the evolving extension ecosystem, and emerging capabilities.
