Introduction
Understanding PostgreSQL’s internal architecture helps you make better design decisions, optimize queries, and troubleshoot performance issues. This article explores the core components that make PostgreSQL work: from how it handles concurrent operations to how queries are executed.
Process Architecture
PostgreSQL Process Model
PostgreSQL uses a multi-process architecture with a parent process managing multiple child processes:
PostgreSQL Process Hierarchy
โโโ postmaster (parent)
โ โโโ autovacuum launcher
โ โโโ background writer
โ โโโ checkpointer
โ โโโ logger
โ โโโ wal writer
โ โโโ backend processes (one per connection)
โ โโโ postgres: user query execution
โ โโโ postgres: user query execution
Key Processes
-- View all PostgreSQL processes
SELECT pid, usename, application_name, backend_type
FROM pg_stat_activity;
-- Key processes:
-- postmaster: Parent process, listens for connections
-- postgres: Backend process for each client connection
-- autovacuum: Automatic VACUUM and ANALYZE
-- bgwriter: Writes dirty buffers to disk
-- walwriter: Writes WAL to disk
-- checkpointer: Performs checkpoints
-- logger: Writes to log files
Memory Architecture
PostgreSQL uses two types of memory:
-- Shared Memory (shared by all processes)
-- postgresql.conf settings:
shared_buffers = 256MB -- Cache for table/index data
wal_buffers = 16MB -- WAL buffers
wal_writer_flush_after = 1MB
-- Local Memory (per-backend)
work_mem = 4MB -- Per sort/hash operation
maintenance_work_mem = 64MB -- VACUUM, CREATE INDEX
temp_buffers = 8MB -- Temporary tables
-- View memory usage
SELECT * FROM pg_backend_memory_contexts;
Storage and Page Layout
Page Structure
PostgreSQL stores data in 8KB pages. Each page has a fixed header followed by item pointers and actual data:
Page Layout (8KB)
+------------------+
| PageHeaderData | (24 bytes)
| - pd_lsn |
| - pd_tli |
| - pd_lower |
| - pd_upper |
| - pd_special |
+------------------+
| ItemIdData | (pointers to items)
| - lp_off |
| - lp_flags |
| - lp_len |
+------------------+
| |
| Tuple Data |
| (actual rows) |
| |
+------------------+
| Special Space | (index-specific data)
+------------------+
Tuple (Row) Structure
// Tuple header (minimal)
typedef struct HeapTupleHeaderData {
uint32 t_xmin; // Inserting transaction ID
uint32 t_xmax; // Deleting/updating transaction ID
uint32 t_cmin; // Command ID within transaction
uint32 t_cmax; // Command ID that deleted/updated
uint32 t_ctid; // Current tuple ID (self or new version)
uint16 t_hoff; // Offset to user data
bits8 t_bits[1]; // Null bitmap
char t_oid; // Object ID (if has OID)
} HeapTupleHeaderData;
// System columns available in every table:
// - ctid: Physical location of tuple
// - xmin: Transaction that created this row
// - xmax: Transaction that deleted/updated this row
// - cmin: Command ID within transaction
// - cmax: Command ID that changed this row
// - tableoid: Table's OID
TOAST (The Oversized-Attribute Storage Technique)
-- TOAST compresses/breaks large values
-- Automatic for values > 2KB
-- Check TOAST usage
SELECT relname,
reltoastrelid::regclass AS toast_table,
pg_size_pretty(pg_total_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relkind = 'r'
AND reltoastrelid != 0;
-- TOAST strategies (per column)
CREATE TABLE documents (
id SERIAL,
content text,
large_data bytea
);
ALTER TABLE documents
ALTER COLUMN content SET STORAGE EXTENDED; -- Allows compression
ALTER TABLE documents
ALTER COLUMN large_data SET STORAGE EXTERNAL; -- No compression
MVCC (Multi-Version Concurrency Control)
How MVCC Works
PostgreSQL uses MVCC to allow concurrent transactions without locking:
-- Every row has hidden system columns:
-- xmin: Transaction that created the row
-- xmax: Transaction that deleted/updated the row
-- Example: Transaction visibility
-- Transaction 1: INSERT
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 1000);
COMMIT;
-- Transaction 2: READ (sees the inserted row)
BEGIN;
SELECT * FROM accounts; -- Sees id=1, balance=1000
COMMIT;
-- Transaction 3: UPDATE
BEGIN;
UPDATE accounts SET balance = 2000 WHERE id = 1;
-- Now two versions exist:
-- Version 1: xmin=1, xmax=3 (old version, marked for deletion)
-- Version 2: xmin=3, xmax=0 (new version, visible)
COMMIT;
-- Transaction 4: READ (sees updated balance)
SELECT * FROM accounts; -- Sees id=1, balance=2000
Transaction Visibility Rules
-- Check transaction visibility
-- A row is visible if:
-- 1. xmin is committed AND
-- 2. Transaction is not in progress AND
-- 3. Either xmax = 0 OR xmax is in progress OR xmax was rolled back
-- View row versions
SELECT ctid, xmin, xmax, * FROM accounts;
-- ctid: (page, offset) - physical location
-- xmin: Transaction that created this version
-- xmax: Transaction that replaced this version (0 if current)
Isolation Levels
-- READ COMMITTED (default)
-- Sees all committed changes from other transactions
-- Each statement sees snapshot at start
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- Reads committed value
-- Another transaction updates balance
SELECT balance FROM accounts WHERE id = 1; -- May see new value
COMMIT;
-- REPEATABLE READ
-- Sees consistent snapshot from first query
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- Other transactions' updates are NOT visible
SELECT balance FROM accounts WHERE id = 1; -- Same value
COMMIT;
-- SERIALIZABLE
-- Equivalent to serial execution
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Transactions are serialized; may get serialization failure
Write-Ahead Logging (WAL)
WAL Purpose
WAL ensures durability: changes are written to disk as log records before data pages are modified:
-- WAL structure
-- Each WAL record contains:
-- - Transaction ID
-- - LSN (Log Sequence Number)
-- - Previous LSN
-- - Page modifications
-- View WAL stats
SELECT * FROM pg_stat_wal;
-- View WAL files
SELECT * FROM pg_ls_waldir() ORDER BY name;
-- Checkpoint info
SELECT * FROM pg_stat_bgwriter;
Checkpoints
-- Force checkpoint
CHECKPOINT;
-- Configure checkpoints
-- postgresql.conf
checkpoint_timeout = 5min # Time between checkpoints
checkpoint_completion_target = 0.9 # I/O spread
max_wal_size = 1GB # Max WAL before checkpoint
min_wal_size = 80MB # Minimum WAL to keep
-- View checkpoint history
SELECT * FROM pg_stat_checkpointer;
WAL Recovery Process
-- Recovery steps:
-- 1. Find last valid checkpoint
-- 2. Read WAL from checkpoint
-- 3. Redo all transactions
-- 4. Roll back uncommitted transactions
-- Point-in-time recovery
-- postgresql.auto.conf
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2026-03-05 10:00:00 UTC'
recovery_target_action = 'promote'
Query Processing
Query Flow
SQL Query
โ
Parser (syntax check)
โ
Rewriter (rules, views)
โ
Planner/Optimizer (create execution plan)
โ
Executor (run the plan)
โ
Results
Parser
-- Parser checks SQL syntax
-- Creates parse tree
-- Validates table/column names
-- Example parse tree for:
-- SELECT name FROM users WHERE id = 1;
-- ParseTree:
-- SelectStmt
-- - targetList: [name]
-- - fromClause: [users]
-- - whereClause: [id = 1]
Rewriter
-- Rewriter applies:
-- - Rules (CREATE RULE)
-- - Views
-- - Materialized views
-- Example: View definition
CREATE VIEW user_orders AS
SELECT u.name, o.total_amount
FROM users u
JOIN orders o ON u.id = o.user_id;
-- Rewriter expands view into underlying query
Planner/Optimizer
-- Planner creates execution plan
EXPLAIN SELECT * FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2026-01-01';
-- Execution Plan:
-- Nested Loop (cost=...)
-- -> Index Scan on users (created_at)
-- -> Index Lookup on orders (user_id)
Join Methods
-- Nested Loop Join
EXPLAIN (FORMAT TEXT)
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id;
-- Hash Join (for large tables)
EXPLAIN (FORMAT TEXT)
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id;
-- Merge Join (for pre-sorted data)
EXPLAIN (FORMAT TEXT)
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id
ORDER BY user_id;
Executor
-- Executor runs plan nodes:
-- - Sequential Scan
-- - Index Scan / Index Only Scan
-- - Bitmap Scan
-- - Nested Loop
-- - Hash Join / Hash Aggregate
-- - Sort / Limit
-- Run with timing
EXPLAIN (ANALYZE, TIMING, BUFFERS)
SELECT * FROM orders WHERE total_amount > 1000;
Index Internals
B-Tree Index
-- B-Tree is default and most common
CREATE INDEX idx_users_email ON users(email);
-- B-Tree structure:
-- Root (page)
-- โโโ Branch pages
-- โ โโโ ...
-- โโโ Leaf pages (actual index entries)
-- Index entries point to heap tuples via TID
-- (page_number, item_pointer)
Index Types Comparison
-- B-Tree: Equality and range queries
CREATE INDEX idx ON table (column);
CREATE INDEX idx ON table (column DESC);
-- Hash: Equality only
CREATE INDEX idx ON table USING HASH (column);
-- GiST: Multiple indexing strategies
-- Full-text search, geometric, etc.
CREATE INDEX idx ON articles USING GIST (to_tsvector('english', content));
-- GIN: Inverted index for composite types
-- JSON, arrays, full-text
CREATE INDEX idx ON documents USING GIN (data);
-- BRIN: Block Range Index
-- For sequentially stored data
CREATE INDEX idx ON logs USING BRIN (created_at);
Index-Only Scans
-- Index-only scan avoids heap access
-- Only works if all columns are in index
CREATE INDEX idx_users_name_email ON users(name, email);
EXPLAIN SELECT name, email FROM users WHERE name = 'John';
-- Index Only Scan using idx_users_name_email
EXPLAIN SELECT * FROM users WHERE name = 'John';
-- Index Scan (needs to access heap for other columns)
VACUUM and Autovacuum
Why VACUUM is Needed
-- MVCC creates dead tuples
-- VACUUM reclaims space
-- Example:
UPDATE users SET email = '[email protected]' WHERE id = 1;
-- Creates new row version
-- Old version becomes "dead tuple"
-- VACUUM reclaims the space
VACUUM users;
VACUUM Types
-- VACUUM: Reclaims space, marks as reusable
VACUUM users;
-- VACUUM FULL: Rewrites entire table, removes all bloat
-- Requires exclusive lock
VACUUM FULL users;
-- VACUUM ANALYZE: VACUUM + update statistics
VACUUM ANALYZE users;
-- VACUUM FREEZE: Freeze old transaction IDs
VACUUM FREEZE users;
Autovacuum
-- postgresql.conf
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_insert_threshold = 1000
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1
-- Per-table autovacuum settings
ALTER TABLE users SET (autovacuum_vacuum_threshold = 1000);
Transaction ID Wraparound
-- PostgreSQL uses 32-bit transaction IDs
-- ~4 billion transactions possible
-- VACUUM FREEZE prevents wraparound
-- Check transaction age
SELECT relname, age(relfrozenxid) as xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC;
-- Emergency vacuum if approaching wraparound
VACUUM FREEZE users;
Buffer Cache
Shared Buffers
-- View buffer statistics
SELECT
c.relname,
count(*) as buffers,
round(100.0 * count(*) / (SELECT count(*) FROM pg_buffercache), 2) as pct
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;
-- Cache hit ratio (should be > 95%)
SELECT
sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
round(100.0 * sum(heap_blks_hit) / nullif(sum(heap_blks_read) + sum(heap_blks_hit), 0), 2) as ratio
FROM pg_statio_user_tables;
Locking
Lock Types
-- Table-level locks
ACCESS SHARE -- SELECT
ROW SHARE -- SELECT FOR UPDATE
ROW EXCLUSIVE -- INSERT, UPDATE, DELETE
SHARE UPDATE EXCLUSIVE -- VACUUM, ANALYZE
SHARE -- CREATE INDEX
SHARE ROW EXCLUSIVE -- ALTER TABLE
EXCLUSIVE -- DROP TABLE
ACCESS EXCLUSIVE -- Most DDL
-- View current locks
SELECT
locktype, database::regclass, relation::regclass,
mode, granted, pid
FROM pg_locks
WHERE database = (SELECT oid FROM pg_database WHERE datname = current_database());
-- Kill blocking query
SELECT pg_terminate_backend(pid)
FROM pg_locks
WHERE mode = 'ExclusiveLock'
AND granted = true;
Extensions
Popular Extensions
-- Enable built-in extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm";
CREATE EXTENSION IF NOT EXISTS "hstore";
CREATE EXTENSION IF NOT EXISTS "citext";
-- View installed extensions
SELECT * FROM pg_extension;
-- pg_stat_statements (query performance)
CREATE EXTENSION pg_stat_statements;
-- View slow queries
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
Conclusion
Understanding PostgreSQL’s internalsโMVCC, WAL, the query planner, and storageโhelps you design better schemas, write faster queries, and diagnose performance issues. These fundamentals apply whether you’re tuning a simple application or managing a complex distributed system.
In the next article, we’ll explore PostgreSQL’s recent developments: new features in versions 17 and 18, the evolving extension ecosystem, and emerging capabilities.
Comments