Introduction
Database indexes are the key to query performance. Without indexes, databases must scan every row. With proper indexing, queries can be orders of magnitude faster. This guide covers indexing fundamentals and optimization.
How Indexes Work
Without Index
-- Full table scan
SELECT * FROM users WHERE email = '[email protected]';
-- Scans every row in the table
With Index
-- Index seek
CREATE INDEX idx_users_email ON users(email);
-- Database jumps directly to matching rows
Trade-offs
- Faster reads: Queries speed up dramatically
- Slower writes: INSERT/UPDATE must update index
- Storage: Indexes consume disk space
- Maintenance: Indexes need occasional rebuilding
Types of Indexes
B-Tree Indexes (Default)
Most common index type. Balanced tree structure.
-- PostgreSQL
CREATE INDEX idx_users_email ON users(email);
-- MySQL
CREATE INDEX idx_users_email ON users(email);
Best for:
- Equality (=)
- Range (> < >= <=)
- Sorting (ORDER BY)
- Prefix matching (LIKE ‘abc%’)
Hash Indexes
-- PostgreSQL
CREATE INDEX idx_users_email ON users USING HASH(email);
Best for:
- Exact equality only
- Not good for ranges or sorting
GIN Indexes
For full-text search and arrays:
-- PostgreSQL
CREATE INDEX idx_articles_search ON articles USING GIN(search_vector);
GiST Indexes
For geospatial and range types:
CREATE INDEX idx_locations ON locations USING GIST(coordinates);
Composite Indexes
Creating Composite Indexes
CREATE INDEX idx_orders_user_status
ON orders(user_id, status);
Column Order Matters
For queries:
-- This uses the index
SELECT * FROM orders WHERE user_id = 1;
-- This uses the index
SELECT * FROM orders WHERE user_id = 1 AND status = 'active';
-- This does NOT use the index effectively
SELECT * FROM orders WHERE status = 'active';
Rule: Put equality conditions first, then range conditions.
Covering Indexes
-- PostgreSQL - Covering index
CREATE INDEX idx_users_email ON users(email) INCLUDE (name, created_at);
-- MySQL
CREATE INDEX idx_users_email ON users(email, name, created_at);
Indexing Strategies
What to Index
Index columns in:
- WHERE clauses
- JOIN conditions
- ORDER BY
- GROUP BY
- DISTINCT
What NOT to Index
- Small tables
- Columns with low cardinality (few unique values)
- Columns rarely queried
- Frequently updated columns
Identifying Missing Indexes
PostgreSQL:
-- Check slow queries
SELECT
query,
calls,
mean_time,
total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
MySQL:
-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;
-- Check EXPLAIN
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';
Query Optimization
Using EXPLAIN
-- PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 1
AND status = 'active'
ORDER BY created_at DESC;
Key terms:
- Seq Scan: Full table scan (bad)
- Index Scan: Uses index (good)
- Index Only Scan: Best (no table access)
Partial Indexes
-- Only index active orders
CREATE INDEX idx_active_orders ON orders(created_at)
WHERE status = 'active';
Expression Indexes
-- Index lowercased email
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
-- Use in query
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
Index Maintenance
Analyzing Index Usage
-- PostgreSQL
SELECT
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
Finding Unused Indexes
-- PostgreSQL
SELECT
indexrelname,
idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey%';
Rebuilding Indexes
-- PostgreSQL - REINDEX
REINDEX INDEX idx_users_email;
-- MySQL - OPTIMIZE TABLE
OPTIMIZE TABLE users;
Common Patterns
UUID Primary Keys vs Auto-Increment
Auto-increment:
-- Fast inserts
-- Sequential inserts
-- Easy pagination
UUID:
-- Distributed systems
-- Security (non-guessable)
-- Need NEWID() or gen_random_uuid()
-- Performance tip: Use sequential UUID (UUID v7)
Composite Keys
-- Junction table
CREATE TABLE order_items (
order_id BIGINT REFERENCES orders(id),
product_id BIGINT REFERENCES products(id),
quantity INT,
PRIMARY KEY (order_id, product_id)
);
Query Patterns to Avoid
Leading Wildcards
-- Bad - can't use index
SELECT * FROM users WHERE name LIKE '%john%';
-- Good - can use index
SELECT * FROM users WHERE name LIKE 'john%';
Functions on Columns
-- Bad - can't use index
SELECT * FROM users WHERE YEAR(created_at) = 2026;
-- Good - can use index
SELECT * FROM users WHERE created_at >= '2026-01-01'
AND created_at < '2027-01-01';
OR Conditions
-- Bad - often results in full scan
SELECT * FROM users WHERE email = '[email protected]' OR phone = '123';
-- Better - UNION
SELECT * FROM users WHERE email = '[email protected]'
UNION
SELECT * FROM users WHERE phone = '123';
##SQL
PostgreSQL
- Default B-tree
- Excellent JSON support (jsonb)
- Partial indexes
- Expression indexes
- Covering indexes (INCLUDE)
- Vector indexes (pgvector) for AI/RAG
MySQL
- Default B-tree (InnoDB)
- Full-text indexes
- Spatial indexes
- Composite limited to 16 columns
Vector Indexes for AI and RAG
With the rise of AI applications, vector indexes have become essential for similarity search and Retrieval-Augmented Generation (RAG) pipelines.
pgvector in PostgreSQL
-- Enable pgvector extension
CREATE EXTENSION vector;
-- Create vector column
ALTER TABLE documents ADD COLUMN embedding vector(1536);
-- Create HNSW index for approximate nearest neighbor search
CREATE INDEX idx_documents_embedding
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Query for similar documents
SELECT id, content,
1 - (embedding <=> $query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> $query_embedding
LIMIT 5;
Index Types for Vectors
-- HNSW (Hierarchical Navigable Small World)
-- Best for: High accuracy, faster search
CREATE INDEX idx ON table USING hnsw (embedding vector_cosine_ops);
-- IVFFlat (Inverted File with Flattened vectors)
-- Best for: Larger datasets, faster build time
CREATE INDEX idx ON table USING ivfflat (embedding vector_cosine_ops);
Vector Index Best Practices
- Dimension matching: Ensure embeddings match your model dimensions (1536 for OpenAI, 768 for BERT)
- Normalization: Normalize vectors for cosine similarity
- Metric choice:
vector_cosine_opsfor cosine,vector_l2_opsfor Euclidean distance - Filter combination: Combine vector search with regular indexes for filtered queries
Cloud-Native Database Indexing
Amazon Aurora
Aurora automatically manages indexes but understanding its behavior helps:
-- Aurora MySQL: Monitor index usage
SELECT * FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = 'mydb';
-- Aurora PostgreSQL: Use pg_stat_statements
SELECT query, calls, mean_time
FROM pg_stat_statements
ORDER BY mean_time DESC;
Google Cloud SQL and Azure Database
-- Cloud SQL PostgreSQL: Recommended settings
-- Let cloud handle most maintenance
-- Focus on query optimization
-- Create indexes during off-peak hours
CREATE INDEX CONCURRENTLY idx_orders_date ON orders(created_at);
Managed Index Strategies
-- Partial indexes for time-series data
CREATE INDEX idx_logs_2026
ON logs(timestamp)
WHERE created_at >= '2026-01-01';
-- Partition-aware indexes
CREATE INDEX ON orders_2026(user_id, status);
CREATE INDEX ON orders_2025(user_id, status);
Advanced Indexing Patterns
Function-Based Indexes
-- Index computed values
CREATE INDEX idx_users_domain ON users((LOWER(email)));
-- Use in queries
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
-- Index JSON fields (PostgreSQL)
CREATE INDEX idx_orders_data ON orders USING gin((data->'customer'));
-- Query JSON
SELECT * FROM orders WHERE data->>'status' = 'pending';
Index-Only Scans and Covering Indexes
-- PostgreSQL covering index with INCLUDE
CREATE INDEX idx_orders_covering
ON orders(user_id, status)
INCLUDE (total_amount, created_at);
-- This query never touches the table
SELECT user_id, total_amount
FROM orders
WHERE user_id = 1
AND status = 'active';
Reverse Indexes for String Patterns
-- PostgreSQL: Reverse function index for suffix searches
CREATE INDEX idx_users_email_reverse
ON users((reverse(email)));
-- Find all @example.com users
SELECT * FROM users
WHERE reverse(email) LIKE reverse('%@example.com');
Index Performance Monitoring
Real-Time Monitoring
-- PostgreSQL: Live query analysis
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE state != 'idle'
AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY duration DESC;
-- MySQL: Show processlist
SHOW PROCESSLIST;
Index Hit Ratio
-- PostgreSQL
SELECT
sum(idx_blks_hit) * 100.0 /
nullif(sum(idx_blks_hit + idx_blks_read), 0) AS hit_ratio
FROM pg_statio_user_indexes;
-- Target: > 95% is good
Index Size Analysis
-- PostgreSQL
SELECT
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
indexrelname
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;
Common Performance Pitfalls
Over-Indexing
-- Bad: Too many indexes
CREATE INDEX idx1 ON table(col1);
CREATE INDEX idx2 ON table(col2);
CREATE INDEX idx3 ON table(col3);
CREATE INDEX idx4 ON table(col1, col2);
CREATE INDEX idx5 ON table(col2, col3);
-- Better: Few strategic indexes
CREATE INDEX idx_composite ON table(col1, col2, col3);
Ignoring Index Column Order
-- Query: WHERE status = 'active' AND created_at > '2026-01-01'
-- Wrong order: Index on (created_at, status)
-- Right order: Index on (status, created_at)
CREATE INDEX idx_orders_status_date
ON orders(status, created_at);
Not Using EXPLAIN
-- Always analyze your queries
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE user_id = 1
AND status = 'active';
-- Look for:
-- - Seq Scan (bad for large tables)
-- - Index Scan (good)
-- - Index Only Scan (best)
-- - Actual time vs planned time
Indexing for Different Workloads
OLTP (Transactional)
-- Focus on:
-- - Primary key indexes
-- - Foreign key indexes
-- - WHERE clause columns
-- Keep indexes narrow
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
OLAP (Analytical)
-- Focus on:
-- - Join columns
-- - GROUP BY columns
-- - Date ranges
-- Wider indexes acceptable
CREATE INDEX idx_sales_analytics
ON sales(region, product_category, sale_date, amount);
Hybrid (HTAP)
-- Balance between read and write
-- Consider:
-- - Partial indexes for hot/cold data
-- - Composite indexes covering common queries
-- - Regular index maintenance
Conclusion
Indexing is both art and science. Start with primary keys and foreign keys, then add indexes based on query patterns. Use EXPLAIN to verify, monitor for unused indexes, and remember: more isn’t always better.
Key takeaways for 2026:
- Vector indexes are essential for AI/RAG applications
- Cloud databases handle more maintenance automatically
- Partial indexes optimize for specific query patterns
- Monitor continuously - indexes that were good may become obsolete
- Test in production-like environments - staging may differ from production
Resources
- Use The Index, Luke
- PostgreSQL Indexes
- MySQL Indexes
- pgvector Documentation
- Aurora Indexing Best Practices
Comments