Understanding Database Indexing

Introduction

Database indexes are the key to query performance. Without indexes, databases must scan every row. With proper indexing, queries can be orders of magnitude faster. This guide covers indexing fundamentals and optimization.

How Indexes Work

Without Index

-- Full table scan
SELECT * FROM users WHERE email = '[email protected]';
-- Scans every row in the table

With Index

-- Index seek
CREATE INDEX idx_users_email ON users(email);
-- Database jumps directly to matching rows

Trade-offs

Faster reads: Queries speed up dramatically
Slower writes: INSERT/UPDATE must update index
Storage: Indexes consume disk space
Maintenance: Indexes need occasional rebuilding

Types of Indexes

B-Tree Indexes (Default)

Most common index type. Balanced tree structure.

-- PostgreSQL
CREATE INDEX idx_users_email ON users(email);

-- MySQL
CREATE INDEX idx_users_email ON users(email);

Best for:

Equality (=)
Range (> < >= <=)
Sorting (ORDER BY)
Prefix matching (LIKE ‘abc%’)

Hash Indexes

-- PostgreSQL
CREATE INDEX idx_users_email ON users USING HASH(email);

Best for:

Exact equality only
Not good for ranges or sorting

GIN Indexes

For full-text search and arrays:

-- PostgreSQL
CREATE INDEX idx_articles_search ON articles USING GIN(search_vector);

GiST Indexes

For geospatial and range types:

CREATE INDEX idx_locations ON locations USING GIST(coordinates);

Composite Indexes

Creating Composite Indexes

CREATE INDEX idx_orders_user_status 
ON orders(user_id, status);

Column Order Matters

For queries:

-- This uses the index
SELECT * FROM orders WHERE user_id = 1;

-- This uses the index  
SELECT * FROM orders WHERE user_id = 1 AND status = 'active';

-- This does NOT use the index effectively
SELECT * FROM orders WHERE status = 'active';

Rule: Put equality conditions first, then range conditions.

Covering Indexes

-- PostgreSQL - Covering index
CREATE INDEX idx_users_email ON users(email) INCLUDE (name, created_at);

-- MySQL
CREATE INDEX idx_users_email ON users(email, name, created_at);

Indexing Strategies

What to Index

Index columns in:

WHERE clauses
JOIN conditions
ORDER BY
GROUP BY
DISTINCT

What NOT to Index

Small tables
Columns with low cardinality (few unique values)
Columns rarely queried
Frequently updated columns

Identifying Missing Indexes

PostgreSQL:

-- Check slow queries
SELECT 
    query,
    calls,
    mean_time,
    total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

MySQL:

-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;

-- Check EXPLAIN
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';

Query Optimization

Using EXPLAIN

-- PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM orders 
WHERE user_id = 1 
AND status = 'active'
ORDER BY created_at DESC;

Key terms:

Seq Scan: Full table scan (bad)
Index Scan: Uses index (good)
Index Only Scan: Best (no table access)

Partial Indexes

-- Only index active orders
CREATE INDEX idx_active_orders ON orders(created_at) 
WHERE status = 'active';

Expression Indexes

-- Index lowercased email
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

-- Use in query
SELECT * FROM users WHERE LOWER(email) = '[email protected]';

Index Maintenance

Analyzing Index Usage

-- PostgreSQL
SELECT 
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

Finding Unused Indexes

-- PostgreSQL
SELECT 
    indexrelname,
    idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey%';

Rebuilding Indexes

-- PostgreSQL - REINDEX
REINDEX INDEX idx_users_email;

-- MySQL - OPTIMIZE TABLE
OPTIMIZE TABLE users;

Common Patterns

UUID Primary Keys vs Auto-Increment

Auto-increment:

-- Fast inserts
-- Sequential inserts
-- Easy pagination

UUID:

-- Distributed systems
-- Security (non-guessable)
-- Need NEWID() or gen_random_uuid()

-- Performance tip: Use sequential UUID (UUID v7)

Composite Keys

-- Junction table
CREATE TABLE order_items (
    order_id BIGINT REFERENCES orders(id),
    product_id BIGINT REFERENCES products(id),
    quantity INT,
    PRIMARY KEY (order_id, product_id)
);

Query Patterns to Avoid

Leading Wildcards

-- Bad - can't use index
SELECT * FROM users WHERE name LIKE '%john%';

-- Good - can use index
SELECT * FROM users WHERE name LIKE 'john%';

Functions on Columns

-- Bad - can't use index
SELECT * FROM users WHERE YEAR(created_at) = 2026;

-- Good - can use index
SELECT * FROM users WHERE created_at >= '2026-01-01' 
AND created_at < '2027-01-01';

OR Conditions

-- Bad - often results in full scan
SELECT * FROM users WHERE email = '[email protected]' OR phone = '123';

-- Better - UNION
SELECT * FROM users WHERE email = '[email protected]'
UNION
SELECT * FROM users WHERE phone = '123';

##SQL

PostgreSQL

Default B-tree
Excellent JSON support (jsonb)
Partial indexes
Expression indexes
Covering indexes (INCLUDE)
Vector indexes (pgvector) for AI/RAG

MySQL

Default B-tree (InnoDB)
Full-text indexes
Spatial indexes
Composite limited to 16 columns

Vector Indexes for AI and RAG

With the rise of AI applications, vector indexes have become essential for similarity search and Retrieval-Augmented Generation (RAG) pipelines.

pgvector in PostgreSQL

-- Enable pgvector extension
CREATE EXTENSION vector;

-- Create vector column
ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- Create HNSW index for approximate nearest neighbor search
CREATE INDEX idx_documents_embedding 
ON documents 
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query for similar documents
SELECT id, content, 
       1 - (embedding <=> $query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> $query_embedding
LIMIT 5;

Index Types for Vectors

-- HNSW (Hierarchical Navigable Small World)
-- Best for: High accuracy, faster search
CREATE INDEX idx ON table USING hnsw (embedding vector_cosine_ops);

-- IVFFlat (Inverted File with Flattened vectors)
-- Best for: Larger datasets, faster build time
CREATE INDEX idx ON table USING ivfflat (embedding vector_cosine_ops);

Vector Index Best Practices

Dimension matching: Ensure embeddings match your model dimensions (1536 for OpenAI, 768 for BERT)
Normalization: Normalize vectors for cosine similarity
Metric choice: vector_cosine_ops for cosine, vector_l2_ops for Euclidean distance
Filter combination: Combine vector search with regular indexes for filtered queries

Cloud-Native Database Indexing

Amazon Aurora

Aurora automatically manages indexes but understanding its behavior helps:

-- Aurora MySQL: Monitor index usage
SELECT * FROM INFORMATION_SCHEMA.STATISTICS 
WHERE TABLE_SCHEMA = 'mydb';

-- Aurora PostgreSQL: Use pg_stat_statements
SELECT query, calls, mean_time 
FROM pg_stat_statements 
ORDER BY mean_time DESC;

Google Cloud SQL and Azure Database

-- Cloud SQL PostgreSQL: Recommended settings
-- Let cloud handle most maintenance
-- Focus on query optimization

-- Create indexes during off-peak hours
CREATE INDEX CONCURRENTLY idx_orders_date ON orders(created_at);

Managed Index Strategies

-- Partial indexes for time-series data
CREATE INDEX idx_logs_2026 
ON logs(timestamp) 
WHERE created_at >= '2026-01-01';

-- Partition-aware indexes
CREATE INDEX ON orders_2026(user_id, status);
CREATE INDEX ON orders_2025(user_id, status);

Advanced Indexing Patterns

Function-Based Indexes

-- Index computed values
CREATE INDEX idx_users_domain ON users((LOWER(email)));

-- Use in queries
SELECT * FROM users WHERE LOWER(email) = '[email protected]';

-- Index JSON fields (PostgreSQL)
CREATE INDEX idx_orders_data ON orders USING gin((data->'customer'));

-- Query JSON
SELECT * FROM orders WHERE data->>'status' = 'pending';

Index-Only Scans and Covering Indexes

-- PostgreSQL covering index with INCLUDE
CREATE INDEX idx_orders_covering 
ON orders(user_id, status) 
INCLUDE (total_amount, created_at);

-- This query never touches the table
SELECT user_id, total_amount 
FROM orders 
WHERE user_id = 1 
AND status = 'active';

Reverse Indexes for String Patterns

-- PostgreSQL: Reverse function index for suffix searches
CREATE INDEX idx_users_email_reverse 
ON users((reverse(email)));

-- Find all @example.com users
SELECT * FROM users 
WHERE reverse(email) LIKE reverse('%@example.com');

Index Performance Monitoring

Real-Time Monitoring

-- PostgreSQL: Live query analysis
SELECT 
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
FROM pg_stat_activity
WHERE state != 'idle'
AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY duration DESC;

-- MySQL: Show processlist
SHOW PROCESSLIST;

Index Hit Ratio

-- PostgreSQL
SELECT 
    sum(idx_blks_hit) * 100.0 / 
    nullif(sum(idx_blks_hit + idx_blks_read), 0) AS hit_ratio
FROM pg_statio_user_indexes;

-- Target: > 95% is good

Index Size Analysis

-- PostgreSQL
SELECT 
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    indexrelname
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;

Common Performance Pitfalls

Over-Indexing

-- Bad: Too many indexes
CREATE INDEX idx1 ON table(col1);
CREATE INDEX idx2 ON table(col2);
CREATE INDEX idx3 ON table(col3);
CREATE INDEX idx4 ON table(col1, col2);
CREATE INDEX idx5 ON table(col2, col3);

-- Better: Few strategic indexes
CREATE INDEX idx_composite ON table(col1, col2, col3);

Ignoring Index Column Order

-- Query: WHERE status = 'active' AND created_at > '2026-01-01'
-- Wrong order: Index on (created_at, status)
-- Right order: Index on (status, created_at)

CREATE INDEX idx_orders_status_date 
ON orders(status, created_at);

Not Using EXPLAIN

-- Always analyze your queries
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders 
WHERE user_id = 1 
AND status = 'active';

-- Look for:
-- - Seq Scan (bad for large tables)
-- - Index Scan (good)
-- - Index Only Scan (best)
-- - Actual time vs planned time

Indexing for Different Workloads

OLTP (Transactional)

-- Focus on:
-- - Primary key indexes
-- - Foreign key indexes
-- - WHERE clause columns

-- Keep indexes narrow
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);

OLAP (Analytical)

-- Focus on:
-- - Join columns
-- - GROUP BY columns
-- - Date ranges

-- Wider indexes acceptable
CREATE INDEX idx_sales_analytics 
ON sales(region, product_category, sale_date, amount);

Hybrid (HTAP)

-- Balance between read and write
-- Consider:
-- - Partial indexes for hot/cold data
-- - Composite indexes covering common queries
-- - Regular index maintenance

Conclusion

Indexing is both art and science. Start with primary keys and foreign keys, then add indexes based on query patterns. Use EXPLAIN to verify, monitor for unused indexes, and remember: more isn’t always better.

Key takeaways for 2026:

Vector indexes are essential for AI/RAG applications
Cloud databases handle more maintenance automatically
Partial indexes optimize for specific query patterns
Monitor continuously - indexes that were good may become obsolete
Test in production-like environments - staging may differ from production