Query Optimization: Indexing Strategies for 1M+ Records

Introduction

Query optimization is the difference between a responsive application and a sluggish one. As datasets grow to millions or billions of records, inefficient queries become bottlenecks that no amount of hardware can fix. Understanding indexing strategies, query analysis, and optimization techniques is essential for building scalable applications.

This comprehensive guide covers query optimization and indexing strategies for large-scale databases.

Core Concepts & Terminology

Index

Data structure that enables fast lookup of rows based on column values.

Query Plan

Sequence of steps database uses to execute a query.

Cardinality

Number of unique values in a column (high cardinality = many unique values).

Selectivity

Percentage of rows matching a condition (high selectivity = fewer rows).

Index Scan

Reading entire index to find matching rows.

Index Seek

Using index to directly find matching rows.

Table Scan

Reading entire table to find matching rows.

Join

Combining rows from multiple tables.

Aggregation

Combining multiple rows into summary statistics.

Covering Index

Index containing all columns needed for a query.

Composite Index

Index on multiple columns.

Index Types

B-Tree Index (Most Common)

-- Create B-tree index
CREATE INDEX idx_users_email ON users(email);

-- Composite B-tree index
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);

-- Partial index (only indexed rows matching condition)
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';

-- Covering index (includes all needed columns)
CREATE INDEX idx_users_email_name ON users(email) INCLUDE (name, created_at);

Hash Index

-- Hash index for equality comparisons
CREATE INDEX idx_user_id_hash ON users USING HASH (user_id);

-- Good for: Exact matches
-- Bad for: Range queries, sorting

Full-Text Index

-- Full-text index for text search
CREATE FULLTEXT INDEX idx_articles_content ON articles(title, content);

-- Query using full-text search
SELECT * FROM articles WHERE MATCH(title, content) AGAINST('database optimization' IN BOOLEAN MODE);

Spatial Index

-- Spatial index for geographic queries
CREATE SPATIAL INDEX idx_locations_geo ON locations(coordinates);

-- Query nearby locations
SELECT * FROM locations WHERE ST_Distance(coordinates, POINT(40.7128, -74.0060)) < 1000;

Query Analysis

EXPLAIN and EXPLAIN ANALYZE

-- Basic EXPLAIN (shows plan without executing)
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';

-- EXPLAIN ANALYZE (executes and shows actual stats)
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]';

-- Detailed analysis
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) 
SELECT * FROM users WHERE email = '[email protected]';

-- Output interpretation:
-- - Seq Scan: Full table scan (slow)
-- - Index Scan: Using index (fast)
-- - Rows: Estimated vs actual rows
-- - Buffers: Cache hits vs disk reads

Query Performance Metrics

-- Identify slow queries
SELECT
    query,
    calls,
    total_time,
    mean_time,
    max_time
FROM pg_stat_statements
WHERE mean_time > 100  -- Queries averaging > 100ms
ORDER BY mean_time DESC
LIMIT 20;

-- Check index usage
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

-- Find unused indexes
SELECT
    schemaname,
    tablename,
    indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;

Optimization Techniques

1. Index Selection

-- ❌ INEFFICIENT: No index
SELECT * FROM orders WHERE customer_id = 123;

-- ✅ OPTIMIZED: Index on customer_id
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
SELECT * FROM orders WHERE customer_id = 123;

-- ❌ INEFFICIENT: Index on wrong column
CREATE INDEX idx_orders_amount ON orders(amount);
SELECT * FROM orders WHERE customer_id = 123;  -- Still slow

-- ✅ OPTIMIZED: Composite index
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);
SELECT * FROM orders WHERE customer_id = 123 ORDER BY order_date DESC;

2. Query Rewriting

-- ❌ INEFFICIENT: Function on indexed column
SELECT * FROM users WHERE YEAR(created_at) = 2025;

-- ✅ OPTIMIZED: Range query
SELECT * FROM users WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';

-- ❌ INEFFICIENT: OR with different columns
SELECT * FROM users WHERE email = '[email protected]' OR phone = '555-1234';

-- ✅ OPTIMIZED: UNION
SELECT * FROM users WHERE email = '[email protected]'
UNION
SELECT * FROM users WHERE phone = '555-1234';

-- ❌ INEFFICIENT: NOT IN with subquery
SELECT * FROM orders WHERE customer_id NOT IN (SELECT id FROM customers WHERE status = 'inactive');

-- ✅ OPTIMIZED: LEFT JOIN
SELECT o.* FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id AND c.status = 'inactive'
WHERE c.id IS NULL;

3. Join Optimization

-- ❌ INEFFICIENT: Multiple joins without indexes
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE c.country = 'US';

-- ✅ OPTIMIZED: Proper indexes
CREATE INDEX idx_customers_country ON customers(country);
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_product_id ON orders(product_id);

-- ✅ OPTIMIZED: Join order matters
SELECT * FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN products p ON o.product_id = p.id
WHERE c.country = 'US';

-- ❌ INEFFICIENT: Cartesian product
SELECT * FROM orders o, customers c WHERE o.customer_id = c.id;

-- ✅ OPTIMIZED: Explicit join
SELECT * FROM orders o INNER JOIN customers c ON o.customer_id = c.id;

4. Aggregation Optimization

-- ❌ INEFFICIENT: Aggregating all rows
SELECT COUNT(*) FROM orders;

-- ✅ OPTIMIZED: Use approximate count (if acceptable)
SELECT reltuples FROM pg_class WHERE relname = 'orders';

-- ❌ INEFFICIENT: GROUP BY without index
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;

-- ✅ OPTIMIZED: Index on GROUP BY column
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;

-- ❌ INEFFICIENT: HAVING without WHERE
SELECT customer_id, COUNT(*) as order_count FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 10;

-- ✅ OPTIMIZED: Filter before grouping
SELECT customer_id, COUNT(*) as order_count FROM orders
WHERE created_at >= '2025-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 10;

Indexing Strategies for Large Datasets

Partitioned Indexes

-- Create partitioned table
CREATE TABLE orders (
    order_id BIGSERIAL,
    customer_id INTEGER,
    order_date DATE,
    amount DECIMAL(10, 2)
) PARTITION BY RANGE (YEAR(order_date));

-- Create partitions
CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM (2024) TO (2025);

CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM (2025) TO (2026);

-- Create indexes on partitions
CREATE INDEX idx_orders_2024_customer ON orders_2024(customer_id);
CREATE INDEX idx_orders_2025_customer ON orders_2025(customer_id);

-- Queries benefit from partition pruning
EXPLAIN SELECT * FROM orders WHERE order_date >= '2025-01-01';

Covering Indexes

-- ❌ INEFFICIENT: Index doesn't cover all needed columns
CREATE INDEX idx_users_email ON users(email);
SELECT email, name, created_at FROM users WHERE email = '[email protected]';
-- Requires: Index lookup + table lookup

-- ✅ OPTIMIZED: Covering index
CREATE INDEX idx_users_email_covering ON users(email) INCLUDE (name, created_at);
SELECT email, name, created_at FROM users WHERE email = '[email protected]';
-- Requires: Index lookup only (no table access)

Partial Indexes

-- ❌ INEFFICIENT: Index includes inactive records
CREATE INDEX idx_users_email ON users(email);

-- ✅ OPTIMIZED: Partial index for active users only
CREATE INDEX idx_active_users_email ON users(email) WHERE status = 'active';

-- Query using partial index
SELECT * FROM users WHERE email = '[email protected]' AND status = 'active';

Real-World Optimization Examples

E-Commerce Order Queries

-- Common queries
-- 1. Get customer's recent orders
SELECT * FROM orders WHERE customer_id = 123 ORDER BY order_date DESC LIMIT 10;

-- 2. Get orders by status
SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at DESC;

-- 3. Get orders in date range
SELECT * FROM orders WHERE order_date >= '2025-01-01' AND order_date < '2025-02-01';

-- Optimal indexes
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);
CREATE INDEX idx_orders_status_date ON orders(status, created_at DESC);
CREATE INDEX idx_orders_date ON orders(order_date);

-- Verify with EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123 ORDER BY order_date DESC LIMIT 10;

Analytics Queries

-- Common analytics queries
-- 1. Daily revenue
SELECT DATE(order_date) as day, SUM(amount) as revenue
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY DATE(order_date);

-- 2. Customer lifetime value
SELECT customer_id, COUNT(*) as orders, SUM(amount) as total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC;

-- 3. Product performance
SELECT product_id, COUNT(*) as sales, SUM(amount) as revenue
FROM order_items
GROUP BY product_id
ORDER BY revenue DESC;

-- Optimal indexes
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_order_items_product ON order_items(product_id);

-- Consider materialized views for complex aggregations
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT DATE(order_date) as day, SUM(amount) as revenue
FROM orders
GROUP BY DATE(order_date);

CREATE INDEX idx_daily_revenue_day ON daily_revenue(day);

Monitoring and Maintenance

Index Maintenance

-- Rebuild fragmented indexes
REINDEX INDEX idx_users_email;

-- Analyze table statistics
ANALYZE users;

-- Vacuum to reclaim space
VACUUM users;

-- Full maintenance
VACUUM ANALYZE users;

-- Check index size
SELECT
    schemaname,
    tablename,
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

Performance Monitoring

-- Monitor query performance over time
SELECT
    query,
    calls,
    total_time,
    mean_time,
    stddev_time,
    max_time
FROM pg_stat_statements
WHERE query LIKE '%orders%'
ORDER BY mean_time DESC;

-- Check cache hit ratio
SELECT
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM pg_statio_user_tables;

-- Monitor index usage
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

Best Practices & Common Pitfalls

Best Practices

Analyze Before Optimizing: Use EXPLAIN ANALYZE
Index Strategically: Index columns used in WHERE, JOIN, ORDER BY
Avoid Over-Indexing: Too many indexes slow writes
Use Composite Indexes: Combine related columns
Monitor Performance: Track query times regularly
Maintain Indexes: Rebuild fragmented indexes
Test Changes: Test optimizations before production
Document Decisions: Document why indexes exist
Regular Maintenance: VACUUM and ANALYZE regularly
Plan for Growth: Design indexes for future scale

Common Pitfalls

Missing Indexes: Queries without proper indexes
Over-Indexing: Too many indexes on same table
Wrong Index Type: Using wrong index for query pattern
Stale Statistics: Not running ANALYZE regularly
Fragmented Indexes: Not rebuilding fragmented indexes
Unused Indexes: Keeping indexes that aren’t used
Poor Join Order: Joining tables in inefficient order
Functions on Indexed Columns: Preventing index usage
Inadequate Monitoring: Not tracking performance
Premature Optimization: Optimizing before profiling

External Resources

Documentation

Tools

Learning Resources

Conclusion

Query optimization is both art and science. Success requires understanding your data, analyzing query plans, and strategically applying indexing and query rewriting techniques. Start with EXPLAIN ANALYZE, identify bottlenecks, and optimize based on real-world usage patterns.

Remember: premature optimization is the root of all evil, but ignoring performance is worse. Profile first, optimize strategically, and monitor continuously.

Optimized queries are the foundation of responsive applications.