Introduction
Database performance is the backbone of application responsiveness. When users experience slow queries, it cascades into poor user experience, lost revenue, and frustrated developers. Database indexing is the most powerful tool in your optimization toolkit, yet many developers only scratch the surface of what’s possible.
In this comprehensive guide, we’ll explore indexing strategies that can transform query performance from seconds to milliseconds. We’ll cover fundamental concepts, advanced techniques, and practical patterns you can apply immediately to your projects.
Understanding How Databases Read Data
The Table Scan Problem
Before understanding indexes, you must understand the problem they solve. When you execute a query without an index, the database performs a table scanโreading every single row in the table to find matching records.
Consider a table with 10 million user records:
SELECT * FROM users WHERE email = '[email protected]';
Without an index, the database reads all 10 million rows sequentially. If each row is 500 bytes and the disk can read at 100 MB/s, this takes approximately 50 seconds. This is unacceptable for production systems.
How Indexes Solve This
An index is a separate data structure that organizes a subset of your data in a way that makes lookups fastโtypically O(log n) instead of O(n). Think of it like the index at the back of a book: instead of reading every page to find a topic, you consult the index and jump directly to the relevant pages.
Most databases use B-tree (balanced tree) indexes by default, which maintain sorted order and allow fast lookups, range queries, and ordering operations.
Types of Database Indexes
B-Tree Indexes
B-tree (balanced tree) is the most common index type in relational databases. It maintains data in a sorted tree structure, enabling efficient lookups, range queries, and ordered access.
-- Create a B-tree index on email column
CREATE INDEX idx_users_email ON users(email);
-- B-tree indexes support these operations efficiently
SELECT * FROM users WHERE email = '[email protected]'; -- Exact match
SELECT * FROM users WHERE age >= 18 AND age <= 35; -- Range query
SELECT * FROM users ORDER BY created_at DESC; -- Ordered results
B-tree indexes work best for:
- Columns with high cardinality (many unique values)
- Equality queries (=, !=)
- Range queries (<, >, BETWEEN, LIKE ‘prefix%’)
- ORDER BY clauses
Hash Indexes
Hash indexes use a hash function to map values to bucket locations. They provide O(1) lookup time for exact matches but cannot support range queries.
-- Create a hash index (MySQL/MariaDB)
CREATE INDEX idx_users_email_hash ON users(email) USING HASH;
-- PostgreSQL uses hash indexes differently
CREATE INDEX idx_users_email_hash ON users USING hash(email);
Use hash indexes when:
- You only need exact match queries
- The index fits entirely in memory
- The column has many duplicate values (low cardinality)
GIN Indexes (Generalized Inverted Index)
GIN indexes are designed for composite data types like arrays, JSONB, and full-text search.
-- PostgreSQL: Index an array column
CREATE INDEX idx_products_tags ON products USING GIN(tags);
-- Query array contents efficiently
SELECT * FROM products WHERE tags @> '{"electronics"}';
-- Index JSONB data
CREATE INDEX idx_orders_metadata ON orders USING GIN(metadata);
-- Full-text search index
CREATE INDEX idx_articles_content
ON articles USING GIN(to_tsvector('english', content));
GiST Indexes (Generalized Search Tree)
GiST indexes are flexible, allowing you to implement custom indexing strategies for geographical data, range types, and more.
-- Index geographical coordinates
CREATE INDEX idx_locations_geo ON locations USING GIST(coordinates);
-- Find locations within a bounding box
SELECT * FROM locations
WHERE coordinates && ST_MakeEnvelope(-122.5, 37.5, -122.0, 38.0);
-- Range types
CREATE INDEX idx_reservations_dates ON reservations USING GIST(daterange);
Bitmap Indexes
Bitmap indexes use bit arrays to represent values, excellent for columns with low cardinality in data warehouses.
-- Oracle bitmap index
CREATE BITMAP INDEX idx_sales_region ON sales(region);
-- Efficient for OLAP workloads with repeated values
SELECT COUNT(*) FROM sales WHERE region = 'US-West';
Composite Indexes
Creating Multi-Column Indexes
Composite indexes combine multiple columns into a single index structure, powerful when querying multiple columns together.
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date DESC);
Understanding Index Column Order
Column order in composite indexes is critical. The leftmost principle states that the index can only be used effectively for conditions on the leftmost columns.
Given the index (a, b, c):
| Query | Uses Index? |
|---|---|
| WHERE a = 1 | Yes (full) |
| WHERE a = 1 AND b = 2 | Yes (full) |
| WHERE a = 1 AND b = 2 AND c = 3 | Yes (full) |
| WHERE b = 2 | No |
| WHERE b = 2 AND c = 3 | No |
| WHERE a = 1 AND c = 3 | Partial (only a) |
Covering Indexes
A covering index includes all columns needed by a query, allowing the database to satisfy the query entirely from the index without accessing the table.
-- Without covering index: 2 index lookups
SELECT name, email FROM users WHERE user_id = 123;
-- With covering index: 1 index lookup
CREATE INDEX idx_users_covering ON users(user_id) INCLUDE (name, email);
Indexing Strategies by Use Case
High-Cardinality String Columns
For columns with many unique values (like email, UUIDs), B-tree indexes work well:
-- Email is high cardinality
CREATE INDEX idx_users_email ON users(email);
-- Use hash for very long strings in PostgreSQL
CREATE INDEX idx_users_email_hash ON users USING hash(email);
Low-Cardinality Boolean Columns
Boolean columns have only two valuesโindexing them rarely helps:
-- Don't do this
CREATE INDEX idx_users_active ON users(is_active);
-- Instead, use partial indexes
CREATE INDEX idx_users_active ON users(id) WHERE is_active = true;
Time-Series Data
For timestamps, consider time-based partitioning and indexes:
-- Composite index with timestamp
CREATE INDEX idx_events_time_type
ON events(event_time DESC, event_type);
-- Partial index for recent data (hot path)
CREATE INDEX idx_events_recent
ON events(event_time DESC)
WHERE event_time > NOW() - INTERVAL '30 days';
JSON Data
Modern applications often store JSON. Index it properly:
-- PostgreSQL JSONB indexing
CREATE INDEX idx_orders_json ON orders USING GIN(data);
-- Query specific JSON fields
SELECT * FROM orders WHERE data->>'customer_id' = '123';
-- More efficient: expression index
CREATE INDEX idx_orders_customer_id
ON orders((data->>'customer_id'));
SELECT * FROM orders WHERE data->>'customer_id' = '123';
Query Optimization Patterns
Analyzing Query Execution
Always examine how your queries use indexes:
-- PostgreSQL: Explain query plan
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = '[email protected]';
-- MySQL: Show execution plan
EXPLAIN FORMAT=JSON
SELECT * FROM users WHERE email = '[email protected]';
Look for these indicators:
- Seq Scan on table: Table scan, no index used
- Index Scan using idx_xxx: Good, index is being used
- Index Only Scan: Best, covering index
- Bitmap Index Scan + Bitmap Heap: Multiple index usage
Partial Indexes
Create indexes for a subset of rows:
-- Only index active users (likely 90% of queries)
CREATE INDEX idx_users_active_email
ON users(email)
WHERE status = 'active';
-- Index only large orders
CREATE INDEX idx_orders_large
ON orders(order_date DESC)
WHERE total_amount > 10000;
Functional Indexes
Index the result of functions or expressions:
-- Case-insensitive search
CREATE INDEX idx_users_email_lower
ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
-- Computed columns
CREATE INDEX idx_orders_year
ON orders(EXTRACT(YEAR FROM order_date));
-- Domain-specific: full name search
CREATE INDEX idx_users_name_search
ON users((first_name || ' ' || last_name));
Advanced Indexing Techniques
Index-Organized Tables
Store the entire table in the index structure:
-- Oracle IOT
CREATE TABLE users (
user_id PRIMARY KEY,
email,
name
) ORGANIZATION INDEX;
-- MySQL InnoDB primary key is always an IOT
Reverse Key Indexes
Useful for sequence-generated keys that insert sequentially:
-- Oracle reverse key index
CREATE INDEX idx_orders_id_rev ON orders(order_id) REVERSE;
Prevents index contention at the rightmost block when inserting sequential IDs.
Composite Index Design Patterns
Pattern 1: Equality + Range
-- Queries: WHERE status = 'active' AND created_at > '2024-01-01'
CREATE INDEX idx_orders_status_date
ON orders(status, created_at DESC);
Pattern 2: Multiple Equality Conditions
-- Queries: WHERE category = 'electronics' AND status = 'published'
CREATE INDEX idx_products_cat_status
ON products(category, status);
Pattern 3: Sort-Heavy Queries
-- Queries: WHERE published = true ORDER BY created_at DESC
CREATE INDEX idx_articles_published_date
ON articles(published, created_at DESC);
Clustered vs Non-Clustered Indexes
-- SQL Server: Clustered index determines physical order
CREATE CLUSTERED INDEX idx_users_email ON users(email);
-- Only one clustered index per table
-- Choose wisely based on your most common query pattern
Common Indexing Mistakes
Mistake 1: Indexing Every Column
-- Don't do this
CREATE INDEX idx_users_all ON users(id, name, email, phone, address, created_at);
-- Instead, create targeted indexes based on actual query patterns
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_phone ON users(phone);
Mistake 2: Ignoring Query Patterns
-- If you frequently query:
WHERE LOWER(email) = '[email protected]'
-- Regular index won't help
CREATE INDEX idx_users_email ON users(email);
-- You need a functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
Mistake 3: Not Considering Write Overhead
Every index slows down INSERT, UPDATE, and DELETE operations:
-- 5 indexes = 5x write overhead
-- Balance read performance vs write performance
-- Consider: Do you really need all these indexes?
-- Monitor write latency after adding indexes
Mistake 4: Using VARCHAR as Primary Key
-- Bad: String primary key
CREATE TABLE users (
email VARCHAR(255) PRIMARY KEY, -- Slow inserts, large index
...
);
-- Good: Integer surrogate key
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE, -- Separate unique constraint
...
);
Index Maintenance
Monitoring Index Usage
-- PostgreSQL: Find unused indexes
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey%';
Identifying Bloat
-- PostgreSQL: Check index bloat
SELECT
schemaname || '.' || tablename AS table_name,
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
(100 * idx_scan / (seq_scan + idx_scan + 1))::INT AS hit_ratio
FROM pg_stat_user_indexes
WHERE idx_scan > 0
ORDER BY hit_ratio ASC;
Rebuilding Indexes
-- PostgreSQL: Rebuilder (concurrent, no lock)
REINDEX INDEX CONCURRENTLY idx_users_email;
-- MySQL: Optimize table (rebuilds indexes)
OPTIMIZE TABLE users;
-- SQL Server: Rebuild index
ALTER INDEX idx_users_email ON users REBUILD;
Database-Specific Indexing
PostgreSQL
-- Partial index
CREATE INDEX idx_users_active ON users(email) WHERE status = 'active';
-- JSON path index
CREATE INDEX idx_orders_customers ON orders((data->>'customer_id'));
-- BRIN index for time-series (much smaller)
CREATE INDEX idx_events_time_brin ON events USING BRIN(event_time);
MySQL
-- Prefix indexes for long strings
CREATE INDEX idx_users_name ON users(name(10));
-- Composite index
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
-- Invisible index (for testing)
CREATE INDEX idx_users_email ON users(email) INVISIBLE;
MongoDB
// Single field index
db.users.createIndex({ email: 1 });
// Compound index
db.orders.createIndex({ customerId: 1, createdAt: -1 });
// Multikey index (arrays)
db.products.createIndex({ tags: 1 });
// Text index
db.articles.createIndex({ content: "text" });
Performance Benchmarks
Impact of Indexing
| Query Type | Without Index | With Index |
|---|---|---|
| Point lookup (1M rows) | ~500ms | ~2ms |
| Range query | ~800ms | ~15ms |
| JOIN with indexed FK | ~2000ms | ~50ms |
| ORDER BY | ~600ms | ~5ms |
Index Size Estimation
-- PostgreSQL: Index size
SELECT pg_size_pretty(pg_relation_size('idx_users_email'));
-- MySQL: Index length
SHOW INDEX FROM users;
-- Rule of thumb: index is 10-30% of table size for B-tree
Best Practices Summary
- Analyze Query Patterns: Use EXPLAIN to understand actual usage
- Start with Primary Keys: Ensure foreign keys are indexed
- Composite Index Column Order: Put equality conditions first, range conditions last
- Use Covering Indexes: Include all SELECTed columns
- Consider Partial Indexes: Index only frequently accessed subsets
- Monitor and Maintain: Remove unused indexes regularly
- Balance Read/Write: More indexes = faster reads, slower writes
- Test in Staging: Benchmark before production deployment
Conclusion
Database indexing is both an art and a science. The right index can reduce query time from seconds to milliseconds, while the wrong index wastes storage and slows down writes.
Start by understanding your actual query patterns, use EXPLAIN to verify index usage, and iterate based on real performance metrics. Remember that indexing is not a one-time taskโcontinuous monitoring and refinement are essential for maintaining optimal database performance as your application evolves.
With these strategies in your toolkit, you’re well-equipped to tackle even the most demanding query optimization challenges.
Comments