
MongoDB Index Operations: Building, Benchmarking & Optimization

Chapter 3: Index Operations

Created: June 12, 2021 · 18 min read

Introduction

MongoDB relies on indexes to serve queries efficiently. Without an index, MongoDB must scan every document in a collection (a collection scan) to find matching documents. A well-designed index lets MongoDB locate documents in logarithmic time and satisfy sorts without loading documents into memory. This article covers the full lifecycle of MongoDB index operations: how to build indexes, how MongoDB selects them, how to interpret explain output, how to benchmark index performance, and which index type to use for each query pattern.

Index Build Methods

MongoDB provides several strategies for building indexes, each with different availability and performance characteristics.

Hybrid Index Builds (MongoDB 4.2+)

Starting in MongoDB 4.2, all index builds use a hybrid approach that holds an exclusive lock only at the beginning and end of the build process. During the build, MongoDB allows read and write operations on the collection.

// Default index build in 4.2+ — hybrid build, no background option needed
db.movies.createIndex({ title: 1 })

// The {background: true} option is now ignored in 4.2+
// MongoDB always uses the hybrid build
db.movies.createIndex({ title: 1 }, { background: true })

The hybrid build works in three phases:

  1. Start phase: MongoDB takes an exclusive (X) lock on the collection, records a list of all documents currently in the collection, and then releases the lock.
  2. Build phase: MongoDB builds the index on the documents captured in the start phase. Concurrent writes are tracked in a side queue.
  3. Commit phase: MongoDB takes another exclusive lock, applies the queued writes, and commits the new index.

This design means MongoDB no longer blocks the database for the duration of large index builds. Applications continue serving reads and writes during the build phase.
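The three phases can be sketched in a few lines of Python. This is an illustrative toy model, not MongoDB's implementation: the snapshot, side queue, and commit-time apply are the ideas being shown, and the single-field `title` index is an assumption for brevity.

```python
# Toy model of a hybrid index build: writes arriving during the build
# phase land in a side queue and are applied at commit time.
def hybrid_build(collection, side_queue):
    # Start phase: snapshot the documents present when the build begins
    # (the exclusive lock is held only for this moment)
    snapshot = list(collection)
    index = sorted(doc["title"] for doc in snapshot)
    # Build phase happens on the snapshot; concurrent writes were queued.
    # Commit phase: apply queued writes under a second brief lock.
    for op, doc in side_queue:
        if op == "insert":
            index.append(doc["title"])
        elif op == "delete":
            index.remove(doc["title"])
    return sorted(index)

docs = [{"title": "Alien"}, {"title": "Up"}]
queued = [("insert", {"title": "Brazil"}), ("delete", {"title": "Up"})]
hybrid_build(docs, queued)   # → ['Alien', 'Brazil']
```

The side queue is why writes stay available during the build: they never wait on the index under construction.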

Foreground vs Background Builds (Pre-4.2)

Before version 4.2, MongoDB offered two build modes:

// Foreground (default) — blocks all read/write on the database
db.movies.createIndex({ title: 1 })

// Background — allows reads and writes but builds more slowly
db.movies.createIndex({ title: 1 }, { background: true })

Foreground builds held an exclusive lock on the database for the entire build duration. They were faster because MongoDB could dedicate resources to the build, but no other operation could execute on that database.

Background builds yielded the lock periodically to allow other operations to proceed. This kept the database available but the build took longer and could impact system performance due to context switching.

In MongoDB 4.2+, both options produce the same result: the hybrid index build. The background option is accepted but ignored.

Rolling Index Builds for Replica Sets

On a replica set, you can build indexes with zero application downtime by using a rolling index build. The strategy is to build the index on each secondary member individually, then step down the primary and build on the former primary.

# On each secondary node (one at a time):
# 1. Stop the mongod process
# 2. Restart as a standalone on a different port
# 3. Build the index
mongod --port 37017 --dbpath /data/db
mongo --port 37017
> db.movies.createIndex({ title: 1 })
# 4. Shut down standalone and restart as replica set member
# 5. Wait for replication to catch up
# 6. Repeat on the next secondary

After all secondaries have the new index, step down the primary:

# From the primary:
> rs.stepDown()
# Build the index on the new secondary (former primary)

This approach avoids any replica set election during the build and keeps the application fully available.

For replica sets running MongoDB 4.2+, hybrid builds run in the background automatically and do not require rolling builds for availability. Rolling builds still benefit very large collections where the build phase might exhaust system resources.

// On a replica set in 4.2+, you can also use:
db.movies.createIndex({ title: 1 }, { commitQuorum: "majority" })

The commitQuorum option controls how many replica set members must finish building the index before the build is considered complete. This prevents a rollback of the index after failover.

Query Plans: How MongoDB Selects Indexes

When MongoDB receives a query, it does not simply execute it. It first determines the most efficient way to satisfy it: the query planner generates candidate plans, runs a brief trial among them, and selects the winner.

Plan Generation

For each query shape (a combination of query filter, sort, and projection), the query planner considers every index that could possibly satisfy part of the query. It generates one or more candidate plans.

// Create indexes for the query planner to consider
db.restaurants.createIndex({ "address.zipcode": 1 })
db.restaurants.createIndex({ "address.zipcode": 1, cuisine: 1 })
db.restaurants.createIndex({ cuisine: 1, "address.zipcode": 1 })
db.restaurants.createIndex({ stars: -1 })

// Query that triggers plan selection
db.restaurants.find({
  "address.zipcode": { $gt: "50000" },
  cuisine: "Sushi"
}).sort({ stars: -1 })

The query planner identifies these candidate indexes and creates a plan for each one. It also includes a plan that performs a collection scan.

Plan Trial and Winner Selection

MongoDB executes each candidate plan for a short trial period, interleaving their execution. The first plan to return 101 results (or reach a time limit) wins. The winning plan is cached and used for subsequent queries of the same shape.

// View the winning plan and rejected plans
db.restaurants.find({
  "address.zipcode": { $gt: "50000" },
  cuisine: "Sushi"
}).sort({ stars: -1 }).explain("queryPlanner")

The trial period is short by design — MongoDB favors getting results quickly. This means the winning plan is not guaranteed to be optimal for the entire result set. A plan that fetches the first 101 documents quickly might degrade when sorting millions of documents.
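The trial mechanics can be sketched as a round-robin race. This is a simplification under stated assumptions: real servers also bound the trial by work units and time, and the per-plan "documents per work unit" numbers below are made up for illustration.

```python
# Illustrative sketch: candidate plans execute interleaved, and the
# first to produce 101 results wins the trial.
def pick_winner(plans, target=101):
    produced = {name: 0 for name in plans}
    while True:
        for name, docs_per_work_unit in plans.items():
            produced[name] += docs_per_work_unit   # one "work unit" each
            if produced[name] >= target:
                return name

# Hypothetical productivity of each candidate per work unit
candidates = {"ixscan_zip": 5, "ixscan_cuisine": 2, "collscan": 1}
pick_winner(candidates)   # → 'ixscan_zip'
```

The sketch also shows why the winner can be wrong for large result sets: it rewards whichever plan reaches the first 101 documents fastest, not whichever finishes the whole query fastest.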

Plan Cache

Once MongoDB picks a winning plan, it caches the plan for that query shape. The cache maps query shapes to plans, so the next identical query skips the trial phase.

// View the plan cache entries for a collection
db.restaurants.getPlanCache().list()

// Clear the plan cache (forces re-evaluation on next query)
db.restaurants.getPlanCache().clear()
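The shape-to-plan mapping can be sketched as a dictionary keyed by a normalized shape. This is an assumption about the concept, not MongoDB's exact key format: the point is that field names and operator types survive, while literal values are discarded.

```python
# Sketch of a query-shape cache key: {cuisine: "Thai"} and
# {cuisine: "Sushi"} normalize to the same shape, so they share
# one plan cache entry.
def query_shape(query):
    shape = {}
    for field, pred in sorted(query.items()):
        if isinstance(pred, dict):
            shape[field] = tuple(sorted(pred))   # keep operators, e.g. ('$gt',)
        else:
            shape[field] = "$eq"                 # plain equality predicate
    return tuple(shape.items())

q1 = {"cuisine": "Sushi", "address.zipcode": {"$gt": "50000"}}
q2 = {"cuisine": "Thai",  "address.zipcode": {"$gt": "90000"}}
query_shape(q1) == query_shape(q2)   # → True
```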

Plan Revision

The plan cache entry records how the winning plan performs over time. MongoDB uses a mechanism commonly called replanning to detect when a cached plan starts performing poorly.

When the cached plan performs substantially more work than it did during its original trial, MongoDB evicts it and reruns the full trial. This can happen after data distribution changes, index drops, or the addition of new indexes.
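The eviction check reduces to a single comparison. As a hedged sketch: the multiplier corresponds to the server parameter internalQueryCacheEvictionRatio (default 10), but the "works" accounting here is simplified to plain integers.

```python
# Sketch of the replanning trigger: evict the cached plan once it
# performs far more work than it did during its original trial.
def should_evict(current_works, trial_works, eviction_ratio=10):
    return current_works > trial_works * eviction_ratio

should_evict(current_works=500, trial_works=40)   # → True  (500 > 400)
should_evict(current_works=120, trial_works=40)   # → False (120 <= 400)
```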

// Clear the cached plan for one query shape without touching the rest
db.runCommand({
  planCacheClear: "restaurants",
  query: {
    "address.zipcode": { $gt: "50000" },
    cuisine: "Sushi"
  }
})

You should rarely need to clear the plan cache manually, but it is useful during index migrations. After dropping old indexes and creating new ones, clearing the cache forces the planner to re-evaluate all plans.

Explain Output: Anatomy and Interpretation

The explain() command is your primary tool for understanding how MongoDB executes a query. It takes three verbosity modes.

Verbosity Modes

// Default: shows plan selection but no execution stats
db.restaurants.find({ cuisine: "Sushi" }).explain()

// queryPlanner mode (same as default)
db.restaurants.find({ cuisine: "Sushi" }).explain("queryPlanner")

// executionStats: includes actual execution metrics
db.restaurants.find({ cuisine: "Sushi" }).explain("executionStats")

// allPlansExecution: execution stats for all plans, not just the winner
db.restaurants.find({ cuisine: "Sushi" }).explain("allPlansExecution")

Use "executionStats" for day-to-day query tuning. Use "allPlansExecution" when you want to compare the winning plan against rejected plans to understand whether the planner made the right choice.

Explain Output Anatomy

db.restaurants.explain("executionStats").find({
  "address.state": "NY",
  stars: { $gt: 3, $lt: 4 }
}).sort({ name: 1 })

This query produces the following output structure:

{
  "queryPlanner": {
    "plannerVersion": 1,
    "namespace": "m201.restaurants",
    "indexFilterSet": false,
    "parsedQuery": {
      // Normalized query predicates
    },
    "winningPlan": {
      "stage": "SORT",
      "sortPattern": { "name": 1 },
      "inputStage": {
        "stage": "SORT_KEY_GENERATOR",
        "inputStage": {
          "stage": "FETCH",
          "inputStage": {
            "stage": "IXSCAN",
            "keyPattern": { "address.state": 1, "stars": 1, "name": 1 },
            "indexName": "address.state_1_stars_1_name_1",
            "isMultiKey": false,
            "isUnique": false,
            "isSparse": false,
            "isPartial": false,
            "indexVersion": 2,
            "direction": "forward",
            "indexBounds": {
              "address.state": ["[\"NY\", \"NY\"]"],
              "stars": ["(3.0, 4.0)"],
              "name": ["[MinKey, MaxKey]"]
            }
          }
        }
      }
    },
    "rejectedPlans": []
  },
  "executionStats": {
    "executionSuccess": true,
    "nReturned": 3335,
    "executionTimeMillis": 20,
    "totalKeysExamined": 3335,
    "totalDocsExamined": 3335,
    "executionStages": {
      "stage": "SORT",
      "nReturned": 3335,
      "executionTimeMillisEstimate": 15,
      "inputStage": {
        "stage": "SORT_KEY_GENERATOR",
        "inputStage": {
          "stage": "FETCH",
          "totalDocsExamined": 3335,
          "inputStage": {
            "stage": "IXSCAN",
            "nReturned": 3335,
            "totalKeysExamined": 3335,
            "seeks": 1
          }
        }
      }
    }
  },
  "serverInfo": {
    "host": "hostname",
    "port": 27017,
    "version": "7.0.0"
  },
  "ok": 1
}

Explain Field Reference

| Field | Description | Typical Values | What to Look For |
| --- | --- | --- | --- |
| winningPlan.stage | Top-level operation | IXSCAN, COLLSCAN, FETCH, SORT, SHARD_MERGE, OR, AND_SORTED, IDHACK | COLLSCAN means no usable index was found |
| winningPlan.inputStage.stage | Child operation | IXSCAN, FETCH, SORT_KEY_GENERATOR | IXSCAN confirms MongoDB used an index |
| rejectedPlans | Plans that lost the trial | Array of plan objects | Empty array means only one plan was viable |
| executionStats.executionSuccess | Whether the query executed | true / false | false means the query failed |
| executionStats.nReturned | Documents returned | Integer | Should match your count() |
| executionStats.executionTimeMillis | Total execution time | Integer | Primary performance metric |
| executionStats.totalKeysExamined | Index key entries traversed | Integer | Should be close to nReturned |
| executionStats.totalDocsExamined | Documents fetched from disk | Integer | A large value relative to nReturned suggests poor selectivity or a covered-query opportunity |
| executionStats.executionStages.seeks | Times the index cursor repositioned | Integer | High seeks with low nReturned suggests a poorly selective index |
| indexBounds | Key ranges scanned | Range expressions | Should match your query predicates |
| isMultiKey | Whether the index contains array values | true / false | Multikey indexes cannot fully cover queries on array fields |

Key Ratios to Calculate

Keys Examined to Docs Returned: Divide totalKeysExamined by nReturned. A ratio close to 1.0 means the index is highly selective. A ratio above 100 means the index is filtering poorly and you should consider a different index.

Docs Examined to Docs Returned: Divide totalDocsExamined by nReturned. If this is close to 1.0 and totalKeysExamined is significantly higher, the index is selective on the key but requires many document fetches. Consider a covered query by adding all required fields to the index.
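Both ratios can be computed directly from an explain document. A small helper, assuming the standard executionStats field names shown earlier:

```python
# Compute the two diagnostic ratios from explain()'s executionStats.
def index_ratios(execution_stats):
    n = execution_stats["nReturned"] or 1   # avoid division by zero
    return {
        "keys_per_doc_returned": execution_stats["totalKeysExamined"] / n,
        "docs_per_doc_returned": execution_stats["totalDocsExamined"] / n,
    }

stats = {"nReturned": 3335, "totalKeysExamined": 3335, "totalDocsExamined": 3335}
index_ratios(stats)
# → {'keys_per_doc_returned': 1.0, 'docs_per_doc_returned': 1.0}
```

A result of 1.0 on both ratios, as in the explain output above, indicates a highly selective index.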

Forcing Indexes with hint()

MongoDB’s query planner is excellent for most workloads, but it sometimes chooses a suboptimal plan. The planner is cost-based, not rule-based, and its trial period favors plans that return the first results quickly.

When you know which index should be used, override the planner with hint():

// Force MongoDB to use a specific index
db.people.find({
  name: "John Doe",
  zipcode: { $gt: "63000" }
}).hint({ name: 1, zipcode: 1 })

You can specify the index by key pattern or by index name:

// Hint by key pattern
db.people.find({ zipcode: { $gt: "63000" } }).hint({ zipcode: 1 })

// Hint by index name
db.people.find({ zipcode: { $gt: "63000" } }).hint("zipcode_1")

// Hint to force a collection scan (useful for debugging)
db.people.find({ zipcode: { $gt: "63000" } }).hint({ $natural: 1 })

Use hint() sparingly in production. Hard-coding an index hint makes your application brittle to index changes. When you drop or rename an index, any query that hints it will fail. The preferred approach is to tune the index selection by creating better indexes rather than overriding the planner.

// If hint() fails (index missing), you get this error:
// Error: error: {
//   "ok": 0,
//   "errmsg": "bad hint",
//   "code": 2
// }

A safer pattern is to use hint() in a staging or pre-production environment to validate your index strategy, then rely on the planner in production.

Index Resource Allocation

Indexes do not reside in a separate storage area. MongoDB stores indexes alongside data in the same storage engine (WiredTiger by default). Understanding how indexes consume memory is critical for sizing your deployment.

Indexes in RAM

WiredTiger uses the operating system’s filesystem cache for both data and indexes. The working set — the subset of data and indexes that your active queries touch — must fit in available RAM for consistent performance.

// Check index sizes
db.movies.stats().totalIndexSize

// Get per-index breakdown
db.movies.aggregate([
  { $indexStats: {} }
])

A query is covered when all the fields it requires — both the filter and the projection — exist in the index. A covered query reads from the index alone, never touching the documents themselves. This is the most efficient query pattern because it avoids loading documents from disk or cache.

// If the query only needs fields in the index, it's a covered query
db.movies.createIndex({ title: 1, year: 1, rating: 1 })

// This query is covered — only index fields are projected
db.movies.find(
  { title: "Inception" },
  { title: 1, year: 1, rating: 1, _id: 0 }
)

Check for covered queries in explain output: when the plan contains an IXSCAN with no FETCH stage above it (recent versions show a PROJECTION_COVERED stage), the query is covered, and totalDocsExamined is 0.
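That check is easy to automate against a winning plan. A minimal sketch, assuming a linear plan chain (it does not descend into branching `inputStages` arrays):

```python
# Walk an explain winningPlan tree and report whether the query is
# covered: an IXSCAN is present and no FETCH stage sits above it.
def is_covered(plan):
    stages = []
    node = plan
    while node:
        stages.append(node["stage"])
        node = node.get("inputStage")   # linear chains only
    return "IXSCAN" in stages and "FETCH" not in stages

covered  = {"stage": "PROJECTION_COVERED", "inputStage": {"stage": "IXSCAN"}}
fetching = {"stage": "FETCH", "inputStage": {"stage": "IXSCAN"}}
is_covered(covered), is_covered(fetching)   # → (True, False)
```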

Working Set Considerations

The working set size is the total amount of index and data pages that active queries access. When the working set exceeds available RAM, MongoDB must page data from disk on every access, causing latency spikes.

// View index access patterns over time
db.movies.aggregate([
  { $indexStats: {} },
  { $sort: { "accesses.ops": -1 } }
])

For production deployments, ensure that:

  • All indexes used by critical queries fit in RAM.
  • The total index size of your most-heavily-used collections is less than available memory.
  • Indexes with high write overhead (many fields, compound indexes on frequently-updated fields) are justified by read performance gains.
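The sizing check above is back-of-the-envelope arithmetic. A hedged sketch: the inputs would come from db.collection.stats() (totalIndexSize plus an estimate of the actively queried data), and the 0.8 headroom factor is an assumption, not a MongoDB constant.

```python
# Rough working-set fit check (all sizes in bytes).
def fits_in_ram(total_index_size, hot_data_size, ram_bytes, headroom=0.8):
    # Keep headroom for the filesystem cache and other processes;
    # WiredTiger's internal cache defaults to roughly half of RAM.
    return (total_index_size + hot_data_size) <= ram_bytes * headroom

GiB = 1024 ** 3
fits_in_ram(total_index_size=4 * GiB, hot_data_size=20 * GiB,
            ram_bytes=64 * GiB)   # → True  (24 GiB <= 51.2 GiB)
```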

Benchmarking Methodology

Benchmarking MongoDB index performance requires isolating variables and using realistic workloads. The tools fall into three categories based on what they measure.

Low-Level Benchmarking

Low-level benchmarks measure raw hardware and database engine throughput.

sysbench runs simple read, write, and mixed workloads against a database:

# Install sysbench on Ubuntu
sudo apt-get install sysbench

# Prepare a table with 1 million rows
sysbench /usr/share/sysbench/oltp_read_write.lua \
  --table-size=1000000 \
  --db-driver=pgsql \
  --pgsql-host=127.0.0.1 \
  prepare

# Run read-only benchmark
sysbench /usr/share/sysbench/oltp_read_only.lua \
  --table-size=1000000 \
  --threads=16 \
  --time=60 \
  run

iibench-mongodb measures insert and query throughput specifically for MongoDB:

# Clone iibench-mongodb
git clone https://github.com/mongodb/iibench-mongodb.git
cd iibench-mongodb

# Run a benchmark with 10 million documents
python iibench.py \
  --host localhost \
  --port 27017 \
  --db iibench \
  --collection test \
  --numdocs 10000000 \
  --batchsize 1000 \
  --runtime 300 \
  2>&1 | tee iibench_results.log

Database Server Benchmarking

These tools test the database server under application-like workloads.

YCSB (Yahoo Cloud Serving Benchmark) supports multiple databases including MongoDB:

# Clone YCSB
git clone https://github.com/brianfrankcooper/YCSB.git
cd YCSB

# Load 1 million records with MongoDB binding
./bin/ycsb load mongodb \
  -s \
  -P workloads/workloada \
  -p mongodb.url=mongodb://localhost:27017/ycsb \
  -p recordcount=1000000 \
  -threads 32

# Run a mixed read-write workload
./bin/ycsb run mongodb \
  -s \
  -P workloads/workloada \
  -p mongodb.url=mongodb://localhost:27017/ycsb \
  -p operationcount=100000 \
  -threads 32

YCSB workload types:

| Workload | Read % | Write % | Use Case |
| --- | --- | --- | --- |
| A | 50 | 50 (update) | Session store, update-heavy |
| B | 95 | 5 (update) | Content serving, read-heavy |
| C | 100 | 0 | Read-only caching tier |
| D | 95 | 5 (insert) | Read-latest (newly inserted) |
| E | 95 (scan) | 5 (insert) | Short range scans, e.g. threaded conversations |
| F | 50 | 50 (read-modify-write) | Read-modify-write counters |

TPC-H measures analytical query performance at scale:

# Generate TPC-H data at scale factor 1 (1 GB)
./dbgen -s 1

# Load into MongoDB and run benchmark queries
# Tools: mongodb-tpch (community projects)
python run_tpch_queries.py \
  --host localhost \
  --port 27017 \
  --scale-factor 1 \
  --output tpch_results.json

Distributed Systems Benchmarking

When benchmarking a replica set or sharded cluster, you must test consistency guarantees, not just throughput.

Jepsen tests partition tolerance and consistency:

# Jepsen requires Leiningen (Clojure build tool)
# Configuration is project-specific, but typically:
lein run test \
  --node-config nodes.edn \
  --workload list-append \
  --time-limit 300 \
  --concurrency 10n

HiBench includes MongoDB workloads for big data pipeline testing:

# HiBench runs on Hadoop but includes MongoDB adapters
./bin/workloads/mongodb/prepare/prepare.sh \
  --scale tiny \
  --operation scan

# Run benchmark
./bin/workloads/mongodb/mongodb/run.sh

Benchmarking Anti-Patterns

Benchmarking seems straightforward but contains subtle traps that invalidate results.

Using Incorrect Database Variations

The most common anti-pattern: benchmarking with the wrong version, wrong storage engine, or wrong configuration. A MongoDB 3.6 result using MMAPv1 has no relevance to a MongoDB 7.0 deployment using WiredTiger. Always benchmark your exact production configuration.

// Check the exact version and storage engine
db.version()
db.serverStatus().storageEngine

Ignoring Working Set Size

Running a benchmark with a dataset that fits entirely in RAM produces results that cannot be extrapolated to production. If production has a 500 GB working set on a 64 GB machine, benchmark with a working set that exceeds available memory.

Single-Shard Benchmarking for Sharded Deployments

Testing a single shard and multiplying throughput by the number of shards assumes linear scaling. It ignores the overhead of the mongos router, the config server, and cross-shard queries.

Missing Concurrency

Running queries one at a time in the mongo shell is not a benchmark. Use a tool like YCSB or iibench that drives concurrent workloads.

# Bad: single-threaded shell benchmark
mongo --eval "for (var i = 0; i < 100000; i++) db.test.insert({x: i})"

# Good: concurrent benchmark with iibench
python iibench.py --host localhost --threads 16 --runtime 120

Benchmarking Without Indexes

Running a query without indexes and then repeating with indexes only tells you how bad a collection scan is. It does not help you choose between two candidate indexes. Always run comparative benchmarks on both candidates.

// Strategy: compare two candidate indexes
db.movies.createIndex({ year: 1, rating: -1 })
db.movies.createIndex({ rating: -1, year: 1 })

// Benchmark each with explain
db.movies.explain("executionStats").find({ year: 2010, rating: { $gt: 7 } })
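The comparison itself can be scripted once you have both explain documents. A sketch under stated assumptions: the scoring rule (execution time first, then key selectivity) is one reasonable policy, not a MongoDB-defined one, and the numbers below are illustrative.

```python
# Pick the better of two candidate indexes from their executionStats.
def better_candidate(name_a, stats_a, name_b, stats_b):
    def score(s):
        # Lower is better on both components
        selectivity = s["totalKeysExamined"] / max(s["nReturned"], 1)
        return (s["executionTimeMillis"], selectivity)
    return name_a if score(stats_a) <= score(stats_b) else name_b

a = {"executionTimeMillis": 12, "totalKeysExamined": 480,  "nReturned": 400}
b = {"executionTimeMillis": 35, "totalKeysExamined": 9100, "nReturned": 400}
better_candidate("year_1_rating_-1", a, "rating_-1_year_1", b)
# → 'year_1_rating_-1'
```

Run each candidate with hint() when collecting the stats, so the planner's own choice does not mask the comparison.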

Advanced Index Types

Beyond standard compound indexes, MongoDB offers several index types for specialized query patterns.

Partial Indexes

A partial index only indexes documents that match a filter expression. This produces a smaller index and faster queries when you only need a subset of documents.

// Only index documents where age >= 21 (useful for age-gated queries)
db.users.createIndex(
  { "address.state": 1, "age": 1 },
  { partialFilterExpression: { age: { $gte: 21 } } }
)

// This query uses the partial index
db.users.find({ "address.state": "CA", age: { $gte: 21 } })

// This query CANNOT use the partial index (age filter doesn't match)
db.users.find({ "address.state": "CA", age: { $lt: 21 } })

Partial indexes are useful for soft-delete patterns, multi-tenant systems with active/inactive status, and any workload where most queries target a subset of the data.
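The rejection rule above can be modeled for the simple one-field case. This is a deliberately narrow sketch: it handles only `{field: {$gte: n}}` predicates, whereas the real planner performs general predicate subsumption.

```python
# Why the planner rejects a partial index: every document the query can
# match must also be present in the index. Modeled for $gte filters only.
def partial_index_usable(query_gte, filter_gte):
    # The query's lower bound must be at least the index filter's bound.
    return query_gte >= filter_gte

partial_index_usable(query_gte=21, filter_gte=21)   # → True  (age >= 21)
partial_index_usable(query_gte=18, filter_gte=21)   # → False (would miss 18-20)
```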

Sparse Indexes

A sparse index only indexes documents that contain the indexed field. Documents missing the field are excluded from the index.

// Only index documents that have a "twitter" field
db.users.createIndex({ twitter: 1 }, { sparse: true })

// This query can use the sparse index
db.users.find({ twitter: "example" })

// This query cannot use the sparse index — matching documents are absent from it
db.users.find({ twitter: { $exists: false } })

In MongoDB 3.2+, partial indexes (partialFilterExpression) supersede sparse indexes in most cases; note that 2dsphere and text indexes are already sparse by default.

TTL Indexes

A TTL (time-to-live) index automatically removes documents from a collection after a specified number of seconds. The indexed field must be a date or an array of dates.

// Automatically delete documents 24 hours after their createdAt timestamp
db.sessions.createIndex(
  { createdAt: 1 },
  { expireAfterSeconds: 86400 }
)

// Insert a document that will expire in 24 hours
db.sessions.insertOne({
  sessionId: "abc123",
  data: { user: "admin", token: "..." },
  createdAt: new Date()
})

TTL indexes do not provide guaranteed deletion at the exact second. MongoDB runs a background job every 60 seconds that checks for expired documents. Deletion times can vary by up to one minute.
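The lag is a direct consequence of the periodic pass. A toy model (timestamps as plain integers, the pass itself returning the survivors):

```python
# Toy TTL monitor pass: delete documents whose expiry time has passed.
# Because the pass runs every 60 seconds, a document can linger up to
# a minute beyond its nominal expiry.
def ttl_pass(docs, expire_after_seconds, now):
    return [d for d in docs if d["createdAt"] + expire_after_seconds > now]

sessions = [{"sessionId": "a", "createdAt": 0},
            {"sessionId": "b", "createdAt": 100_000}]
ttl_pass(sessions, expire_after_seconds=86_400, now=90_000)
# → only session "b" survives (session "a" expired at t=86,400)
```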

// Change the TTL on an existing index
db.runCommand({
  collMod: "sessions",
  index: {
    keyPattern: { createdAt: 1 },
    expireAfterSeconds: 3600  // changed to 1 hour
  }
})

// Check TTL monitor activity (passes and deleted-document counters)
db.serverStatus().metrics.ttl

Hidden Indexes

Hidden indexes let you drop an index from consideration by the query planner without actually deleting it. This is useful for safely testing index removal in production.

// Create an index
db.movies.createIndex({ title: 1, year: 1 })

// Hide it: queries will not use it, but it still consumes storage
db.movies.hideIndex({ title: 1, year: 1 })

// Unhide it later
db.movies.unhideIndex({ title: 1, year: 1 })

The workflow for safe index removal:

  1. Hide the index and monitor for a week. If no query performance degrades, proceed.
  2. If performance degrades, unhide the index immediately — it is still maintained.
  3. If no degradation occurs after monitoring, drop the index.

// Safe removal workflow
db.movies.hideIndex("title_1_year_1")
// ... monitor for one week ...
// No issues found → drop
db.movies.dropIndex("title_1_year_1")

Hidden indexes are a zero-risk alternative to immediately dropping an index and discovering a week later that a critical report depends on it.

Index Intersection

When a single index cannot satisfy a query but multiple indexes each cover part of it, MongoDB can intersect two or more indexes to produce results.

db.orders.createIndex({ status: 1 })
db.orders.createIndex({ customerId: 1 })

// MongoDB may intersect both indexes to satisfy this query
db.orders.find({ status: "shipped", customerId: "ABC123" })

Index intersection is MongoDB’s fallback when no single compound index exists for a query shape. It is less efficient than a single compound index because it requires merging two index cursors. The explain output shows an AND_SORTED or AND_HASH stage for intersection (an OR stage appears for $or queries).

{
  "winningPlan": {
    "stage": "AND_SORTED",
    "inputStages": [
      { "stage": "IXSCAN", "indexName": "status_1" },
      { "stage": "IXSCAN", "indexName": "customerId_1" }
    ]
  }
}

Design your compound indexes to avoid index intersection. A compound index on { status: 1, customerId: 1 } is more efficient than MongoDB intersecting two separate indexes.
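The merge behind AND_SORTED is a classic sorted-list intersection, which also shows where the cost goes: both cursors must be advanced through keys that will never match. A sketch with made-up record IDs:

```python
# Intersect two ascending streams of record IDs, one per index cursor,
# by advancing whichever cursor is behind.
def and_sorted(ids_a, ids_b):
    out, i, j = [], 0, 0
    while i < len(ids_a) and j < len(ids_b):
        if ids_a[i] == ids_b[j]:
            out.append(ids_a[i]); i += 1; j += 1
        elif ids_a[i] < ids_b[j]:
            i += 1
        else:
            j += 1
    return out

status_shipped = [2, 5, 7, 9, 12]   # hypothetical record IDs from status_1
customer_abc   = [1, 5, 9, 10, 12]  # hypothetical record IDs from customerId_1
and_sorted(status_shipped, customer_abc)   # → [5, 9, 12]
```

A compound index on { status: 1, customerId: 1 } seeks straight to the matching keys instead, with no merge at all.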

Index Management Commands

// List all indexes on a collection
db.movies.getIndexes()

// View index usage statistics
db.movies.aggregate([{ $indexStats: {} }])

// Create a compound index with options
db.orders.createIndex(
  { status: 1, createdAt: -1 },
  {
    name: "status_createdAt_desc",
    unique: false,
    partialFilterExpression: { status: { $ne: "cancelled" } }
  }
)

// Drop a specific index by name
db.orders.dropIndex("status_createdAt_desc")

// Drop all indexes except the _id index
db.orders.dropIndexes()

// Rebuild all indexes (deprecated; standalone-only in modern releases)
db.orders.reIndex()

// TTL monitor interval (global config)
db.adminCommand({ setParameter: 1, ttlMonitorSleepSecs: 30 })

// Compact indexes (reclaims disk space)
db.runCommand({ compact: "orders" })

Index Strategy Decision Guide

| Query Pattern | Best Index Type | Example |
| --- | --- | --- |
| Equality on one field | Single field | { status: 1 } |
| Equality + range | Compound, equality first | { status: 1, date: 1 } |
| Equality + sort | Compound covering both | { status: 1, name: 1 } with sort { name: 1 } |
| Multiple equalities | Compound, most selective field first | { customerId: 1, status: 1 } |
| Text search | Text index | { description: "text", title: "text" } |
| Geospatial queries | 2dsphere index | { location: "2dsphere" } |
| Filter on subset of docs | Partial index | partialFilterExpression: { active: true } |
| Automatic document expiry | TTL index | { createdAt: 1 } with { expireAfterSeconds: 86400 } |
| Present/absent field check | Sparse index | { twitter: 1 } with { sparse: true } |
| Large collection, safe removal test | Hidden index | hideIndex("index_name") |
| Query needing all fields from index only | Compound index (covered query) | Include all projected fields |
| Regex on prefix of string | Single field on the string | { username: 1 } with ^prefix regex |
| Count without fetching docs | Index-only count | Index must contain the query field |
| Aggregation pipeline sort | Index matching $match + $sort | { $match: { status: "A" } } then { $sort: { date: -1 } } |
