

MongoDB Data Modeling: Patterns

Created: June 12, 2021

Introduction

MongoDB’s document model is flexible, but flexibility without structure leads to poorly performing databases. Schema design patterns solve recurring data modeling problems — they are battle-tested approaches for document structure, access patterns, and query optimization.

This guide covers eight essential patterns: Attribute, Extended Reference, Subset, Computed, Schema Versioning, Bucket, Approximation, and Outlier. Each pattern solves a specific class of problem, from handling polymorphic fields to managing working set size. You will see concrete schema examples, query patterns, trade-offs, and real-world use cases for each.

The Cost of Poor Schema Design

Before diving into patterns, understand what bad schema design costs:

  • Unnecessary $lookup operations — joining across collections when embedding or an extended reference would serve better, or issuing one query per parent document (the classic N+1 problem)
  • Oversized working sets — documents carrying rarely-used fields force MongoDB to page through more data than needed, degrading cache performance
  • Repeated computations — running the same aggregation pipeline every time a page loads instead of storing pre-computed results
  • Schema migration downtime — every application deploy requiring a data migration script because the schema changed

Patterns exist to avoid these costs. Choose the right one for your access patterns and the trade-offs become acceptable.

Attribute Pattern

Description

The Attribute Pattern transforms a set of related fields that share the same characteristics into an array of key-value sub-documents. Instead of having many similar top-level fields, you store them as an array of { k: "<key>", v: "<value>" } pairs. This makes the schema polymorphic — documents in the same collection can have different sets of attributes.

Problem It Solves

You have fields that share a number of common characteristics (same data type, same semantic purpose) and you need to search across them. Without the pattern, you end up with sparse documents containing many fields where most are null.

When to Use

  • Fields that share the same characteristics — for example, product specifications, movie metadata, or event dates across multiple venues
  • Need to index and query across many similar fields
  • Schema where the set of possible attributes is large but any single document uses a small subset
  • Polymorphic data where different document types need different fields

Schema Example

Before the Attribute Pattern, a museum collection document might look like this:

// BAD: sparse fields, hard to query across event venues
{
  "_id": ObjectId("..."),
  "title": "Starry Night",
  "artist": "van Gogh",
  "date_acquisition": ISODate("1941-01-01"),
  "location": "Gallery 502",
  "on_display": true,
  "in_house": true,
  "events": {
    "moma": ISODate("1980-06-01"),
    "louvres": ISODate("1985-09-15"),
    "met": ISODate("1990-03-20")
  }
}

After applying the Attribute Pattern:

// GOOD: polymorphic, indexable, add venues without altering schema
{
  "_id": ObjectId("..."),
  "title": "Starry Night",
  "artist": "van Gogh",
  "location": "Gallery 502",
  "on_display": true,
  "in_house": true,
  "events": [
    { "k": "moma", "v": ISODate("1980-06-01") },
    { "k": "louvres", "v": ISODate("1985-09-15") },
    { "k": "met", "v": ISODate("1990-03-20") }
  ]
}

Query Example

Search across all event venues:

// Find all artworks that were ever exhibited at the Louvre
db.artworks.find({ "events.k": "louvres" })

// Find artworks exhibited at MOMA after 1985
db.artworks.find({
  "events": { $elemMatch: { k: "moma", v: { $gte: ISODate("1985-01-01") } } }
})

Index Creation

db.artworks.createIndex({ "events.k": 1, "events.v": 1 })

Trade-offs

  • Queries become slightly more complex — you must use $elemMatch or dot notation on array elements
  • Application code must translate between key-value pairs and display-ready fields (see the helper sketch after this list)
  • Loses the self-documenting nature of typed top-level fields — "moma" as a field name conveys meaning more clearly than { k: "moma" }
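
In practice that translation is a few lines of application code. A minimal sketch (the helper name is illustrative), reshaping the events array from the artwork example above into a plain object for display:

// Hypothetical helper: turn the k/v array back into a display-ready object
function attrsToObject(attrs) {
  const out = {}
  for (const { k, v } of (attrs || [])) {
    out[k] = v
  }
  return out
}

const artwork = db.artworks.findOne({ "title": "Starry Night" })
// => { moma: ISODate("1980-06-01"), louvres: ISODate("1985-09-15"), met: ISODate("1990-03-20") }
const eventsByVenue = attrsToObject(artwork.events)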

Real-World Use Case

Product catalogs where different categories have different specifications. A laptop has CPU, RAM, and storage specs; a shirt has size, material, and color. Instead of separate collections or documents with hundreds of nullable fields, use the attribute pattern:

db.products.insertOne({
  "sku": "LAP-001",
  "name": "UltraBook Pro",
  "category": "laptop",
  "specs": [
    { "k": "cpu", "v": "Intel i7-13700H" },
    { "k": "ram", "v": "32GB DDR5" },
    { "k": "storage", "v": "1TB NVMe" },
    { "k": "screen", "v": "14-inch 2880x1800" }
  ]
})
db.products.insertOne({
  "sku": "SHT-002",
  "name": "Cotton Oxford",
  "category": "shirt",
  "specs": [
    { "k": "material", "v": "100% Cotton" },
    { "k": "size", "v": "M" },
    { "k": "color", "v": "Navy" }
  ]
})

Extended Reference Pattern

Description

The Extended Reference Pattern copies frequently-accessed fields from a related collection into the current document. Instead of storing only a foreign key (supplier_id) and performing a lookup every time, you embed the most commonly accessed fields from the supplier directly in the product document.

Problem It Solves

Every time you display a product page, you also need supplier information — name, phone number, maybe a rating. Doing a $lookup for each product creates N+1 performance problems. But you also do not want to embed the entire supplier document (which might contain billing addresses, contract terms, etc.) because it bloats the working set.
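
For contrast, this is roughly the per-read join the pattern eliminates; a sketch assuming products carry only a supplier_id foreign key, as described above:

// Without the pattern: every product read runs a $lookup against suppliers
db.products.aggregate([
  { $match: { "sku": "WG-X200" } },
  { $lookup: {
      from: "suppliers",
      localField: "supplier_id",
      foreignField: "_id",
      as: "supplier"
  }},
  { $unwind: "$supplier" },
  { $project: { "name": 1, "price": 1, "supplier.name": 1, "supplier.phone": 1 } }
])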

When to Use

  • One-to-many relationships where the “many” side is frequently read together with a small subset of the “one” side’s data
  • Reference data that changes infrequently — supplier name, country code, category label
  • Reducing joins in read-heavy workloads

Schema Example

// suppliers collection (source of truth)
db.suppliers.insertOne({
  "_id": ObjectId("sup001"),
  "name": "Acme Corp",
  "contact_name": "Jane Doe",
  "phone": "+1-555-0123",
  "email": "[email protected]",
  "address": {
    "street": "123 Industrial Blvd",
    "city": "Detroit",
    "state": "MI",
    "zip": "48201"
  },
  "billing_terms": "NET30",
  "contract_expires": ISODate("2027-01-01")
})

// products collection (extended reference — carries frequently-used supplier fields)
db.products.insertOne({
  "_id": ObjectId("prod001"),
  "name": "Widget X-200",
  "price": 29.99,
  "sku": "WG-X200",
  "supplier": {
    "_id": ObjectId("sup001"),
    "name": "Acme Corp",
    "phone": "+1-555-0123"
  }
})

Query Example

Retrieving a product with supplier info requires no join:

db.products.find({ "sku": "WG-X200" }, {
  "name": 1,
  "price": 1,
  "supplier.name": 1,
  "supplier.phone": 1
})

When you need full supplier details, fall back to the source collection:

const product = db.products.findOne({ "sku": "WG-X200" })
const fullSupplier = db.suppliers.findOne({ "_id": product.supplier._id })

Handling Stale Data

The trade-off is redundancy. When a supplier changes their phone number, you must update it everywhere:

// Update the source
db.suppliers.updateOne(
  { "_id": ObjectId("sup001") },
  { $set: { "phone": "+1-555-9999" } }
)
// Update all product copies (batch or background job)
db.products.updateMany(
  { "supplier._id": ObjectId("sup001") },
  { $set: { "supplier.phone": "+1-555-9999" } }
)

Use a scheduled batch job or a change stream to propagate updates if consistency is critical.
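
A minimal sketch of the change-stream approach, assuming the Node.js driver (db is a Db handle from MongoClient); only the copied fields are propagated:

// Watch supplier updates and push the copied fields to every product that references them
const supplierStream = db.collection("suppliers").watch(
  [{ $match: { "operationType": "update" } }],
  { fullDocument: "updateLookup" }   // deliver the full post-update supplier document
)

supplierStream.on("change", async (change) => {
  const supplier = change.fullDocument
  await db.collection("products").updateMany(
    { "supplier._id": supplier._id },
    { $set: { "supplier.name": supplier.name, "supplier.phone": supplier.phone } }
  )
})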

Trade-offs

  • Data duplication — updates must propagate to every copy
  • Staleness risk — copies can drift from the source of truth
  • Application complexity — must decide when to use the embedded copy vs. the full source

Real-World Use Case

An e-commerce order system. Every order document embeds the customer’s shipping address and name at the time of purchase. The customer may change their address later, but the order should retain the address that was used for fulfillment:

db.orders.insertOne({
  "order_id": "ORD-2026-001",
  "items": [ { "sku": "WG-X200", "qty": 2, "price": 29.99 } ],
  "total": 59.98,
  "customer": {
    "_id": ObjectId("cust001"),
    "name": "Alice Smith",
    "shipping_address": {
      "street": "456 Oak Ave",
      "city": "Portland",
      "state": "OR",
      "zip": "97201"
    }
  },
  "created_at": ISODate("2026-04-20")
})

Subset Pattern

Description

The Subset Pattern splits a document into two parts: a frequently-accessed subset and a rarely-accessed full document. The “hot” portion stays in the primary collection (fits in working set memory), while the “cold” details live in a separate collection or as a secondary document.

Problem It Solves

The working set — the data and indexes MongoDB keeps in RAM for optimal performance — grows too large. When it exceeds available memory, MongoDB must page data from disk, degrading read performance. The solution is to reduce the size of frequently-accessed documents.

When to Use

  • Working set exceeds available RAM
  • Documents contain large fields (arrays of comments, full-text descriptions, images as base64) that are rarely accessed in the primary query
  • Scaling vertically (adding RAM) is expensive or impractical
  • Read-heavy workload where most queries only need a subset of fields

Schema Example

Product reviews are a classic case. A product page shows ratings, a snippet, and maybe 5 reviews. But the product might have 10,000 reviews.

// Primary collection — frequently accessed, small documents
db.products.insertOne({
  "_id": ObjectId("prod001"),
  "name": "UltraBook Pro",
  "price": 1299.99,
  "rating_avg": 4.5,
  "review_count": 10234,
  "top_reviews": [
    {
      "review_id": ObjectId("rev9999"),
      "rating": 5,
      "headline": "Best laptop I have owned",
      "author": "tech_guru",
      "created_at": ISODate("2026-04-01")
    },
    {
      "review_id": ObjectId("rev10000"),
      "rating": 4,
      "headline": "Great but battery could be better",
      "author": "nomad_dev",
      "created_at": ISODate("2026-04-15")
    }
  ]
})

// Secondary collection — full details, less frequently accessed
db.reviews.insertOne({
  "_id": ObjectId("rev9999"),
  "product_id": ObjectId("prod001"),
  "rating": 5,
  "headline": "Best laptop I have owned",
  "body": "I have been using this for three months. The build quality is exceptional, the screen is gorgeous, and it handles Docker, Kubernetes, and multiple VMs without breaking a sweat...",
  "author": "tech_guru",
  "helpful_count": 342,
  "created_at": ISODate("2026-04-01")
})

Query Example

Pagination for full reviews (cold data):

db.reviews.find({ "product_id": ObjectId("prod001") })
  .sort({ "created_at": -1 })
  .skip(20)
  .limit(10)

Index Strategy

// Primary collection index — light, fits in RAM
db.products.createIndex({ "rating_avg": -1 })

// Secondary collection index — supports pagination
db.reviews.createIndex({ "product_id": 1, "created_at": -1 })

Trade-offs

  • The application must manage two collections and decide what belongs in the subset vs. the full document
  • Pagination requires a query to the secondary collection, which may hit disk
  • Consistency concern: the review_count, rating_avg, and top_reviews subset in the product document must be kept in sync with the reviews collection (see the write sketch after this list)
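
A write-path sketch for keeping the subset current, assuming the schema above; $push with $each, $sort, and $slice caps the embedded list at the five newest reviews (newReview is a placeholder for the incoming document):

// Write the full review, then refresh the product's embedded subset
db.reviews.insertOne(newReview)
db.products.updateOne(
  { "_id": newReview.product_id },
  {
    $inc: { "review_count": 1 },
    $push: {
      "top_reviews": {
        $each: [{
          "review_id": newReview._id,
          "rating": newReview.rating,
          "headline": newReview.headline,
          "author": newReview.author,
          "created_at": newReview.created_at
        }],
        $sort: { "created_at": -1 },
        $slice: 5          // keep only the five newest in the subset
      }
    }
  }
)

The rating_avg field can be maintained the same way with an incremental update, as described under the Computed Pattern.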

Real-World Use Case

E-commerce product listings. A search results page shows 50 products with name, price, rating, and thumbnail URL. That is the subset. The full product page includes descriptions, specifications, reviews, and related products — all loaded on-demand from secondary collections or cached after the first page load.
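
The listing query then touches only the subset fields; a sketch assuming a thumbnail_url field on the product documents:

// Search results: 50 lightweight documents, subset fields only
db.products.find(
  { "category": "laptop" },
  { "name": 1, "price": 1, "rating_avg": 1, "thumbnail_url": 1 }
).sort({ "rating_avg": -1 }).limit(50)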

Computed Pattern

Description

The Computed Pattern stores the result of an expensive computation directly in the document instead of recalculating it on every read. The computation runs once (on write, via a scheduled job, or via a change stream) and subsequent reads are instant lookups.

Problem It Solves

Aggregations over large datasets — sums, averages, counts, or complex transformations — are expensive. Running them on every page load wastes CPU and increases latency.

When to Use

  • Read-heavy workloads where the same computation runs repeatedly
  • Aggregation pipelines that scan many documents (summing order totals, averaging ratings)
  • Computations that do not need real-time precision (acceptable staleness window)
  • Dashboard metrics, leaderboards, summary statistics

Schema Example

An e-commerce store needs to show total revenue per product. Instead of summing all orders every time:

// Computed field stored directly on the document
db.products.updateOne(
  { "_id": ObjectId("prod001") },
  { $set: { "total_revenue": 152340.50, "total_units_sold": 1124 } }
)

Aggregation Pipeline (Without Computed Pattern — Slow)

// Every page load: scan thousands of orders, group, sum — expensive
db.orders.aggregate([
  { $match: { "product_id": ObjectId("prod001") } },
  { $group: {
      _id: "$product_id",
      total_revenue: { $sum: { $multiply: ["$price", "$quantity"] } },
      total_units: { $sum: "$quantity" }
  }}
])

Automating Computation with a Change Stream

// Node.js driver — watch the orders collection and keep the computed totals current
// Matching inserts only keeps the $inc from counting the same order twice
const pipeline = [
  { $match: { "operationType": "insert" } }
]
const changeStream = db.collection("orders").watch(pipeline)

changeStream.on("change", (change) => {
  const order = change.fullDocument
  db.collection("products").updateOne(
    { "_id": order.product_id },
    { $inc: { "total_revenue": order.price * order.quantity, "total_units_sold": order.quantity } }
  )
})

Computation via Scheduled Job (Cron)

// Run every hour — recompute top seller metrics
db.products.aggregate([
  { $lookup: {
      from: "orders",
      localField: "_id",
      foreignField: "product_id",
      as: "orders"
  }},
  { $addFields: {
      "total_revenue": { $sum: { $map: {
          input: "$orders",
          in: { $multiply: ["$$this.price", "$$this.quantity"] }
      } } },
      "total_units_sold": { $sum: "$orders.quantity" }
  }},
  { $unset: "orders" },
  { $merge: { into: "products", on: "_id", whenMatched: "merge" } }
])

Trade-offs

  • Staleness — computed values are only as fresh as the last update
  • Write overhead — every relevant write must trigger a recompute or increment update
  • Complexity — must decide how frequently to recompute (realtime, every N minutes, daily)

Real-World Use Case

A leaderboard for a gaming platform. Computing player ranks by summing scores across thousands of matches per player is expensive. Instead, maintain a players.rank_score field that updates incrementally after each match:

// After each match, increment the winner's score atomically
db.players.updateOne(
  { "_id": match.winner_id },
  { $inc: { "rank_score": match.points_earned, "matches_played": 1 } }
)
// Leaderboard query is now a simple sort
db.players.find().sort({ "rank_score": -1 }).limit(100)

Schema Versioning Pattern

Description

The Schema Versioning Pattern allows documents with different schema versions to coexist in the same collection. Each document carries a schema_version field (or equivalent) that tells the application how to interpret it. The application handles both versions transparently.

Problem It Solves

Schema evolution in production. Changing a document’s structure usually requires a migration — updating every document. For large collections with hundreds of millions of documents, a full migration can take hours or days, requiring downtime or complex blue-green deployments.

When to Use

  • Schema changes are frequent or unpredictable
  • Collection is too large to migrate in one deploy cycle
  • Zero-downtime deploys are required
  • Different services with different code versions access the same collection

Schema Example

Version 1 of a user profile:

db.users.insertOne({
  "_id": ObjectId("..."),
  "schema_version": 1,
  "username": "alice",
  "email": "[email protected]",
  "full_name": "Alice Smith",
  "address": "456 Oak Ave, Portland, OR 97201",
  "created_at": ISODate("2024-01-01")
})

Version 2 splits address into components and adds phone:

db.users.insertOne({
  "_id": ObjectId("..."),
  "schema_version": 2,
  "username": "bob",
  "email": "[email protected]",
  "full_name": "Bob Jones",
  "address": {
    "street": "123 Pine St",
    "city": "Seattle",
    "state": "WA",
    "zip": "98101"
  },
  "phone": "+1-555-7777",
  "created_at": ISODate("2026-04-01")
})

Application Code Handling Versions

// Application layer — read both versions transparently
function formatUserDisplay(user) {
  if (user.schema_version === 1) {
    return {
      displayName: user.full_name,
      location: user.address      // flat string
    }
  }
  if (user.schema_version === 2) {
    return {
      displayName: user.full_name,
      location: `${user.address.city}, ${user.address.state}`
    }
  }
  throw new Error(`Unknown schema version: ${user.schema_version}`)
}

Migration Strategy (Lazy Migration)

Instead of migrating all at once, migrate documents as they are accessed:

// On read, migrate in place
function readUser(userId) {
  const user = db.users.findOne({ "_id": userId })
  if (user.schema_version < 2) {
    // "456 Oak Ave, Portland, OR 97201" -> ["456 Oak Ave", "Portland", "OR 97201"]
    const parts = user.address.split(", ")
    const [state, zip] = parts[2].split(" ")
    const newAddress = { "street": parts[0], "city": parts[1], "state": state, "zip": zip }
    db.users.updateOne(
      { "_id": userId },
      { $set: { "schema_version": 2, "address": newAddress } }
    )
    user.schema_version = 2
    user.address = newAddress
  }
  return user
}

Batch Migration Script

// Background job — migrate 1000 documents at a time
let migrated = 0
const batchSize = 1000
while (true) {
  const batch = db.users.find({ "schema_version": { $lt: 2 } }).limit(batchSize).toArray()
  if (batch.length === 0) break
  for (const user of batch) {
    // "456 Oak Ave, Portland, OR 97201" -> street, city, "STATE ZIP"
    const parts = user.address.split(", ")
    const [state, zip] = parts[2].split(" ")
    db.users.updateOne(
      { "_id": user._id },
      { $set: {
          "schema_version": 2,
          "address": { "street": parts[0], "city": parts[1], "state": state, "zip": zip }
      }}
    )
  }
  migrated += batch.length
  print(`Migrated ${migrated} documents so far...`)
}

Trade-offs

  • Application code must handle multiple versions — adds conditional logic
  • Old data may never be migrated if you rely solely on lazy migration
  • Indexes must work across all versions — a field that exists in v2 but not v1 will be indexed as null for v1 documents
  • Eventual consistency — different services might see different versions of the same document

Real-World Use Case

SaaS platforms where different customers have different feature sets. A CRM might add custom fields per customer. New fields are added to documents as the customer enables them, and old documents without those fields are handled gracefully by the application.
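
A sketch of the read side under that model, assuming a hypothetical contacts collection with an optional custom_fields sub-document:

// Old documents simply lack the field; the application falls back to a default
function getCustomField(contact, fieldName, fallback = null) {
  if (contact.custom_fields && fieldName in contact.custom_fields) {
    return contact.custom_fields[fieldName]
  }
  return fallback
}

// Query only the contacts that actually carry the new field
db.contacts.find({ "custom_fields.lead_source": { $exists: true } })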

Bucket Pattern

Description

The Bucket Pattern groups related time-series or sequential data points into a single document instead of storing each data point as its own document. A “bucket” covers a time window (e.g., one hour) and contains an array of data points within that window.

Problem It Solves

Time-series data — sensor readings, logs, metrics — produces a massive number of documents. Each document has overhead (12 bytes for _id, index entries, document metadata). Storing one document per data point wastes storage and degrades write throughput.

When to Use

  • IoT sensor data, application logs, stock tickers, analytics events
  • Data points arrive frequently (every second, every minute)
  • Queries typically ask for a range of time rather than individual points
  • Reducing document count per time range improves index efficiency

Schema Example

Instead of one document per temperature reading:

// BAD: one document per reading — millions of tiny documents
db.sensor_data.insertOne({ "sensor_id": "sensor01", "ts": ISODate("2026-04-26T10:00:00Z"), "temp": 22.5 })
db.sensor_data.insertOne({ "sensor_id": "sensor01", "ts": ISODate("2026-04-26T10:01:00Z"), "temp": 22.7 })
db.sensor_data.insertOne({ "sensor_id": "sensor01", "ts": ISODate("2026-04-26T10:02:00Z"), "temp": 22.6 })

Group readings into hourly buckets:

// GOOD: one document per sensor per hour — 24 documents/day instead of 1440
db.sensor_data.insertOne({
  "sensor_id": "sensor01",
  "bucket_start": ISODate("2026-04-26T10:00:00Z"),
  "bucket_end": ISODate("2026-04-26T11:00:00Z"),
  "readings_count": 60,
  "readings": [
    { "t": ISODate("2026-04-26T10:00:00Z"), "v": 22.5 },
    { "t": ISODate("2026-04-26T10:01:00Z"), "v": 22.7 },
    { "t": ISODate("2026-04-26T10:02:00Z"), "v": 22.6 }
  ],
  "stats": {
    "min": 22.1,
    "max": 23.0,
    "avg": 22.6
  }
})

Query Example

Get all readings for a sensor in a time range:

db.sensor_data.find({
  "sensor_id": "sensor01",
  "bucket_start": { $gte: ISODate("2026-04-26T00:00:00Z") },
  "bucket_end": { $lte: ISODate("2026-04-27T00:00:00Z") }
})

Unwind to get individual readings:

db.sensor_data.aggregate([
  { $match: { "sensor_id": "sensor01", "bucket_start": { $gte: ISODate("2026-04-26T10:00:00Z") } } },
  { $unwind: "$readings" },
  { $match: { "readings.t": { $gte: ISODate("2026-04-26T10:05:00Z"), $lte: ISODate("2026-04-26T10:15:00Z") } } }
])

Pre-allocating Buckets

For predictable time-series data, pre-allocate buckets and use $push to append:

// Pre-allocate the hour bucket on first write
db.sensor_data.updateOne(
  { "sensor_id": "sensor01", "bucket_start": ISODate("2026-04-26T11:00:00Z") },
  { $setOnInsert: {
      "sensor_id": "sensor01",
      "bucket_start": ISODate("2026-04-26T11:00:00Z"),
      "bucket_end": ISODate("2026-04-26T12:00:00Z"),
      "readings": [],
      "readings_count": 0
  }},
  { upsert: true }
)
// Append reading to existing bucket
db.sensor_data.updateOne(
  { "sensor_id": "sensor01", "bucket_start": ISODate("2026-04-26T11:00:00Z") },
  { $push: { "readings": { "t": new Date(), "v": 22.8 } }, $inc: { "readings_count": 1 } }
)

Trade-offs

  • Maximum document size (16 MB) limits bucket capacity — at 100 bytes per reading, one bucket holds ~160,000 readings
  • Write amplification — each $push grows the document, forcing the storage engine to rewrite it on disk
  • Query complexity — individual point lookups require $unwind or $elemMatch
  • Choosing bucket boundaries requires domain knowledge (hourly, daily, by event count)

Real-World Use Case

IoT temperature monitoring across a warehouse. Each sensor reports every 30 seconds. 50 sensors × 2 readings/minute × 60 minutes × 24 hours = 144,000 documents/day without bucketing. With hourly buckets, that becomes 50 × 24 = 1,200 documents/day — a 99% reduction.

Approximation Pattern

Description

The Approximation Pattern stores a calculated estimate instead of an exact value. The computation for exact values is expensive or involves large scans, but the use case tolerates small errors.

Problem It Solves

Exact counts or aggregations over large collections are expensive. An approximation — computed periodically or sampled — serves the same practical purpose with a fraction of the cost.

When to Use

  • Exact precision is not required — counts within 5-10% are acceptable
  • Expensive computations like $count over millions of documents, or distinct value counts on high-cardinality fields
  • Dashboards, badges, “trending” indicators, counters that drive UI state

Schema Example

Instead of db.posts.countDocuments({ "author": "alice" }) every time the profile page loads:

// Computed approximation — refreshed daily by a scheduled job
db.users.updateOne(
  { "username": "alice" },
  { $set: { "approx_post_count": 342 } }  // real count was 347 at the last refresh
)
// Display on profile — 342 vs 347 makes no practical difference
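
The daily refresh itself can be a single aggregation. A minimal sketch, assuming a users collection with a unique index on username (required by $merge):

// Nightly job: recount posts per author and write the results back to users
db.posts.aggregate([
  { $group: { _id: "$author", approx_post_count: { $sum: 1 } } },
  { $project: { _id: 0, username: "$_id", approx_post_count: 1 } },
  { $merge: { into: "users", on: "username", whenMatched: "merge", whenNotMatched: "discard" } }
])

// Profile page read: one indexed lookup instead of a count over all posts
db.users.findOne({ "username": "alice" }, { "approx_post_count": 1 })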

Using MongoDB’s $sample for Statistical Approximation

// Approximate the average order value without scanning all orders
db.orders.aggregate([
  { $sample: { size: 10000 } },          // random sample of 10k orders
  { $group: {
      _id: null,
      approx_avg_value: { $avg: "$total" },
      approx_total_count: { $sum: 1 }
  }},
  { $project: {
      approx_avg_value: 1,
      // Scale the sample count by the known sampling fraction
      // (here assuming the full collection holds roughly 100x the 10k sample)
      approx_total_orders: { $multiply: ["$approx_total_count", 100] }
  }}
])

Trade-offs

  • Inherent inaccuracy — must know the error tolerance of your stakeholders
  • Batch update cost — if updated too frequently, the approximation loses its advantage
  • Not suitable for financial calculations, inventory counts, or any system where exact values are required by regulation

Real-World Use Case

“Trending topics” on a social platform. Exact engagement counts require scanning millions of interactions. Instead, sample a fraction of interactions, compute relative scores, and display top trending items. Whether “Azure Functions” has 12,450 or 12,847 mentions does not change whether it is trending.

Outlier Pattern

Description

The Outlier Pattern handles documents that do not fit the normal schema or access pattern. Instead of making every document handle the exceptional case — adding nullable fields, bloating indexes — isolate the outlier behavior into a separate field or separate collection.

Problem It Solves

A small percentage of documents require more fields, larger arrays, or different queries than the majority. Applying the general-purpose solution to all documents penalizes the 99% for the sake of the 1%.

When to Use

  • A small subset of documents has significantly more data than the rest (e.g., a viral post with 100k comments vs. a normal post with 20 comments)
  • Power users need additional fields that 99% of users do not use
  • Documents that would exceed the 16 MB limit if treated uniformly

Schema Example

A social media post — most posts have a handful of comments, but a viral post has hundreds of thousands. Instead of embedding all comments (which would cause the document to grow unboundedly), store normal posts with embedded comments and outlier posts with a comment reference:

// Normal post — comments embedded (99.9% of posts)
db.posts.insertOne({
  "_id": ObjectId("post001"),
  "title": "My first day with Go",
  "content": "I started learning Go today...",
  "comments": [
    { "author": "dev_guru", "text": "Welcome to Go!", "created_at": ISODate("2026-04-26") },
    { "author": "rust_fan", "text": "You should try Rust too.", "created_at": ISODate("2026-04-26") }
  ],
  "comment_count": 2
})

// Viral post — outlier detected, comments stored separately
db.posts.insertOne({
  "_id": ObjectId("post002"),
  "title": "How I deployed to production on a Friday",
  "content": "It was a calm afternoon...",
  "comments": [],                         // no embedded comments
  "comment_count": 143892,
  "outlier": true,                        // flag for the application
  "comments_ref": "post002_comments"      // reference to separate collection
})

// Separate collection for outlier comments
db.post002_comments.insertOne({
  "post_id": ObjectId("post002"),
  "author": "sre_ninja",
  "text": "This is why we have change advisory boards.",
  "created_at": ISODate("2026-04-26")
})

Application Logic for Outlier Detection

function getComments(postId) {
  const post = db.posts.findOne({ "_id": postId })
  if (post.outlier) {
    // Use paginated query for viral posts
    return db[post.comments_ref].find({ "post_id": postId })
      .sort({ "created_at": -1 })
      .limit(20)
      .toArray()
  }
  // Normal post — return embedded comments directly
  return post.comments
}
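
The write path is where a post gets promoted to outlier status. A sketch assuming an illustrative threshold of 1,000 embedded comments and the per-post overflow collection used above (concurrency is ignored for brevity):

const OUTLIER_THRESHOLD = 1000   // illustrative cutoff

function addComment(postId, comment) {
  const post = db.posts.findOne({ "_id": postId })
  if (post.outlier) {
    // Already an outlier: new comments go straight to the overflow collection
    db.getCollection(post.comments_ref).insertOne({ "post_id": postId, ...comment })
    db.posts.updateOne({ "_id": postId }, { $inc: { "comment_count": 1 } })
    return
  }
  // Normal post: embed the comment
  db.posts.updateOne(
    { "_id": postId },
    { $push: { "comments": comment }, $inc: { "comment_count": 1 } }
  )
  if (post.comment_count + 1 >= OUTLIER_THRESHOLD) {
    // Promote: move the embedded comments out and flag the document
    const overflowName = `${postId}_comments`
    const current = db.posts.findOne({ "_id": postId })
    db.getCollection(overflowName).insertMany(
      current.comments.map(c => ({ "post_id": postId, ...c }))
    )
    db.posts.updateOne(
      { "_id": postId },
      { $set: { "comments": [], "outlier": true, "comments_ref": overflowName } }
    )
  }
}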

Trade-offs

  • Application logic must check for the outlier flag on every read
  • Two different access patterns — more code paths to test
  • Outlier threshold must be tuned — set too low and most documents become “outliers,” defeating the purpose
  • Indexes on the main collection do not cover outlier data

Real-World Use Case

A GitHub repository with millions of stars. Storing all stargazer data in the repo document is impossible — even just user IDs would exceed 16 MB. Normal repos store a stargazer count only. For the top 0.1% of repos, detailed stargazer data is stored in a separate collection and accessed only when the user clicks “View stargazers.”

Decision Matrix

  • Attribute: best for polymorphic fields and shared-field queries across sparse schemas. Trade-offs: more complex queries with $elemMatch, loss of self-documenting field names. Avoid when fields are fixed and well known, with simple queries.
  • Extended Reference: best for reducing joins on frequently-read reference data. Trade-offs: data duplication, staleness risk, update propagation. Avoid when the data changes frequently or must be transactionally consistent.
  • Subset: best when the working set is too large for RAM in read-heavy workloads with cold data. Trade-offs: two-collection management, pagination overhead. Avoid when all data is accessed uniformly or writes dominate reads.
  • Computed: best for expensive aggregations on read-heavy workloads. Trade-offs: staleness window, write amplification from recomputes. Avoid when real-time accuracy is required or the workload is write-heavy.
  • Schema Versioning: best for zero-downtime schema evolution on large collections. Trade-offs: application complexity, lazy migration leaves old data unmigrated. Avoid when the collection is small enough to migrate quickly in one deploy.
  • Bucket: best for time-series data with frequent, small data points. Trade-offs: 16 MB document limit, more complex queries for individual points. Avoid when data points arrive infrequently or point queries must be low-latency.
  • Approximation: best for large-scale counts and aggregations where precision is not critical. Trade-offs: inherent inaccuracy, batch update cost. Avoid in financial or regulatory contexts and inventory management.
  • Outlier: best when a small percentage of documents carries extreme data volumes. Trade-offs: two access patterns, threshold tuning, more code paths. Avoid when most documents are outliers (rethink the schema) or data is uniformly distributed.

Practice Exercises

Exercise 1: Pattern Identification

For each scenario, identify the best pattern:

  1. A weather service collects temperature and humidity readings every 10 seconds from 10,000 sensors.
  2. An e-commerce site displays a product page with supplier name and phone. Supplier has 50 other fields stored elsewhere.
  3. A social media app shows a “trending” counter on posts — exact counts are expensive but a daily update suffices.
  4. A CRM needs to add custom fields per customer without downtime or migration scripts.
  5. A blog platform embeds comments in posts, but a viral post has 500,000 comments and causes document growth issues.

Exercise 2: Schema Design

Design a schema for a movie review platform:

  • Movies have titles, release dates, genres, and directors
  • Users write reviews with ratings (1-5) and text
  • The homepage shows the top 10 movies by average rating
  • A movie’s page shows its title, average rating, and the 5 most recent reviews
  • Some movies (blockbusters) have 50,000+ reviews; most have under 100
  • Genres and directors are fixed sets that rarely change

Which patterns would you apply and why? Write the schema documents.

Exercise 3: Aggregation Optimization

Given 10 million order documents, each with customer_id, total, and created_at, you need to show a dashboard with monthly revenue totals. The dashboard is viewed 5000 times per day. The aggregation takes 30 seconds to run. Apply the Computed Pattern — write the aggregation pipeline and the update strategy for maintaining current-month data.
