

MongoDB Data Modeling: Patterns

Created: June 12, 2021

Introduction

MongoDB’s document model is flexible, but flexibility without structure leads to poorly performing databases. Schema design patterns solve recurring data modeling problems — they are battle-tested approaches for document structure, access patterns, and query optimization.

This guide covers eight essential patterns: Attribute, Extended Reference, Subset, Computed, Schema Versioning, Bucket, Approximation, and Outlier. Each pattern solves a specific class of problem, from handling polymorphic fields to managing working set size. You will see concrete schema examples, query patterns, trade-offs, and real-world use cases for each.

The Cost of Poor Schema Design

Before diving into patterns, understand what bad schema design costs:

  • Unnecessary $lookup operations — joining across collections when embedding or an extended reference would serve better, or issuing one query per parent document (the classic N+1 problem)
  • Oversized working sets — documents carrying rarely-used fields force MongoDB to page through more data than needed, degrading cache performance
  • Repeated computations — running the same aggregation pipeline every time a page loads instead of storing pre-computed results
  • Schema migration downtime — every application deploy requiring a data migration script because the schema changed

Patterns exist to avoid these costs. Choose the right one for your access patterns and the trade-offs become acceptable.

Attribute Pattern

Description

The Attribute Pattern transforms a set of related fields that share the same characteristics into an array of key-value sub-documents. Instead of having many similar top-level fields, you store them as an array of { k: "<key>", v: "<value>" } pairs. This makes the schema polymorphic — documents in the same collection can have different sets of attributes.

Problem It Solves

You have fields that share a number of common characteristics (same data type, same semantic purpose) and you need to search across them. Without the pattern, you end up with sparse documents containing many fields where most are null.

When to Use

  • Fields that share the same characteristics — for example, product specifications, movie metadata, or event dates across multiple venues
  • Need to index and query across many similar fields
  • Schema where the set of possible attributes is large but any single document uses a small subset
  • Polymorphic data where different document types need different fields

Schema Example

Before the Attribute Pattern, a museum collection document might look like this:

// BAD: sparse fields, hard to query across event venues
{
  "_id": ObjectId("..."),
  "title": "Starry Night",
  "artist": "van Gogh",
  "date_acquisition": ISODate("1941-01-01"),
  "location": "Gallery 502",
  "on_display": true,
  "in_house": true,
  "events": {
    "moma": ISODate("1980-06-01"),
    "louvres": ISODate("1985-09-15"),
    "met": ISODate("1990-03-20")
  }
}

After applying the Attribute Pattern:

// GOOD: polymorphic, indexable, add venues without altering schema
{
  "_id": ObjectId("..."),
  "title": "Starry Night",
  "artist": "van Gogh",
  "location": "Gallery 502",
  "on_display": true,
  "in_house": true,
  "events": [
    { "k": "moma", "v": ISODate("1980-06-01") },
    { "k": "louvres", "v": ISODate("1985-09-15") },
    { "k": "met", "v": ISODate("1990-03-20") }
  ]
}

Query Example

Search across all event venues:

// Find all artworks that were ever exhibited at the Louvre
db.artworks.find({ "events.k": "louvres" })

// Find artworks exhibited at MOMA after 1985
db.artworks.find({
  "events": { $elemMatch: { k: "moma", v: { $gte: ISODate("1985-01-01") } } }
})

Index Creation

db.artworks.createIndex({ "events.k": 1, "events.v": 1 })

Trade-offs

  • Queries become slightly more complex — you must use $elemMatch or dot notation on array elements
  • Application code must translate between key-value pairs and display-ready fields (see the helper sketch after this list)
  • Loses the self-documenting nature of typed top-level fields — "moma" as a field name conveys meaning more clearly than { k: "moma" }
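
In practice that translation is a few lines of application code. A minimal sketch (the helper name is illustrative), reshaping the events array from the artwork example above into a plain object for display:

// Hypothetical helper: turn the k/v array back into a display-ready object
function attrsToObject(attrs) {
  const out = {}
  for (const { k, v } of (attrs || [])) {
    out[k] = v
  }
  return out
}

const artwork = db.artworks.findOne({ "title": "Starry Night" })
// => { moma: ISODate("1980-06-01"), louvres: ISODate("1985-09-15"), met: ISODate("1990-03-20") }
const eventsByVenue = attrsToObject(artwork.events)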

Real-World Use Case

Product catalogs where different categories have different specifications. A laptop has CPU, RAM, and storage specs; a shirt has size, material, and color. Instead of separate collections or documents with hundreds of nullable fields, use the attribute pattern:

db.products.insertOne({
  "sku": "LAP-001",
  "name": "UltraBook Pro",
  "category": "laptop",
  "specs": [
    { "k": "cpu", "v": "Intel i7-13700H" },
    { "k": "ram", "v": "32GB DDR5" },
    { "k": "storage", "v": "1TB NVMe" },
    { "k": "screen", "v": "14-inch 2880x1800" }
  ]
})
db.products.insertOne({
  "sku": "SHT-002",
  "name": "Cotton Oxford",
  "category": "shirt",
  "specs": [
    { "k": "material", "v": "100% Cotton" },
    { "k": "size", "v": "M" },
    { "k": "color", "v": "Navy" }
  ]
})

Extended Reference Pattern

Description

The Extended Reference Pattern copies frequently-accessed fields from a related collection into the current document. Instead of storing only a foreign key (supplier_id) and performing a lookup every time, you embed the most commonly accessed fields from the supplier directly in the product document.

Problem It Solves

Every time you display a product page, you also need supplier information — name, phone number, maybe a rating. Doing a $lookup for each product creates N+1 performance problems. But you also do not want to embed the entire supplier document (which might contain billing addresses, contract terms, etc.) because it bloats the working set.
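
For contrast, this is roughly the per-read join the pattern eliminates; a sketch assuming products carry only a supplier_id foreign key, as described above:

// Without the pattern: every product read runs a $lookup against suppliers
db.products.aggregate([
  { $match: { "sku": "WG-X200" } },
  { $lookup: {
      from: "suppliers",
      localField: "supplier_id",
      foreignField: "_id",
      as: "supplier"
  }},
  { $unwind: "$supplier" },
  { $project: { "name": 1, "price": 1, "supplier.name": 1, "supplier.phone": 1 } }
])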

When to Use

  • One-to-many relationships where the “many” side is frequently read together with a small subset of the “one” side’s data
  • Reference data that changes infrequently — supplier name, country code, category label
  • Reducing joins in read-heavy workloads

Schema Example

// suppliers collection (source of truth)
db.suppliers.insertOne({
  "_id": ObjectId("sup001"),
  "name": "Acme Corp",
  "contact_name": "Jane Doe",
  "phone": "+1-555-0123",
  "email": "[email protected]",
  "address": {
    "street": "123 Industrial Blvd",
    "city": "Detroit",
    "state": "MI",
    "zip": "48201"
  },
  "billing_terms": "NET30",
  "contract_expires": ISODate("2027-01-01")
})

// products collection (extended reference — carries frequently-used supplier fields)
db.products.insertOne({
  "_id": ObjectId("prod001"),
  "name": "Widget X-200",
  "price": 29.99,
  "sku": "WG-X200",
  "supplier": {
    "_id": ObjectId("sup001"),
    "name": "Acme Corp",
    "phone": "+1-555-0123"
  }
})

Query Example

Retrieving a product with supplier info requires no join:

db.products.find({ "sku": "WG-X200" }, {
  "name": 1,
  "price": 1,
  "supplier.name": 1,
  "supplier.phone": 1
})

When you need full supplier details, fall back to the source collection:

const product = db.products.findOne({ "sku": "WG-X200" })
const fullSupplier = db.suppliers.findOne({ "_id": product.supplier._id })

Handling Stale Data

The trade-off is redundancy. When a supplier changes their phone number, you must update it everywhere:

// Update the source
db.suppliers.updateOne(
  { "_id": ObjectId("sup001") },
  { $set: { "phone": "+1-555-9999" } }
)
// Update all product copies (batch or background job)
db.products.updateMany(
  { "supplier._id": ObjectId("sup001") },
  { $set: { "supplier.phone": "+1-555-9999" } }
)

Use a scheduled batch job or a change stream to propagate updates if consistency is critical.
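
A minimal sketch of the change-stream approach, assuming the Node.js driver (db is a Db handle from MongoClient); only the copied fields are propagated:

// Watch supplier updates and push the copied fields to every product that references them
const supplierStream = db.collection("suppliers").watch(
  [{ $match: { "operationType": "update" } }],
  { fullDocument: "updateLookup" }   // deliver the full post-update supplier document
)

supplierStream.on("change", async (change) => {
  const supplier = change.fullDocument
  await db.collection("products").updateMany(
    { "supplier._id": supplier._id },
    { $set: { "supplier.name": supplier.name, "supplier.phone": supplier.phone } }
  )
})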

Trade-offs

  • Data duplication — updates must propagate to every copy
  • Staleness risk — copies can drift from the source of truth
  • Application complexity — must decide when to use the embedded copy vs. the full source

Real-World Use Case

An e-commerce order system. Every order document embeds the customer’s shipping address and name at the time of purchase. The customer may change their address later, but the order should retain the address that was used for fulfillment:

db.orders.insertOne({
  "order_id": "ORD-2026-001",
  "items": [ { "sku": "WG-X200", "qty": 2, "price": 29.99 } ],
  "total": 59.98,
  "customer": {
    "_id": ObjectId("cust001"),
    "name": "Alice Smith",
    "shipping_address": {
      "street": "456 Oak Ave",
      "city": "Portland",
      "state": "OR",
      "zip": "97201"
    }
  },
  "created_at": ISODate("2026-04-20")
})

Subset Pattern

Description

The Subset Pattern splits a document into two parts: a frequently-accessed subset and a rarely-accessed full document. The “hot” portion stays in the primary collection (fits in working set memory), while the “cold” details live in a separate collection or as a secondary document.

Problem It Solves

The working set — the data and indexes MongoDB keeps in RAM for optimal performance — grows too large. When it exceeds available memory, MongoDB must page data from disk, degrading read performance. The solution is to reduce the size of frequently-accessed documents.

When to Use

  • Working set exceeds available RAM
  • Documents contain large fields (arrays of comments, full-text descriptions, images as base64) that are rarely accessed in the primary query
  • Scaling vertically (adding RAM) is expensive or impractical
  • Read-heavy workload where most queries only need a subset of fields

Schema Example

Product reviews are a classic case. A product page shows ratings, a snippet, and maybe 5 reviews. But the product might have 10,000 reviews.

// Primary collection — frequently accessed, small documents
db.products.insertOne({
  "_id": ObjectId("prod001"),
  "name": "UltraBook Pro",
  "price": 1299.99,
  "rating_avg": 4.5,
  "review_count": 10234,
  "top_reviews": [
    {
      "review_id": ObjectId("rev9999"),
      "rating": 5,
      "headline": "Best laptop I have owned",
      "author": "tech_guru",
      "created_at": ISODate("2026-04-01")
    },
    {
      "review_id": ObjectId("rev10000"),
      "rating": 4,
      "headline": "Great but battery could be better",
      "author": "nomad_dev",
      "created_at": ISODate("2026-04-15")
    }
  ]
})

// Secondary collection — full details, less frequently accessed
db.reviews.insertOne({
  "_id": ObjectId("rev9999"),
  "product_id": ObjectId("prod001"),
  "rating": 5,
  "headline": "Best laptop I have owned",
  "body": "I have been using this for three months. The build quality is exceptional, the screen is gorgeous, and it handles Docker, Kubernetes, and multiple VMs without breaking a sweat...",
  "author": "tech_guru",
  "helpful_count": 342,
  "created_at": ISODate("2026-04-01")
})

Query Example

Pagination for full reviews (cold data):

db.reviews.find({ "product_id": ObjectId("prod001") })
  .sort({ "created_at": -1 })
  .skip(20)
  .limit(10)

Index Strategy

// Primary collection index — light, fits in RAM
db.products.createIndex({ "rating_avg": -1 })

// Secondary collection index — supports pagination
db.reviews.createIndex({ "product_id": 1, "created_at": -1 })

Trade-offs

  • The application must manage two collections and decide what belongs in the subset vs. the full document
  • Pagination requires a query to the secondary collection, which may hit disk
  • Consistency concern: the review_count, rating_avg, and top_reviews subset in the product document must be kept in sync with the reviews collection (see the write sketch after this list)
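
A write-path sketch for keeping the subset current, assuming the schema above; $push with $each, $sort, and $slice caps the embedded list at the five newest reviews (newReview is a placeholder for the incoming document):

// Write the full review, then refresh the product's embedded subset
db.reviews.insertOne(newReview)
db.products.updateOne(
  { "_id": newReview.product_id },
  {
    $inc: { "review_count": 1 },
    $push: {
      "top_reviews": {
        $each: [{
          "review_id": newReview._id,
          "rating": newReview.rating,
          "headline": newReview.headline,
          "author": newReview.author,
          "created_at": newReview.created_at
        }],
        $sort: { "created_at": -1 },
        $slice: 5          // keep only the five newest in the subset
      }
    }
  }
)

The rating_avg field can be maintained the same way with an incremental update, as described under the Computed Pattern.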

Real-World Use Case

E-commerce product listings. A search results page shows 50 products with name, price, rating, and thumbnail URL. That is the subset. The full product page includes descriptions, specifications, reviews, and related products — all loaded on-demand from secondary collections or cached after the first page load.
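
The listing query then touches only the subset fields; a sketch assuming a thumbnail_url field on the product documents:

// Search results: 50 lightweight documents, subset fields only
db.products.find(
  { "category": "laptop" },
  { "name": 1, "price": 1, "rating_avg": 1, "thumbnail_url": 1 }
).sort({ "rating_avg": -1 }).limit(50)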

Computed Pattern

Description

The Computed Pattern stores the result of an expensive computation directly in the document instead of recalculating it on every read. The computation runs once (on write, via a scheduled job, or via a change stream) and subsequent reads are instant lookups.

Problem It Solves

Aggregations over large datasets — sums, averages, counts, or complex transformations — are expensive. Running them on every page load wastes CPU and increases latency.

When to Use

  • Read-heavy workloads where the same computation runs repeatedly
  • Aggregation pipelines that scan many documents (summing order totals, averaging ratings)
  • Computations that do not need real-time precision (acceptable staleness window)
  • Dashboard metrics, leaderboards, summary statistics

Schema Example

An e-commerce store needs to show total revenue per product. Instead of summing all orders every time:

// Computed field stored directly on the document
db.products.updateOne(
  { "_id": ObjectId("prod001") },
  { $set: { "total_revenue": 152340.50, "total_units_sold": 1124 } }
)

Aggregation Pipeline (Without Computed Pattern — Slow)

// Every page load: scan thousands of orders, group, sum — expensive
db.orders.aggregate([
  { $match: { "product_id": ObjectId("prod001") } },
  { $group: {
      _id: "$product_id",
      total_revenue: { $sum: { $multiply: ["$price", "$quantity"] } },
      total_units: { $sum: "$quantity" }
  }}
])

Automating Computation with a Change Stream

// Node.js driver — watch the orders collection and keep the computed totals current
// Matching inserts only keeps the $inc from counting the same order twice
const pipeline = [
  { $match: { "operationType": "insert" } }
]
const changeStream = db.collection("orders").watch(pipeline)

changeStream.on("change", (change) => {
  const order = change.fullDocument
  db.collection("products").updateOne(
    { "_id": order.product_id },
    { $inc: { "total_revenue": order.price * order.quantity, "total_units_sold": order.quantity } }
  )
})

Computation via Scheduled Job (Cron)

// Run every hour — recompute top seller metrics
db.products.aggregate([
  { $lookup: {
      from: "orders",
      localField: "_id",
      foreignField: "product_id",
      as: "orders"
  }},
  { $addFields: {
      "total_revenue": { $sum: { $map: {
          input: "$orders",
          in: { $multiply: ["$$this.price", "$$this.quantity"] }
      } } },
      "total_units_sold": { $sum: "$orders.quantity" }
  }},
  { $unset: "orders" },
  { $merge: { into: "products", on: "_id", whenMatched: "merge" } }
])

Trade-offs

  • Staleness — computed values are only as fresh as the last update
  • Write overhead — every relevant write must trigger a recompute or increment update
  • Complexity — must decide how frequently to recompute (realtime, every N minutes, daily)

Real-World Use Case

A leaderboard for a gaming platform. Computing player ranks by summing scores across thousands of matches per player is expensive. Instead, maintain a players.rank_score field that updates incrementally after each match:

// After each match, increment the winner's score atomically
db.players.updateOne(
  { "_id": match.winner_id },
  { $inc: { "rank_score": match.points_earned, "matches_played": 1 } }
)
// Leaderboard query is now a simple sort
db.players.find().sort({ "rank_score": -1 }).limit(100)

Schema Versioning Pattern

Description

The Schema Versioning Pattern allows documents with different schema versions to coexist in the same collection. Each document carries a schema_version field (or equivalent) that tells the application how to interpret it. The application handles both versions transparently.

Problem It Solves

Schema evolution in production. Changing a document’s structure usually requires a migration — updating every document. For large collections with hundreds of millions of documents, a full migration can take hours or days, requiring downtime or complex blue-green deployments.

When to Use

  • Schema changes are frequent or unpredictable
  • Collection is too large to migrate in one deploy cycle
  • Zero-downtime deploys are required
  • Different services with different code versions access the same collection

Schema Example

Version 1 of a user profile:

db.users.insertOne({
  "_id": ObjectId("..."),
  "schema_version": 1,
  "username": "alice",
  "email": "[email protected]",
  "full_name": "Alice Smith",
  "address": "456 Oak Ave, Portland, OR 97201",
  "created_at": ISODate("2024-01-01")
})

Version 2 splits address into components and adds phone:

db.users.insertOne({
  "_id": ObjectId("..."),
  "schema_version": 2,
  "username": "bob",
  "email": "[email protected]",
  "full_name": "Bob Jones",
  "address": {
    "street": "123 Pine St",
    "city": "Seattle",
    "state": "WA",
    "zip": "98101"
  },
  "phone": "+1-555-7777",
  "created_at": ISODate("2026-04-01")
})

Application Code Handling Versions

// Application layer — read both versions transparently
function formatUserDisplay(user) {
  if (user.schema_version === 1) {
    return {
      displayName: user.full_name,
      location: user.address      // flat string
    }
  }
  if (user.schema_version === 2) {
    return {
      displayName: user.full_name,
      location: `${user.address.city}, ${user.address.state}`
    }
  }
  throw new Error(`Unknown schema version: ${user.schema_version}`)
}

Migration Strategy (Lazy Migration)

Instead of migrating all at once, migrate documents as they are accessed:

// On read, migrate in place
function readUser(userId) {
  const user = db.users.findOne({ "_id": userId })
  if (user.schema_version < 2) {
    // "456 Oak Ave, Portland, OR 97201" -> ["456 Oak Ave", "Portland", "OR 97201"]
    const parts = user.address.split(", ")
    const [state, zip] = parts[2].split(" ")
    const newAddress = { "street": parts[0], "city": parts[1], "state": state, "zip": zip }
    db.users.updateOne(
      { "_id": userId },
      { $set: { "schema_version": 2, "address": newAddress } }
    )
    user.schema_version = 2
    user.address = newAddress
  }
  return user
}

Batch Migration Script

// Background job — migrate 1000 documents at a time
let migrated = 0
const batchSize = 1000
while (true) {
  const batch = db.users.find({ "schema_version": { $lt: 2 } }).limit(batchSize).toArray()
  if (batch.length === 0) break
  for (const user of batch) {
    // "456 Oak Ave, Portland, OR 97201" -> street, city, "STATE ZIP"
    const parts = user.address.split(", ")
    const [state, zip] = parts[2].split(" ")
    db.users.updateOne(
      { "_id": user._id },
      { $set: {
          "schema_version": 2,
          "address": { "street": parts[0], "city": parts[1], "state": state, "zip": zip }
      }}
    )
  }
  migrated += batch.length
  print(`Migrated ${migrated} documents so far...`)
}

Trade-offs

  • Application code must handle multiple versions — adds conditional logic
  • Old data may never be migrated if you rely solely on lazy migration
  • Indexes must work across all versions — a field that exists in v2 but not v1 will be indexed as null for v1 documents
  • Eventual consistency — different services might see different versions of the same document

Real-World Use Case

SaaS platforms where different customers have different feature sets. A CRM might add custom fields per customer. New fields are added to documents as the customer enables them, and old documents without those fields are handled gracefully by the application.
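
A sketch of the read side under that model, assuming a hypothetical contacts collection with an optional custom_fields sub-document:

// Old documents simply lack the field; the application falls back to a default
function getCustomField(contact, fieldName, fallback = null) {
  if (contact.custom_fields && fieldName in contact.custom_fields) {
    return contact.custom_fields[fieldName]
  }
  return fallback
}

// Query only the contacts that actually carry the new field
db.contacts.find({ "custom_fields.lead_source": { $exists: true } })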

Bucket Pattern

Description

The Bucket Pattern groups related time-series or sequential data points into a single document instead of storing each data point as its own document. A “bucket” covers a time window (e.g., one hour) and contains an array of data points within that window.

Problem It Solves

Time-series data — sensor readings, logs, metrics — produces a massive number of documents. Each document has overhead (12 bytes for _id, index entries, document metadata). Storing one document per data point wastes storage and degrades write throughput.

When to Use

  • IoT sensor data, application logs, stock tickers, analytics events
  • Data points arrive frequently (every second, every minute)
  • Queries typically ask for a range of time rather than individual points
  • Reducing document count per time range improves index efficiency

Schema Example

Instead of one document per temperature reading:

// BAD: one document per reading — millions of tiny documents
db.sensor_data.insertOne({ "sensor_id": "sensor01", "ts": ISODate("2026-04-26T10:00:00Z"), "temp": 22.5 })
db.sensor_data.insertOne({ "sensor_id": "sensor01", "ts": ISODate("2026-04-26T10:01:00Z"), "temp": 22.7 })
db.sensor_data.insertOne({ "sensor_id": "sensor01", "ts": ISODate("2026-04-26T10:02:00Z"), "temp": 22.6 })

Group readings into hourly buckets:

// GOOD: one document per sensor per hour — 24 documents/day instead of 1440
db.sensor_data.insertOne({
  "sensor_id": "sensor01",
  "bucket_start": ISODate("2026-04-26T10:00:00Z"),
  "bucket_end": ISODate("2026-04-26T11:00:00Z"),
  "readings_count": 60,
  "readings": [
    { "t": ISODate("2026-04-26T10:00:00Z"), "v": 22.5 },
    { "t": ISODate("2026-04-26T10:01:00Z"), "v": 22.7 },
    { "t": ISODate("2026-04-26T10:02:00Z"), "v": 22.6 }
  ],
  "stats": {
    "min": 22.1,
    "max": 23.0,
    "avg": 22.6
  }
})

Query Example

Get all readings for a sensor in a time range:

db.sensor_data.find({
  "sensor_id": "sensor01",
  "bucket_start": { $gte: ISODate("2026-04-26T00:00:00Z") },
  "bucket_end": { $lte: ISODate("2026-04-27T00:00:00Z") }
})

Unwind to get individual readings:

db.sensor_data.aggregate([
  { $match: { "sensor_id": "sensor01", "bucket_start": { $gte: ISODate("2026-04-26T10:00:00Z") } } },
  { $unwind: "$readings" },
  { $match: { "readings.t": { $gte: ISODate("2026-04-26T10:05:00Z"), $lte: ISODate("2026-04-26T10:15:00Z") } } }
])

Pre-allocating Buckets

For predictable time-series data, pre-allocate buckets and use $push to append:

// Pre-allocate the hour bucket on first write
db.sensor_data.updateOne(
  { "sensor_id": "sensor01", "bucket_start": ISODate("2026-04-26T11:00:00Z") },
  { $setOnInsert: {
      "sensor_id": "sensor01",
      "bucket_start": ISODate("2026-04-26T11:00:00Z"),
      "bucket_end": ISODate("2026-04-26T12:00:00Z"),
      "readings": [],
      "readings_count": 0
  }},
  { upsert: true }
)
// Append reading to existing bucket
db.sensor_data.updateOne(
  { "sensor_id": "sensor01", "bucket_start": ISODate("2026-04-26T11:00:00Z") },
  { $push: { "readings": { "t": new Date(), "v": 22.8 } }, $inc: { "readings_count": 1 } }
)

Trade-offs

  • Maximum document size (16 MB) limits bucket capacity — at 100 bytes per reading, one bucket holds ~160,000 readings
  • Write amplification — each $push grows the document, forcing the storage engine to rewrite it on disk
  • Query complexity — individual point lookups require $unwind or $elemMatch
  • Choosing bucket boundaries requires domain knowledge (hourly, daily, by event count)

Real-World Use Case

IoT temperature monitoring across a warehouse. Each sensor reports every 30 seconds. 50 sensors × 2 readings/minute × 60 minutes × 24 hours = 144,000 documents/day without bucketing. With hourly buckets, that becomes 50 × 24 = 1,200 documents/day — a 99% reduction.

Approximation Pattern

Description

The Approximation Pattern stores a calculated estimate instead of an exact value. The computation for exact values is expensive or involves large scans, but the use case tolerates small errors.

Problem It Solves

Exact counts or aggregations over large collections are expensive. An approximation — computed periodically or sampled — serves the same practical purpose with a fraction of the cost.

When to Use

  • Exact precision is not required — counts within 5-10% are acceptable
  • Expensive computations like $count over millions of documents, or distinct value counts on high-cardinality fields
  • Dashboards, badges, “trending” indicators, counters that drive UI state

Schema Example

Instead of db.posts.countDocuments({ "author": "alice" }) every time the profile page loads:

// Computed approximation — refreshed daily by a scheduled job
db.users.updateOne(
  { "username": "alice" },
  { $set: { "approx_post_count": 342 } }  // real count was 347 at the last refresh
)
// Display on profile — 342 vs 347 makes no practical difference
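
The daily refresh itself can be a single aggregation. A minimal sketch, assuming a users collection with a unique index on username (required by $merge):

// Nightly job: recount posts per author and write the results back to users
db.posts.aggregate([
  { $group: { _id: "$author", approx_post_count: { $sum: 1 } } },
  { $project: { _id: 0, username: "$_id", approx_post_count: 1 } },
  { $merge: { into: "users", on: "username", whenMatched: "merge", whenNotMatched: "discard" } }
])

// Profile page read: one indexed lookup instead of a count over all posts
db.users.findOne({ "username": "alice" }, { "approx_post_count": 1 })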

Using MongoDB’s $sample for Statistical Approximation

// Approximate the average order value without scanning all orders
db.orders.aggregate([
  { $sample: { size: 10000 } },          // random sample of 10k orders
  { $group: {
      _id: null,
      approx_avg_value: { $avg: "$total" },
      approx_total_count: { $sum: 1 }
  }},
  { $project: {
      approx_avg_value: 1,
      // Scale the sample count by the known sampling fraction
      // (here assuming the full collection holds roughly 100x the 10k sample)
      approx_total_orders: { $multiply: ["$approx_total_count", 100] }
  }}
])

Trade-offs

  • Inherent inaccuracy — must know the error tolerance of your stakeholders
  • Batch update cost — if updated too frequently, the approximation loses its advantage
  • Not suitable for financial calculations, inventory counts, or any system where exact values are required by regulation

Real-World Use Case

“Trending topics” on a social platform. Exact engagement counts require scanning millions of interactions. Instead, sample a fraction of interactions, compute relative scores, and display top trending items. Whether “Azure Functions” has 12,450 or 12,847 mentions does not change whether it is trending.

Outlier Pattern

Description

The Outlier Pattern handles documents that do not fit the normal schema or access pattern. Instead of making every document handle the exceptional case — adding nullable fields, bloating indexes — isolate the outlier behavior into a separate field or separate collection.

Problem It Solves

A small percentage of documents require more fields, larger arrays, or different queries than the majority. Applying the general-purpose solution to all documents penalizes the 99% for the sake of the 1%.

When to Use

  • A small subset of documents has significantly more data than the rest (e.g., a viral post with 100k comments vs. a normal post with 20 comments)
  • Power users need additional fields that 99% of users do not use
  • Documents that would exceed the 16 MB limit if treated uniformly

Schema Example

A social media post — most posts have a handful of comments, but a viral post has hundreds of thousands. Instead of embedding all comments (which would cause the document to grow unboundedly), store normal posts with embedded comments and outlier posts with a comment reference:

// Normal post — comments embedded (99.9% of posts)
db.posts.insertOne({
  "_id": ObjectId("post001"),
  "title": "My first day with Go",
  "content": "I started learning Go today...",
  "comments": [
    { "author": "dev_guru", "text": "Welcome to Go!", "created_at": ISODate("2026-04-26") },
    { "author": "rust_fan", "text": "You should try Rust too.", "created_at": ISODate("2026-04-26") }
  ],
  "comment_count": 2
})

// Viral post — outlier detected, comments stored separately
db.posts.insertOne({
  "_id": ObjectId("post002"),
  "title": "How I deployed to production on a Friday",
  "content": "It was a calm afternoon...",
  "comments": [],                         // no embedded comments
  "comment_count": 143892,
  "outlier": true,                        // flag for the application
  "comments_ref": "post002_comments"      // reference to separate collection
})

// Separate collection for outlier comments
db.post002_comments.insertOne({
  "post_id": ObjectId("post002"),
  "author": "sre_ninja",
  "text": "This is why we have change advisory boards.",
  "created_at": ISODate("2026-04-26")
})

Application Logic for Outlier Detection

function getComments(postId) {
  const post = db.posts.findOne({ "_id": postId })
  if (post.outlier) {
    // Use paginated query for viral posts
    return db[post.comments_ref].find({ "post_id": postId })
      .sort({ "created_at": -1 })
      .limit(20)
      .toArray()
  }
  // Normal post — return embedded comments directly
  return post.comments
}
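
The write path is where a post gets promoted to outlier status. A sketch assuming an illustrative threshold of 1,000 embedded comments and the per-post overflow collection used above (concurrency is ignored for brevity):

const OUTLIER_THRESHOLD = 1000   // illustrative cutoff

function addComment(postId, comment) {
  const post = db.posts.findOne({ "_id": postId })
  if (post.outlier) {
    // Already an outlier: new comments go straight to the overflow collection
    db.getCollection(post.comments_ref).insertOne({ "post_id": postId, ...comment })
    db.posts.updateOne({ "_id": postId }, { $inc: { "comment_count": 1 } })
    return
  }
  // Normal post: embed the comment
  db.posts.updateOne(
    { "_id": postId },
    { $push: { "comments": comment }, $inc: { "comment_count": 1 } }
  )
  if (post.comment_count + 1 >= OUTLIER_THRESHOLD) {
    // Promote: move the embedded comments out and flag the document
    const overflowName = `${postId}_comments`
    const current = db.posts.findOne({ "_id": postId })
    db.getCollection(overflowName).insertMany(
      current.comments.map(c => ({ "post_id": postId, ...c }))
    )
    db.posts.updateOne(
      { "_id": postId },
      { $set: { "comments": [], "outlier": true, "comments_ref": overflowName } }
    )
  }
}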

Trade-offs

  • Application logic must check for the outlier flag on every read
  • Two different access patterns — more code paths to test
  • Outlier threshold must be tuned — set too low and most documents become “outliers,” defeating the purpose
  • Indexes on the main collection do not cover outlier data

Real-World Use Case

A GitHub repository with millions of stars. Storing all stargazer data in the repo document is impossible — even just user IDs would exceed 16 MB. Normal repos store a stargazer count only. For the top 0.1% of repos, detailed stargazer data is stored in a separate collection and accessed only when the user clicks “View stargazers.”

Decision Matrix

  • Attribute: best for polymorphic fields and shared-field queries across sparse schemas. Trade-offs: more complex queries with $elemMatch, loss of self-documenting field names. Avoid when fields are fixed and well known, with simple queries.
  • Extended Reference: best for reducing joins on frequently-read reference data. Trade-offs: data duplication, staleness risk, update propagation. Avoid when the data changes frequently or must be transactionally consistent.
  • Subset: best when the working set is too large for RAM in read-heavy workloads with cold data. Trade-offs: two-collection management, pagination overhead. Avoid when all data is accessed uniformly or writes dominate reads.
  • Computed: best for expensive aggregations on read-heavy workloads. Trade-offs: staleness window, write amplification from recomputes. Avoid when real-time accuracy is required or the workload is write-heavy.
  • Schema Versioning: best for zero-downtime schema evolution on large collections. Trade-offs: application complexity, lazy migration leaves old data unmigrated. Avoid when the collection is small enough to migrate quickly in one deploy.
  • Bucket: best for time-series data with frequent, small data points. Trade-offs: 16 MB document limit, more complex queries for individual points. Avoid when data points arrive infrequently or point queries must be low-latency.
  • Approximation: best for large-scale counts and aggregations where precision is not critical. Trade-offs: inherent inaccuracy, batch update cost. Avoid in financial or regulatory contexts and inventory management.
  • Outlier: best when a small percentage of documents carries extreme data volumes. Trade-offs: two access patterns, threshold tuning, more code paths. Avoid when most documents are outliers (rethink the schema) or data is uniformly distributed.

Practice Exercises

Exercise 1: Pattern Identification

For each scenario, identify the best pattern:

  1. A weather service collects temperature and humidity readings every 10 seconds from 10,000 sensors.
  2. An e-commerce site displays a product page with supplier name and phone. Supplier has 50 other fields stored elsewhere.
  3. A social media app shows a “trending” counter on posts — exact counts are expensive but a daily update suffices.
  4. A CRM needs to add custom fields per customer without downtime or migration scripts.
  5. A blog platform embeds comments in posts, but a viral post has 500,000 comments and causes document growth issues.

Exercise 2: Schema Design

Design a schema for a movie review platform:

  • Movies have titles, release dates, genres, and directors
  • Users write reviews with ratings (1-5) and text
  • The homepage shows the top 10 movies by average rating
  • A movie’s page shows its title, average rating, and the 5 most recent reviews
  • Some movies (blockbusters) have 50,000+ reviews; most have under 100
  • Genres and directors are fixed sets that rarely change

Which patterns would you apply and why? Write the schema documents.

Exercise 3: Aggregation Optimization

Given 10 million order documents, each with customer_id, total, and created_at, you need to show a dashboard with monthly revenue totals. The dashboard is viewed 5000 times per day. The aggregation takes 30 seconds to run. Apply the Computed Pattern — write the aggregation pipeline and the update strategy for maintaining current-month data.
