Data modeling is the most impactful design decision you make in MongoDB. Get it right and your application scales effortlessly, queries fly, and hardware stays modest. Get it wrong and you fight performance problems, write increasingly complex workarounds, and burn operational budget on unnecessary resources.
This guide covers the complete MongoDB data modeling landscape: the embed-versus-reference decision framework, nine core schema design patterns, real-world workload estimation, schema validation, IoT and e-commerce modeling patterns, and production scaling considerations. Every section includes working code examples you can adapt immediately.
The Core Decision: Embed vs Reference
Every MongoDB schema design starts with one fundamental choice: do you embed the related data inside the parent document, or do you store a reference and resolve it at query time? There is no universal right answer. The correct choice depends on your data access patterns, consistency requirements, and growth characteristics.
Decision Framework
The table below enumerates every factor that influences the embed-or-reference decision, along with guidance for each case.
| Factor | Embed When | Reference When |
|---|---|---|
| Data access pattern | Child data is almost always read with the parent | Child data is queried independently from the parent |
| Document growth | Subdocuments have bounded growth (fixed array, limited entries) | Subdocuments grow without bound (unlimited comments, log entries) |
| Data locality | You need sub-5ms reads that include related data | Network round trips for joins are acceptable |
| Atomicity | Updates to parent and child must be atomic and isolated | You can tolerate eventual consistency across collections |
| Write frequency | Child data is read-heavy, write-light | Child data is written every second independently |
| Document size | Total document stays under 16 MB with headroom | Embedded data would push the document past 4-8 MB |
| Duplication cost | Duplicated data changes rarely or never | Duplicated data changes frequently and must be consistent |
| Query simplicity | You want single-query reads with no joins | You prefer normalized storage and accept application-level joins |
| Index efficiency | You want to index on fields across parent and child | Your access patterns benefit from collection-level indexes |
| Sharding | Child data must live on the same shard as the parent | Child documents need their own shard key distribution |
When three or more factors point in the same direction, the decision is clear. When factors are evenly split, start with embedding and extract to a separate collection only when you hit a concrete bottleneck.
Embedded Document Pattern
Embedding stores related data directly inside the parent document. This is the natural starting point for MongoDB schemas because it provides single-operation reads and atomic updates.
// Embedded pattern: order contains line items directly
db.orders.insertOne({
orderId: "ORD-2026-001",
customerId: "CUST-4382",
orderDate: new Date("2026-04-20T14:30:00Z"),
status: "shipped",
items: [
{ sku: "MONGO-BOOK", title: "MongoDB Data Modeling Guide", qty: 1, price: 49.99 },
{ sku: "MONGO-TEE", title: "MongoDB Logo T-Shirt", qty: 2, price: 24.99 }
],
shippingAddress: {
street: "55 Main Street",
city: "New York",
state: "NY",
zip: "10001"
},
total: 99.97
})
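Because the line items live inside the order document, one update statement can change parent and child fields atomically. A minimal sketch, using the positional $ operator to target the matched array element:
// Atomically bump one line item's quantity and the order total in a single operation
db.orders.updateOne(
  { orderId: "ORD-2026-001", "items.sku": "MONGO-TEE" },
  { $inc: { "items.$.qty": 1, total: 24.99 } }
)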
Referenced Document Pattern with $lookup
Reference by ObjectId when the related data is large, independently queried, or grows without bound. Use the aggregation pipeline’s $lookup stage to reassemble the data at read time.
// Orders collection (stores only the reference)
db.orders.insertOne({
orderId: "ORD-2026-001",
customerId: "CUST-4382",
orderDate: new Date("2026-04-20T14:30:00Z"),
status: "shipped",
itemIds: [ObjectId("67f8a1b2c3d4e5f6a7b8c9d0"), ObjectId("67f8a1b2c3d4e5f6a7b8c9d1")],
total: 99.97
})
// Items collection (normalized independently)
db.items.insertMany([
{ _id: ObjectId("67f8a1b2c3d4e5f6a7b8c9d0"), sku: "MONGO-BOOK", title: "MongoDB Data Modeling Guide", price: 49.99, stock: 230 },
{ _id: ObjectId("67f8a1b2c3d4e5f6a7b8c9d1"), sku: "MONGO-TEE", title: "MongoDB Logo T-Shirt", price: 24.99, stock: 540 }
])
// Reassemble with $lookup
db.orders.aggregate([
{ $match: { orderId: "ORD-2026-001" } },
{ $lookup: {
from: "items",
localField: "itemIds",
foreignField: "_id",
as: "items"
}}
])
Schema Design Patterns Summary
MongoDB University’s M320 course popularized a set of battle-tested schema design patterns, each solving a specific class of problem. The table below summarizes nine of the most widely used patterns with their use case, trade-off, and a concrete example.
| Pattern | Use Case | Trade-Off | Example |
|---|---|---|---|
| Attribute | Many similar fields, sparse data, or EAV-style models | Harder to validate; querying by attribute key requires $elemMatch on key-value pairs | Product specifications where different categories have different attributes |
| Extended Reference | Frequently repeated $lookup across collections creates read overhead | Duplicates data; write path must keep copies in sync | Inventory document embeds frequently-read order summary fields to avoid joins |
| Subset | Large documents with rarely-used fields slow down working set and I/O | Application must handle two collections and occasional refetch | Store product details in one collection and large PDF safety sheets in another |
| Computed | Expensive aggregations run repeatedly on the same data | Stale values if not updated promptly; extra write-time cost | Pre-compute daily genre popularity counts instead of scanning all orders |
| Bucket | Time-series or IoT data written at high frequency with periodic reads | Loses individual event granularity; bucket boundaries are fixed | Group sensor readings into hourly buckets instead of one document per reading |
| Schema Versioning | Documents must evolve in production without downtime | Application must handle multiple schema versions in code | Add schema_version: 2 field; new code reads old and new formats |
| Polymorphic | Documents of different shapes live in the same collection | Queries and indexes must account for multiple document types | E-commerce inventory: electronics, clothing, groceries all in one collection |
| Outlier | A small number of documents buck the normal access or size pattern | Adds complexity to handle the special-case documents | A bestselling book with 500,000 reviews while most books have fewer than 100 |
| Approximation | Exact counts are not needed and write volume is high | Value is approximate, never exact | Use a counter document updated every N writes instead of every write |
Attribute Pattern
Use the Attribute pattern when documents share the same structure but store different sets of sparse fields. Instead of dozens of nullable columns, use an array of key-value pairs.
// Product specifications with the Attribute pattern
db.products.insertOne({
sku: "CAM-1001",
category: "Camera",
attributes: [
{ key: "megapixels", value: 24.2 },
{ key: "sensor_type", value: "CMOS" },
{ key: "iso_range", value: "100-25600" },
{ key: "video_resolution", value: "4K" }
]
})
// Query all cameras with at least 20 megapixels
db.products.find({
category: "Camera",
"attributes": { $elemMatch: { key: "megapixels", value: { $gte: 20 } } }
})
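The pattern pays off at index time: one compound multikey index on the key and value fields serves queries on every attribute, instead of one index per field:
// One index covers queries on any attribute key-value pair
db.products.createIndex({ "attributes.key": 1, "attributes.value": 1 })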
Extended Reference Pattern
When a high-read-path $lookup creates latency, copy the most-accessed fields from the related collection into the parent document.
// Before: every inventory read requires a lookup on orders
// Inventory document with Extended Reference
db.inventory.insertOne({
sku: "MONGO-BOOK",
title: "MongoDB Data Modeling Guide",
price: 49.99,
stock: 230,
recentOrders: {
totalSold: 1842,
last30Days: 147,
avgRating: 4.7
}
})
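The duplicated fields must be maintained on the write path. A sketch of that sync step, assuming the field names from the document above:
// When an order for this SKU is recorded, update the copied summary fields too
db.inventory.updateOne(
  { sku: "MONGO-BOOK" },
  { $inc: { "recentOrders.totalSold": 1, "recentOrders.last30Days": 1 } }
)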
Subset Pattern
Documents that carry a mix of hot (frequently accessed) and cold (rarely accessed) fields benefit from splitting into two collections. The main collection stays small and remains in the working set.
// Inventory collection: hot fields only
db.inventory.insertOne({
sku: "CHEM-AMMONIA",
name: "Ammonium Hydroxide",
price: 14.50,
qtyOnHand: 88,
warehouse: "WH-NJ-03"
})
// Safety documents collection: cold fields with large PDF metadata
db.safetyDocs.insertOne({
sku: "CHEM-AMMONIA",
sdsUrl: "/pdf/msds-ammonium-hydroxide.pdf",
sdsSize: 2840000,
storageClass: "Flammable",
handlingNotes: "Store below 25°C in ventilated area",
lastReviewed: new Date("2026-03-15")
})
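The hot path never touches the cold collection; the rare cold path issues a second keyed query:
// Cold path: fetch the safety sheet only when a user explicitly requests it
db.safetyDocs.findOne({ sku: "CHEM-AMMONIA" })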
Schema Validation in MongoDB
MongoDB’s flexible schema does not mean no schema. Use $jsonSchema to enforce document structure, field types, required fields, and value ranges at the database level. Validation runs on every insert and update.
// Create a validated collection
db.createCollection("orders", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["orderId", "customerId", "items", "total"],
properties: {
orderId: {
bsonType: "string",
description: "must be a string and is required"
},
customerId: {
bsonType: "string",
description: "must be a string and is required"
},
total: {
bsonType: "number",
minimum: 0,
description: "must be a non-negative number"
},
status: {
enum: ["pending", "processing", "shipped", "delivered", "cancelled"],
description: "must be one of the enumerated status values"
},
items: {
bsonType: "array",
minItems: 1,
items: {
bsonType: "object",
required: ["sku", "qty", "price"],
properties: {
sku: { bsonType: "string" },
qty: { bsonType: "int", minimum: 1 },
price: { bsonType: "number", minimum: 0 }
}
}
}
}
}
},
validationAction: "error"
})
// This insert succeeds (NumberInt keeps qty a BSON int; mongosh numbers default to double, which the validator's "int" type would reject)
db.orders.insertOne({
orderId: "ORD-2026-002",
customerId: "CUST-8192",
items: [{ sku: "MONGO-BOOK", qty: NumberInt(1), price: 49.99 }],
total: 49.99,
status: "pending"
})
// This insert fails validation (negative total)
db.orders.insertOne({
orderId: "ORD-2026-003",
customerId: "CUST-8192",
items: [{ sku: "MONGO-BOOK", qty: NumberInt(1), price: 49.99 }],
total: -10
})
Validation actions and levels control how MongoDB responds to non-conforming documents:
- validationAction: "error" rejects non-conforming writes.
- validationAction: "warn" allows the write but logs a warning. Use it while migrating an unvalidated collection to a validated schema, as shown below.
- validationLevel: "strict" applies validation to all inserts and updates. This is the default.
- validationLevel: "moderate" applies validation to inserts and to updates of documents that already satisfy the validator, skipping legacy documents. Useful for gradual enforcement across an existing collection.
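A minimal sketch of that migration step: attach a validator to the existing collection with collMod in warn mode, then tighten to error once the logs come back clean (validator abbreviated here):
// Attach validation to an existing collection without rejecting legacy writes
db.runCommand({
  collMod: "orders",
  validator: { $jsonSchema: { bsonType: "object", required: ["orderId", "customerId"] } },
  validationAction: "warn",
  validationLevel: "moderate"
})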
Real-World Workload Estimation
Before you design indexes or choose a shard key, quantify your workload. Use concrete numbers to determine write throughput, storage sizing, and whether sharding is necessary.
Consider a navigation application with 10 million active devices, each sending 100 bytes every minute. Peak usage climbs to 50 million devices. Data must be retained for one year.
# Storage calculation (60 * 24 * 365 = 525,600 minutes/year)
10,000,000 devices * 100 bytes/min * 525,600 min/year = 5.256 * 10^14 bytes
Rounded: 530 terabytes
# Average write rate (one write per device per minute)
10,000,000 writes/min / 60 s/min = 166,667 writes/second
Rounded: 170,000 writes/second
# Peak write rate
50,000,000 writes/min / 60 s/min = 833,333 writes/second
Rounded: 830,000 writes/second
At 170,000 average writes per second and 530 TB of storage, this workload demands careful index planning and almost certainly requires sharding. A single mongod instance cannot sustain that write volume while maintaining a reasonable working set. The shard key must distribute writes evenly across all shards, avoiding any single shard becoming a bottleneck.
// Index for the workload: compound index covering common query patterns
// (no background option needed: index builds are non-blocking since MongoDB 4.2)
db.locations.createIndex({ deviceId: 1, timestamp: -1 })
// TTL index to automatically purge data older than 365 days
db.locations.createIndex(
{ timestamp: 1 },
{ expireAfterSeconds: 31536000 }
)
Production Scaling Considerations
Shard Key Selection
Choose a shard key that provides write distribution and supports your most frequent query patterns. Avoid monotonically increasing shard keys like {timestamp: 1} because all new writes hit the last shard. Prefer hashed shard keys or compound keys that include a high-cardinality field.
// Hashed shard key for even write distribution
sh.shardCollection("nav.locationUpdates", { deviceId: "hashed" })
// Compound shard key that supports range queries and distributes writes
// (ranged shard key fields must be ascending or hashed; descending is not allowed)
sh.shardCollection("nav.locationUpdates", { deviceId: 1, timestamp: 1 })
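After sharding, verify that data actually spreads across shards; mongosh's built-in helper reports per-shard distribution:
// Per-shard document counts and data sizes for the collection
db.getSiblingDB("nav").locationUpdates.getShardDistribution()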
Working Set Sizing
The working set is the portion of data accessed most frequently. If the working set fits in RAM, MongoDB serves reads from memory and delivers sub-millisecond latency. If it exceeds RAM, MongoDB pages data from disk, increasing latency by orders of magnitude.
// Check working set size via serverStatus
db.adminCommand({ serverStatus: 1 }).wiredTiger.cache
// Monitor page faults
db.serverStatus().extra_info.page_faults
When the working set grows beyond available RAM, consider:
- Adding more memory to the replica set members
- Using the Subset pattern to shrink document sizes
- Adding indexes to reduce the number of documents scanned per query
- Sharding to distribute the working set across multiple nodes
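A rough sketch of the RAM check itself, reading the WiredTiger cache statistics from serverStatus (the quoted keys are the statistic names WiredTiger reports):
// Sustained usage near the configured maximum suggests the working set no longer fits
const cache = db.serverStatus().wiredTiger.cache
const usedGB = cache["bytes currently in the cache"] / 1e9
const maxGB = cache["maximum bytes configured"] / 1e9
print(`WiredTiger cache: ${usedGB.toFixed(1)} GB used of ${maxGB.toFixed(1)} GB`)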
Index Planning
Each index on a collection adds write overhead proportional to the number of index entries. Index planning in MongoDB is a balance: enough indexes to serve queries efficiently, few enough that writes do not degrade.
// Single-field index for equality lookups
db.orders.createIndex({ customerId: 1 })
// Compound index that covers the query (no document fetch needed)
db.orders.createIndex(
{ customerId: 1, orderDate: -1, total: 1 },
{ name: "customer_orders_recent" }
)
// Text index for full-text search on product descriptions
db.products.createIndex(
{ title: "text", description: "text" },
{ weights: { title: 10, description: 5 } }
)
Use explain() to validate that your queries use the expected index:
db.orders.find({ customerId: "CUST-4382", orderDate: { $gte: ISODate("2026-01-01") } })
.sort({ orderDate: -1 })
.explain("executionStats")
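Two signals to check in the output: the winning plan should contain an IXSCAN stage rather than a COLLSCAN, and totalDocsExamined should stay close to nReturned. A small sketch:
// Ratio of documents examined to documents returned approximates index selectivity
const res = db.orders.find({ customerId: "CUST-4382" })
  .sort({ orderDate: -1 })
  .explain("executionStats")
const s = res.executionStats
print(`${s.totalDocsExamined} examined / ${s.nReturned} returned`)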
Data Modeling for IoT and Time-Series
IoT workloads share a common profile: high-frequency writes from many devices, periodic reads over time windows, and a need to purge old data efficiently. The Bucket pattern is the standard solution.
Instead of one document per reading, batch readings into fixed-time buckets. This reduces the total document count, improves index efficiency, and makes TTL-based expiration straightforward.
// Without bucketing: one document per reading (~260 million docs/month for 1,000 devices reporting every 10 seconds)
// With bucketing: hourly buckets (720 buckets/month per device, a 360x reduction in document count)
// Insert an hourly bucket
db.lightSensorReadings.updateOne(
{
deviceId: "LIGHT-FL02-ZONE4",
bucket: "2026-04-20-14"
},
{
$setOnInsert: {
deviceId: "LIGHT-FL02-ZONE4",
bucket: "2026-04-20-14",
createdAt: new Date("2026-04-20T14:00:00Z"),
metadata: { floor: 2, zone: "Zone 4", type: "LED-PIR" }
},
$push: {
readings: {
$each: [
{ ts: new Date("2026-04-20T14:00:10Z"), value: 320 },
{ ts: new Date("2026-04-20T14:00:20Z"), value: 315 },
{ ts: new Date("2026-04-20T14:00:30Z"), value: 310 }
]
}
},
$inc: { readingCount: 3 }
},
{ upsert: true }
)
// Create TTL index on the bucket creation time
// Data older than 5 years is automatically deleted
db.lightSensorReadings.createIndex(
{ createdAt: 1 },
{ expireAfterSeconds: 157680000 }
)
// Hourly report: aggregate readings within a bucket
// Management wants hourly summaries, not per-second data
db.lightSensorReadings.aggregate([
{ $match: { deviceId: "LIGHT-FL02-ZONE4", bucket: { $gte: "2026-04-20", $lt: "2026-04-21" } } },
{ $unwind: "$readings" },
{ $group: {
_id: { hour: { $substr: ["$bucket", 11, 2] } },
avgLux: { $avg: "$readings.value" },
minLux: { $min: "$readings.value" },
maxLux: { $max: "$readings.value" },
sampleCount: { $sum: 1 }
}},
{ $sort: { "_id.hour": 1 } }
])
Data Modeling for E-Commerce
E-commerce catalogs carry products of wildly different types — electronics, clothing, groceries, hardware — each with its own set of attributes. The Polymorphic pattern handles this naturally by allowing documents of different shapes to coexist in the same collection.
// Polymorphic inventory collection: diverse products in one collection
db.inventory.insertMany([
{
type: "electronics",
sku: "ELEC-001",
name: "Wireless Bluetooth Headphones",
price: 89.99,
brand: "SoundMax",
specifications: {
batteryLife: "30 hours",
bluetoothVersion: "5.3",
driverSize: "40mm"
},
warrantyMonths: 24
},
{
type: "clothing",
sku: "CLTH-422",
name: "Merino Wool Base Layer",
price: 79.99,
brand: "TrailGear",
specifications: {
material: "100% Merino Wool",
sizes: ["S", "M", "L", "XL"],
color: "Charcoal",
careInstructions: "Machine wash cold, lay flat to dry"
}
},
{
type: "groceries",
sku: "GROC-887",
name: "Organic Cold-Pressed Olive Oil",
price: 16.99,
brand: "Mediterra",
specifications: {
volume: "750ml",
origin: "Greece",
organic: true,
expirationDate: new Date("2027-06-15")
}
}
])
// Create indexes that work across polymorphic documents
db.inventory.createIndex({ type: 1, "specifications.material": 1 })
db.inventory.createIndex({ type: 1, price: 1 })
db.inventory.createIndex({ name: "text", "specifications.material": "text" })
The Polymorphic pattern also enables cross-category search. A single query can filter by type and sort by price across all product categories.
// Customers can search for all products under $50, sorted by price
db.inventory.find({ price: { $lte: 50 } }).sort({ price: 1 }).limit(20)
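The type discriminator also scopes queries to a single category while reusing the shared indexes:
// Served by the { type: 1, "specifications.material": 1 } index created above
db.inventory.find({ type: "clothing", "specifications.material": "100% Merino Wool" })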
Schema Versioning for Live Migrations
Production schemas evolve. The Schema Versioning pattern lets you roll out structural changes without taking downtime. Each document carries a schema_version field, and the application handles both old and new formats during the transition.
// Old document format
db.inventory.insertOne({
schema_version: 1,
sku: "OLD-FORMAT-001",
name: "Widget",
price: 9.99,
category: "hardware"
})
// New document format after migration
db.inventory.insertOne({
schema_version: 2,
sku: "NEW-FORMAT-001",
name: "Widget Pro",
price: 14.99,
category: "hardware",
tags: ["premium", "durable"],
variants: [
{ size: "small", sku: "NEW-FORMAT-001-S", stock: 50 },
{ size: "large", sku: "NEW-FORMAT-001-L", stock: 30 }
]
})
// Application middleware normalizes to the current format
function readProduct(sku) {
const doc = db.inventory.findOne({ sku: sku })
if (doc.schema_version === 1) {
return {
...doc,
tags: [],
variants: [{ size: "one-size", sku: doc.sku, stock: doc.stock || 0 }]
}
}
return doc
}
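A hedged sketch of the accompanying background migration: a pipeline update (MongoDB 4.2+) rewrites remaining version-1 documents in place. A production job would batch and throttle this rather than rewriting everything in one statement:
// Upgrade all remaining v1 documents to the v2 shape
db.inventory.updateMany(
  { schema_version: 1 },
  [{
    $set: {
      schema_version: 2,
      tags: [],
      variants: [{ size: "one-size", sku: "$sku", stock: { $ifNull: ["$stock", 0] } }]
    }
  }]
)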
Computed Pattern for Read-Heavy Aggregations
When the same aggregation runs on every page load, pre-compute the result at write time instead of recalculating on read.
// Instead of scanning all orders daily, maintain a computed counter
db.genrePopularity.updateOne(
{ genre: "Science Fiction" },
{
$inc: { dailyViews: 1 },
$setOnInsert: {
genre: "Science Fiction",
date: new Date("2026-04-26"),
lastComputed: new Date()
}
},
{ upsert: true }
)
// Reset daily counts at midnight using a scheduled job
db.genrePopularity.updateMany(
{},
[{ $set: { dailyViews: 0, lastComputed: new Date() } }]
)
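The read path then becomes a single indexed lookup instead of an aggregation over the entire order history:
// Dashboard read: top genres today, straight from the computed collection
db.genrePopularity.find().sort({ dailyViews: -1 }).limit(10)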
Practice Questions and Solutions
The scenarios below are adapted from the M320 course assessment. Work through each one to test your understanding.
Scenario: Data Architect Drives Schema Migration
A new Data Architect proposes a reorganization of the database schema. Your application code already supports the new structure. How do you migrate the existing data with minimum downtime?
The Schema Versioning pattern is the right tool. Add a schema_version field to every document. Deploy application code that reads both versions. Run a background migration to rewrite old-format documents incrementally.
Scenario: Daily Genre Popularity
Management needs daily genre popularity counts without running expensive queries against the entire order history. Which pattern?
The Computed Pattern. Maintain a rolling counter per genre that increments with every sale. Run a nightly reset or keep running totals partitioned by date.
Scenario: Factory IoT Sensors
A factory installs lighting sensors that report every 10 seconds. Management wants hourly reports. Data must be removable after 5 years. Which pattern?
The Bucket pattern. Group 360 readings into one hourly bucket document per sensor. Apply a TTL index on the bucket creation timestamp.
Scenario: NYC Bodega Inventory
A neighborhood convenience store chain sells electronics, groceries, hardware, and clothing — all with different attributes. Inventory items must be searchable across categories. Which pattern?
The Polymorphic pattern. Store every product in a single inventory collection with a type discriminator field and a flexible specifications subdocument.
Scenario: E-Commerce N+1 Query Problem
Every inventory item retrieval triggers additional queries on the orders collection to fetch contextual order data. The N+1 overhead causes latency spikes at peak hours. Which pattern?
The Extended Reference pattern. Copy frequently accessed order summary fields (total sold, recent 30-day count, average rating) directly into the inventory document. This eliminates the $lookup for the common read path.
Scenario: Chemical Manufacturer Slowdown
Inventory documents contain small hot fields (price, quantity, warehouse) and large cold fields (multi-megabyte Material Safety Data Sheet PDFs). At peak production hours, reads slow down because the large PDF metadata pollutes the working set. Which pattern?
The Subset pattern. Split the inventory into two collections: inventory for hot operational fields and safetyDocs for the cold PDF metadata. Only inventory needs to stay in the working set.
Scenario: Workload Quantification
A navigation app has 10 million active devices, each sending 100 bytes/min. Peak reaches 50 million devices. Retention is one year.
- Storage: 530 TB
- Average write rate: 170,000 writes/second
- Peak write rate: 830,000 writes/second
At these numbers, the system must use sharding with a well-chosen shard key and a working set strategy that accounts for the most recent data being accessed most frequently.
Resources
- MongoDB Documentation on Data Modeling
- MongoDB Schema Validation
- MongoDB Sharding Documentation
- Choosing a Shard Key
- MongoDB Indexing Strategy
- Analyze Query Performance
- MongoDB Transactions
- Building with Patterns: A Summary
- Schema Design Patterns Video (MongoDB World)
- MongoDB University M320: Data Modeling