Data modeling is the most impactful design decision you make in MongoDB. Get it right and your application scales effortlessly, queries fly, and hardware stays modest. Get it wrong and you fight performance problems, write increasingly complex workarounds, and burn operational budget on unnecessary resources.
This guide covers the complete MongoDB data modeling landscape: the embed-versus-reference decision framework, nine core schema design patterns, real-world workload estimation, schema validation, IoT and e-commerce modeling patterns, and production scaling considerations. Every section includes working code examples you can adapt immediately.
The Core Decision: Embed vs Reference
Every MongoDB schema design starts with one fundamental choice: do you embed the related data inside the parent document, or do you store a reference and resolve it at query time? There is no universal right answer. The correct choice depends on your data access patterns, consistency requirements, and growth characteristics.
Decision Framework
The table below enumerates every factor that influences the embed-or-reference decision, along with guidance for each case.
| Factor | Embed When | Reference When |
|---|---|---|
| Data access pattern | Child data is almost always read with the parent | Child data is queried independently from the parent |
| Document growth | Subdocuments have bounded growth (fixed array, limited entries) | Subdocuments grow without bound (unlimited comments, log entries) |
| Data locality | You need sub-5ms reads that include related data | Network round trips for joins are acceptable |
| Atomicity | Updates to parent and child must be atomic and isolated | You can tolerate eventual consistency across collections |
| Write frequency | Child data is read-heavy, write-light | Child data is written every second independently |
| Document size | Total document stays under 16 MB with headroom | Embedded data would push the document past 4-8 MB |
| Duplication cost | Duplicated data changes rarely or never | Duplicated data changes frequently and must be consistent |
| Query simplicity | You want single-query reads with no joins | You prefer normalized storage and accept application-level joins |
| Index efficiency | You want to index on fields across parent and child | Your access patterns benefit from collection-level indexes |
| Sharding | Child data must live on the same shard as the parent | Child documents need their own shard key distribution |
When three or more factors point in the same direction, the decision is clear. When factors are evenly split, start with embedding and extract to a separate collection only when you hit a concrete bottleneck.
Embedded Document Pattern
Embedding stores related data directly inside the parent document. This is the natural starting point for MongoDB schemas because it provides single-operation reads and atomic updates.
// Embedded pattern: order contains line items directly
db.orders.insertOne({
orderId: "ORD-2026-001",
customerId: "CUST-4382",
orderDate: new Date("2026-04-20T14:30:00Z"),
status: "shipped",
items: [
{ sku: "MONGO-BOOK", title: "MongoDB Data Modeling Guide", qty: 1, price: 49.99 },
{ sku: "MONGO-TEE", title: "MongoDB Logo T-Shirt", qty: 2, price: 24.99 }
],
shippingAddress: {
street: "55 Main Street",
city: "New York",
state: "NY",
zip: "10001"
},
total: 99.97
})
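Because the line items live inside the order document, one update statement can change parent and child fields atomically. A minimal sketch, using the positional $ operator to target the matched array element:
// Atomically bump one line item's quantity and the order total in a single operation
db.orders.updateOne(
  { orderId: "ORD-2026-001", "items.sku": "MONGO-TEE" },
  { $inc: { "items.$.qty": 1, total: 24.99 } }
)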
Referenced Document Pattern with $lookup
Reference by ObjectId when the related data is large, independently queried, or grows without bound. Use the aggregation pipeline’s $lookup stage to reassemble the data at read time.
// Orders collection (stores only the reference)
db.orders.insertOne({
orderId: "ORD-2026-001",
customerId: "CUST-4382",
orderDate: new Date("2026-04-20T14:30:00Z"),
status: "shipped",
itemIds: [ObjectId("67f8a1b2c3d4e5f6a7b8c9d0"), ObjectId("67f8a1b2c3d4e5f6a7b8c9d1")],
total: 99.97
})
// Items collection (normalized independently)
db.items.insertMany([
{ _id: ObjectId("67f8a1b2c3d4e5f6a7b8c9d0"), sku: "MONGO-BOOK", title: "MongoDB Data Modeling Guide", price: 49.99, stock: 230 },
{ _id: ObjectId("67f8a1b2c3d4e5f6a7b8c9d1"), sku: "MONGO-TEE", title: "MongoDB Logo T-Shirt", price: 24.99, stock: 540 }
])
// Reassemble with $lookup
db.orders.aggregate([
{ $match: { orderId: "ORD-2026-001" } },
{ $lookup: {
from: "items",
localField: "itemIds",
foreignField: "_id",
as: "items"
}}
])
Schema Design Patterns Summary
MongoDB University’s M320 course popularized a set of battle-tested schema design patterns, each solving a specific class of problem. The table below summarizes nine of the most widely used patterns with their use case, trade-off, and a concrete example.
| Pattern | Use Case | Trade-Off | Example |
|---|---|---|---|
| Attribute | Many similar fields, sparse data, or EAV-style models | Harder to validate; querying by attribute key requires $elemMatch on key-value pairs | Product specifications where different categories have different attributes |
| Extended Reference | Frequently repeated $lookup across collections creates read overhead | Duplicates data; write path must keep copies in sync | Inventory document embeds frequently-read order summary fields to avoid joins |
| Subset | Large documents with rarely-used fields slow down working set and I/O | Application must handle two collections and occasional refetch | Store product details in one collection and large PDF safety sheets in another |
| Computed | Expensive aggregations run repeatedly on the same data | Stale values if not updated promptly; extra write-time cost | Pre-compute daily genre popularity counts instead of scanning all orders |
| Bucket | Time-series or IoT data written at high frequency with periodic reads | Loses individual event granularity; bucket boundaries are fixed | Group sensor readings into hourly buckets instead of one document per reading |
| Schema Versioning | Documents must evolve in production without downtime | Application must handle multiple schema versions in code | Add schema_version: 2 field; new code reads old and new formats |
| Polymorphic | Documents of different shapes live in the same collection | Queries and indexes must account for multiple document types | E-commerce inventory: electronics, clothing, groceries all in one collection |
| Outlier | A small number of documents buck the normal access or size pattern | Adds complexity to handle the special-case documents | A bestselling book with 500,000 reviews while most books have fewer than 100 |
| Approximation | Exact counts are not needed and write volume is high | Value is approximate, never exact | Use a counter document updated every N writes instead of every write |
Attribute Pattern
Use the Attribute pattern when documents share the same structure but store different sets of sparse fields. Instead of dozens of nullable columns, use an array of key-value pairs.
// Product specifications with the Attribute pattern
db.products.insertOne({
sku: "CAM-1001",
category: "Camera",
attributes: [
{ key: "megapixels", value: 24.2 },
{ key: "sensor_type", value: "CMOS" },
{ key: "iso_range", value: "100-25600" },
{ key: "video_resolution", value: "4K" }
]
})
// Query all cameras with at least 20 megapixels
db.products.find({
category: "Camera",
"attributes": { $elemMatch: { key: "megapixels", value: { $gte: 20 } } }
})
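The pattern pays off at index time: one compound multikey index on the key and value fields serves queries on every attribute, instead of one index per field:
// One index covers queries on any attribute key-value pair
db.products.createIndex({ "attributes.key": 1, "attributes.value": 1 })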
Extended Reference Pattern
When a high-read-path $lookup creates latency, copy the most-accessed fields from the related collection into the parent document.
// Before: every inventory read requires a lookup on orders
// Inventory document with Extended Reference
db.inventory.insertOne({
sku: "MONGO-BOOK",
title: "MongoDB Data Modeling Guide",
price: 49.99,
stock: 230,
recentOrders: {
totalSold: 1842,
last30Days: 147,
avgRating: 4.7
}
})
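The duplicated fields must be maintained on the write path. A sketch of that sync step, assuming the field names from the document above:
// When an order for this SKU is recorded, update the copied summary fields too
db.inventory.updateOne(
  { sku: "MONGO-BOOK" },
  { $inc: { "recentOrders.totalSold": 1, "recentOrders.last30Days": 1 } }
)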
Subset Pattern
Documents that carry a mix of hot (frequently accessed) and cold (rarely accessed) fields benefit from splitting into two collections. The main collection stays small and remains in the working set.
// Inventory collection: hot fields only
db.inventory.insertOne({
sku: "CHEM-AMMONIA",
name: "Ammonium Hydroxide",
price: 14.50,
qtyOnHand: 88,
warehouse: "WH-NJ-03"
})
// Safety documents collection: cold fields with large PDF metadata
db.safetyDocs.insertOne({
sku: "CHEM-AMMONIA",
sdsUrl: "/pdf/msds-ammonium-hydroxide.pdf",
sdsSize: 2840000,
storageClass: "Flammable",
handlingNotes: "Store below 25°C in ventilated area",
lastReviewed: new Date("2026-03-15")
})
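The hot path never touches the cold collection; the rare cold path issues a second keyed query:
// Cold path: fetch the safety sheet only when a user explicitly requests it
db.safetyDocs.findOne({ sku: "CHEM-AMMONIA" })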
Schema Validation in MongoDB
MongoDB’s flexible schema does not mean no schema. Use $jsonSchema to enforce document structure, field types, required fields, and value ranges at the database level. Validation runs on every insert and update.
// Create a validated collection
db.createCollection("orders", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["orderId", "customerId", "items", "total"],
properties: {
orderId: {
bsonType: "string",
description: "must be a string and is required"
},
customerId: {
bsonType: "string",
description: "must be a string and is required"
},
total: {
bsonType: "number",
minimum: 0,
description: "must be a non-negative number"
},
status: {
enum: ["pending", "processing", "shipped", "delivered", "cancelled"],
description: "must be one of the enumerated status values"
},
items: {
bsonType: "array",
minItems: 1,
items: {
bsonType: "object",
required: ["sku", "qty", "price"],
properties: {
sku: { bsonType: "string" },
qty: { bsonType: "int", minimum: 1 },
price: { bsonType: "number", minimum: 0 }
}
}
}
}
}
},
validationAction: "error"
})
// This insert succeeds (NumberInt keeps qty a BSON int; mongosh numbers default to double, which the validator's "int" type would reject)
db.orders.insertOne({
orderId: "ORD-2026-002",
customerId: "CUST-8192",
items: [{ sku: "MONGO-BOOK", qty: NumberInt(1), price: 49.99 }],
total: 49.99,
status: "pending"
})
// This insert fails validation (negative total)
db.orders.insertOne({
orderId: "ORD-2026-003",
customerId: "CUST-8192",
items: [{ sku: "MONGO-BOOK", qty: NumberInt(1), price: 49.99 }],
total: -10
})
Validation actions and levels control how MongoDB responds to non-conforming documents:
- validationAction: "error" rejects non-conforming writes.
- validationAction: "warn" allows the write but logs a warning. Use it while migrating an unvalidated collection to a validated schema, as shown below.
- validationLevel: "strict" applies validation to all inserts and updates. This is the default.
- validationLevel: "moderate" applies validation to inserts and to updates of documents that already satisfy the validator, skipping legacy documents. Useful for gradual enforcement across an existing collection.
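A minimal sketch of that migration step: attach a validator to the existing collection with collMod in warn mode, then tighten to error once the logs come back clean (validator abbreviated here):
// Attach validation to an existing collection without rejecting legacy writes
db.runCommand({
  collMod: "orders",
  validator: { $jsonSchema: { bsonType: "object", required: ["orderId", "customerId"] } },
  validationAction: "warn",
  validationLevel: "moderate"
})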
Real-World Workload Estimation
Before you design indexes or choose a shard key, quantify your workload. Use concrete numbers to determine write throughput, storage sizing, and whether sharding is necessary.
Consider a navigation application with 10 million active devices, each sending 100 bytes every minute. Peak usage climbs to 50 million devices. Data must be retained for one year.
# Storage calculation (60 * 24 * 365 = 525,600 minutes/year)
10,000,000 devices * 100 bytes/min * 525,600 min/year = 5.256 * 10^14 bytes
Rounded: 530 terabytes
# Average write rate (one write per device per minute)
10,000,000 writes/min / 60 s/min = 166,667 writes/second
Rounded: 170,000 writes/second
# Peak write rate
50,000,000 writes/min / 60 s/min = 833,333 writes/second
Rounded: 830,000 writes/second
At 170,000 average writes per second and 530 TB of storage, this workload demands careful index planning and almost certainly requires sharding. A single mongod instance cannot sustain that write volume while maintaining a reasonable working set. The shard key must distribute writes evenly across all shards, avoiding any single shard becoming a bottleneck.
// Index for the workload: compound index covering common query patterns
// (no background option needed: index builds are non-blocking since MongoDB 4.2)
db.locations.createIndex({ deviceId: 1, timestamp: -1 })
// TTL index to automatically purge data older than 365 days
db.locations.createIndex(
{ timestamp: 1 },
{ expireAfterSeconds: 31536000 }
)
Production Scaling Considerations
Shard Key Selection
Choose a shard key that provides write distribution and supports your most frequent query patterns. Avoid monotonically increasing shard keys like {timestamp: 1} because all new writes hit the last shard. Prefer hashed shard keys or compound keys that include a high-cardinality field.
// Hashed shard key for even write distribution
sh.shardCollection("nav.locationUpdates", { deviceId: "hashed" })
// Compound shard key that supports range queries and distributes writes
// (ranged shard key fields must be ascending or hashed; descending is not allowed)
sh.shardCollection("nav.locationUpdates", { deviceId: 1, timestamp: 1 })
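After sharding, verify that data actually spreads across shards; mongosh's built-in helper reports per-shard distribution:
// Per-shard document counts and data sizes for the collection
db.getSiblingDB("nav").locationUpdates.getShardDistribution()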
Working Set Sizing
The working set is the portion of data accessed most frequently. If the working set fits in RAM, MongoDB serves reads from memory and delivers sub-millisecond latency. If it exceeds RAM, MongoDB pages data from disk, increasing latency by orders of magnitude.
// Check working set size via serverStatus
db.adminCommand({ serverStatus: 1 }).wiredTiger.cache
// Monitor page faults
db.serverStatus().extra_info.page_faults
When the working set grows beyond available RAM, consider:
- Adding more memory to the replica set members
- Using the Subset pattern to shrink document sizes
- Adding indexes to reduce the number of documents scanned per query
- Sharding to distribute the working set across multiple nodes
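A rough sketch of the RAM check itself, reading the WiredTiger cache statistics from serverStatus (the quoted keys are the statistic names WiredTiger reports):
// Sustained usage near the configured maximum suggests the working set no longer fits
const cache = db.serverStatus().wiredTiger.cache
const usedGB = cache["bytes currently in the cache"] / 1e9
const maxGB = cache["maximum bytes configured"] / 1e9
print(`WiredTiger cache: ${usedGB.toFixed(1)} GB used of ${maxGB.toFixed(1)} GB`)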
Index Planning
Each index on a collection adds write overhead proportional to the number of index entries. Index planning in MongoDB is a balance: enough indexes to serve queries efficiently, few enough that writes do not degrade.
// Single-field index for equality lookups
db.orders.createIndex({ customerId: 1 })
// Compound index that covers the query (no document fetch needed)
db.orders.createIndex(
{ customerId: 1, orderDate: -1, total: 1 },
{ name: "customer_orders_recent" }
)
// Text index for full-text search on product descriptions
db.products.createIndex(
{ title: "text", description: "text" },
{ weights: { title: 10, description: 5 } }
)
Use explain() to validate that your queries use the expected index:
db.orders.find({ customerId: "CUST-4382", orderDate: { $gte: ISODate("2026-01-01") } })
.sort({ orderDate: -1 })
.explain("executionStats")
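Two signals to check in the output: the winning plan should contain an IXSCAN stage rather than a COLLSCAN, and totalDocsExamined should stay close to nReturned. A small sketch:
// Ratio of documents examined to documents returned approximates index selectivity
const res = db.orders.find({ customerId: "CUST-4382" })
  .sort({ orderDate: -1 })
  .explain("executionStats")
const s = res.executionStats
print(`${s.totalDocsExamined} examined / ${s.nReturned} returned`)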
Data Modeling for IoT and Time-Series
IoT workloads share a common profile: high-frequency writes from many devices, periodic reads over time windows, and a need to purge old data efficiently. The Bucket pattern is the standard solution.
Instead of one document per reading, batch readings into fixed-time buckets. This reduces the total document count, improves index efficiency, and makes TTL-based expiration straightforward.
// Without bucketing: one document per reading (~260 million docs/month for 1,000 devices reporting every 10 seconds)
// With bucketing: hourly buckets (720 buckets/month per device, a 360x reduction in document count)
// Insert an hourly bucket
db.lightSensorReadings.updateOne(
{
deviceId: "LIGHT-FL02-ZONE4",
bucket: "2026-04-20-14"
},
{
$setOnInsert: {
deviceId: "LIGHT-FL02-ZONE4",
bucket: "2026-04-20-14",
createdAt: new Date("2026-04-20T14:00:00Z"),
metadata: { floor: 2, zone: "Zone 4", type: "LED-PIR" }
},
$push: {
readings: {
$each: [
{ ts: new Date("2026-04-20T14:00:10Z"), value: 320 },
{ ts: new Date("2026-04-20T14:00:20Z"), value: 315 },
{ ts: new Date("2026-04-20T14:00:30Z"), value: 310 }
]
}
},
$inc: { readingCount: 3 }
},
{ upsert: true }
)
// Create TTL index on the bucket creation time
// Data older than 5 years is automatically deleted
db.lightSensorReadings.createIndex(
{ createdAt: 1 },
{ expireAfterSeconds: 157680000 }
)
// Hourly report: aggregate readings within a bucket
// Management wants hourly summaries, not per-second data
db.lightSensorReadings.aggregate([
{ $match: { deviceId: "LIGHT-FL02-ZONE4", bucket: { $gte: "2026-04-20", $lt: "2026-04-21" } } },
{ $unwind: "$readings" },
{ $group: {
_id: { hour: { $substr: ["$bucket", 11, 2] } },
avgLux: { $avg: "$readings.value" },
minLux: { $min: "$readings.value" },
maxLux: { $max: "$readings.value" },
sampleCount: { $sum: 1 }
}},
{ $sort: { "_id.hour": 1 } }
])
Data Modeling for E-Commerce
E-commerce catalogs carry products of wildly different types — electronics, clothing, groceries, hardware — each with its own set of attributes. The Polymorphic pattern handles this naturally by allowing documents of different shapes to coexist in the same collection.
// Polymorphic inventory collection: diverse products in one collection
db.inventory.insertMany([
{
type: "electronics",
sku: "ELEC-001",
name: "Wireless Bluetooth Headphones",
price: 89.99,
brand: "SoundMax",
specifications: {
batteryLife: "30 hours",
bluetoothVersion: "5.3",
driverSize: "40mm"
},
warrantyMonths: 24
},
{
type: "clothing",
sku: "CLTH-422",
name: "Merino Wool Base Layer",
price: 79.99,
brand: "TrailGear",
specifications: {
material: "100% Merino Wool",
sizes: ["S", "M", "L", "XL"],
color: "Charcoal",
careInstructions: "Machine wash cold, lay flat to dry"
}
},
{
type: "groceries",
sku: "GROC-887",
name: "Organic Cold-Pressed Olive Oil",
price: 16.99,
brand: "Mediterra",
specifications: {
volume: "750ml",
origin: "Greece",
organic: true,
expirationDate: new Date("2027-06-15")
}
}
])
// Create indexes that work across polymorphic documents
db.inventory.createIndex({ type: 1, "specifications.material": 1 })
db.inventory.createIndex({ type: 1, price: 1 })
db.inventory.createIndex({ name: "text", "specifications.material": "text" })
The Polymorphic pattern also enables cross-category search. A single query can filter by type and sort by price across all product categories.
// Customers can search for all products under $50, sorted by price
db.inventory.find({ price: { $lte: 50 } }).sort({ price: 1 }).limit(20)
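The type discriminator also scopes queries to a single category while reusing the shared indexes:
// Served by the { type: 1, "specifications.material": 1 } index created above
db.inventory.find({ type: "clothing", "specifications.material": "100% Merino Wool" })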
Schema Versioning for Live Migrations
Production schemas evolve. The Schema Versioning pattern lets you roll out structural changes without taking downtime. Each document carries a schema_version field, and the application handles both old and new formats during the transition.
// Old document format
db.inventory.insertOne({
schema_version: 1,
sku: "OLD-FORMAT-001",
name: "Widget",
price: 9.99,
category: "hardware"
})
// New document format after migration
db.inventory.insertOne({
schema_version: 2,
sku: "NEW-FORMAT-001",
name: "Widget Pro",
price: 14.99,
category: "hardware",
tags: ["premium", "durable"],
variants: [
{ size: "small", sku: "NEW-FORMAT-001-S", stock: 50 },
{ size: "large", sku: "NEW-FORMAT-001-L", stock: 30 }
]
})
// Application middleware normalizes to the current format
function readProduct(sku) {
const doc = db.inventory.findOne({ sku: sku })
if (doc.schema_version === 1) {
return {
...doc,
tags: [],
variants: [{ size: "one-size", sku: doc.sku, stock: doc.stock || 0 }]
}
}
return doc
}
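A hedged sketch of the accompanying background migration: a pipeline update (MongoDB 4.2+) rewrites remaining version-1 documents in place. A production job would batch and throttle this rather than rewriting everything in one statement:
// Upgrade all remaining v1 documents to the v2 shape
db.inventory.updateMany(
  { schema_version: 1 },
  [{
    $set: {
      schema_version: 2,
      tags: [],
      variants: [{ size: "one-size", sku: "$sku", stock: { $ifNull: ["$stock", 0] } }]
    }
  }]
)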
Computed Pattern for Read-Heavy Aggregations
When the same aggregation runs on every page load, pre-compute the result at write time instead of recalculating on read.
// Instead of scanning all orders daily, maintain a computed counter
db.genrePopularity.updateOne(
{ genre: "Science Fiction" },
{
$inc: { dailyViews: 1 },
$setOnInsert: {
genre: "Science Fiction",
date: new Date("2026-04-26"),
lastComputed: new Date()
}
},
{ upsert: true }
)
// Reset daily counts at midnight using a scheduled job
db.genrePopularity.updateMany(
{},
[{ $set: { dailyViews: 0, lastComputed: new Date() } }]
)
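The read path then becomes a single indexed lookup instead of an aggregation over the entire order history:
// Dashboard read: top genres today, straight from the computed collection
db.genrePopularity.find().sort({ dailyViews: -1 }).limit(10)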
Practice Questions and Solutions
The scenarios below are adapted from the M320 course assessment. Work through each one to test your understanding.
Scenario: Data Architect Drives Schema Migration
A new Data Architect proposes a reorganization of the database schema. Your application code already supports the new structure. How do you migrate the existing data with minimum downtime?
The Schema Versioning pattern is the right tool. Add a schema_version field to every document. Deploy application code that reads both versions. Run a background migration to rewrite old-format documents incrementally.
Scenario: Daily Genre Popularity
Management needs daily genre popularity counts without running expensive queries against the entire order history. Which pattern?
The Computed Pattern. Maintain a rolling counter per genre that increments with every sale. Run a nightly reset or keep running totals partitioned by date.
Scenario: Factory IoT Sensors
A factory installs lighting sensors that report every 10 seconds. Management wants hourly reports. Data must be removable after 5 years. Which pattern?
The Bucket pattern. Group 360 readings into one hourly bucket document per sensor. Apply a TTL index on the bucket creation timestamp.
Scenario: NYC Bodega Inventory
A neighborhood convenience store chain sells electronics, groceries, hardware, and clothing — all with different attributes. Inventory items must be searchable across categories. Which pattern?
The Polymorphic pattern. Store every product in a single inventory collection with a type discriminator field and a flexible specifications subdocument.
Scenario: E-Commerce N+1 Query Problem
Every inventory item retrieval triggers additional queries on the orders collection to fetch contextual order data. The N+1 overhead causes latency spikes at peak hours. Which pattern?
The Extended Reference pattern. Copy frequently accessed order summary fields (total sold, recent 30-day count, average rating) directly into the inventory document. This eliminates the $lookup for the common read path.
Scenario: Chemical Manufacturer Slowdown
Inventory documents contain small hot fields (price, quantity, warehouse) and large cold fields (multi-megabyte Material Safety Data Sheet PDFs). At peak production hours, reads slow down because the large PDF metadata pollutes the working set. Which pattern?
The Subset pattern. Split the inventory into two collections: inventory for hot operational fields and safetyDocs for the cold PDF metadata. Only inventory needs to stay in the working set.
Scenario: Workload Quantification
A navigation app has 10 million active devices, each sending 100 bytes/min. Peak reaches 50 million devices. Retention is one year.
- Storage: 530 TB
- Average write rate: 170,000 writes/second
- Peak write rate: 830,000 writes/second
At these numbers, the system must use sharding with a well-chosen shard key and a working set strategy that accounts for the most recent data being accessed most frequently.
Resources
- MongoDB Documentation on Data Modeling
- MongoDB Schema Validation
- MongoDB Sharding Documentation
- Choosing a Shard Key
- MongoDB Indexing Strategy
- Analyze Query Performance
- MongoDB Transactions
- Building with Patterns: A Summary
- Schema Design Patterns Video (MongoDB World)
- MongoDB University M320: Data Modeling