Introduction to Data Modeling
Data modeling in MongoDB is the process of designing how data is structured, stored, and accessed. A well-designed data model leads to:
- Good performance for read and write operations
- Higher developer productivity
- Lower overall infrastructure and operational costs
The data modeling workflow follows four phases:
- Gather the requirements — Understand application data needs, query patterns, and performance goals
- Create a conceptual model — Map the requirements into an abstract representation of entities and relationships
- Apply transformation patterns — Use proven patterns to optimize the model for MongoDB’s document model
- Evolve the model — Refactor as application requirements and usage patterns change over time
Course Prerequisites
Here are some terms and references for your benefit:
MongoDB Concepts and Vocabulary
Relational Database Concepts and Vocabulary
- Table (Wikipedia Definition)
- Table (Textbook Definition)
- Entity Relationship Model (Wikipedia Definition)
- The Entity Relationship Data Model
- Crow’s Foot Notation and ERD
- Crow’s Foot Notation Definition
General Database Concepts and Definitions
- Database (Wikipedia Definition)
- Schema (Wikipedia Definition)
- Schema Short Definition
- Database Transactions (Wikipedia Definition)
- Database Transactions Short Description
- Throughput vs Latency
- NoSQL Databases
MongoDB Compass and Atlas
Data Modeling in MongoDB
MongoDB has a flexible schema, but that does not mean it has no schema. Schema refers to the logical structure of data. MongoDB does not enforce a schema by default, yet every application implicitly defines one through its read and write patterns.
ERD and UML tooling can help visualize relationships, but MongoDB data modeling starts from a different question: instead of “what entities exist?”, ask “how is the data accessed?”.
Key factors that drive data modeling decisions:
- Usage pattern — how your application accesses data
- Query access — which queries are critical to performance
- Read-to-write ratio — the proportion of read versus write operations
- Data growth — how the dataset scales over time
Document validation enforces rules at the collection level:
db.createCollection("orders", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["customerId", "items", "total"],
      properties: {
        customerId: { bsonType: "objectId" },
        total: { bsonType: "number", minimum: 0 },
        status: { enum: ["pending", "shipped", "delivered"] },
      },
    },
  },
});
To join collections, use $lookup in the aggregation pipeline.
The Document Model in MongoDB
BSON (Binary JSON) is the binary representation of JSON-like documents used to store data in MongoDB.
- MongoDB stores data as documents — self-contained records with a flexible structure
- Document fields can be values, embedded documents, or arrays of values and documents
- MongoDB is a flexible schema database that supports polymorphism
// Example document: a user profile with embedded data
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  name: {
    first: "Jane",
    last: "Smith"
  },
  email: "jane.smith@example.com",
  roles: ["admin", "editor"],
  address: {
    street: "123 Main St",
    city: "San Francisco",
    state: "CA",
    zip: "94105"
  },
  createdAt: ISODate("2026-01-15T10:30:00Z")
}
Document Model vs. Relational Model
| Aspect | MongoDB Document Model | Relational Model |
|---|---|---|
| Data unit | Document (JSON/BSON) | Row in a table |
| Schema | Flexible, per-document | Rigid, defined per table |
| Relationships | Embedded docs or references | Foreign key joins |
| Query language | JSON-like query API | SQL |
| Scaling | Horizontal (sharding) | Vertical (scale up) |
| Joins | $lookup (limited) | Native JOIN (powerful) |
| Transactions | Multi-document (since 4.0) | ACID from inception |
| Indexing | B-tree indexes on any field | B-tree indexes on columns |
| Atomicity | Per-document by default; multi-document with transactions | Per-statement; multi-statement with transactions |
When to Choose the Document Model
The document model excels when:
- Data has a natural aggregate structure — an order with line items, a blog post with comments
- The application reads and writes complete entities rather than normalized fragments
- Schema evolution is frequent — adding fields does not require migrations
- The system needs horizontal scaling for high write throughput
When the Relational Model Is Better
- Complex relationships with many-to-many joins across diverse entities
- Strict referential integrity requirements
- Highly normalized data with complex reporting queries
- Existing team expertise and tooling investment
Embedding vs. Referencing
One of the most critical decisions in MongoDB data modeling is whether to embed related data or store it in a separate collection with references.
Embedding
Embedding stores related data inside a parent document. This is the preferred approach for most MongoDB applications.
// Embedded model: order with line items included
{
  _id: ObjectId("..."),
  customerId: ObjectId("..."),
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  items: [
    { productId: ObjectId("..."), name: "Widget", qty: 2, price: 9.99 },
    { productId: ObjectId("..."), name: "Gadget", qty: 1, price: 24.99 }
  ],
  total: 44.97,
  shipping: {
    address: "123 Main St",
    carrier: "UPS",
    tracking: "1Z999AA10123456784"
  }
}
Advantages of embedding:
- Single read retrieves all related data — no joins needed
- Atomic writes for the entire aggregate — consistent updates
- Lower read latency for data that is always accessed together
- Reduced number of round trips to the database
Trade-offs:
- The 16 MB document size limit caps how much data can be embedded
- Duplicating data across documents (denormalization) requires careful management
- Updating embedded data in all parent documents can be expensive
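The cost of propagating a change to duplicated data can be made concrete. The sketch below assumes the product name is copied into each order's `items` array, as in the embedded order above, and builds the arguments for a single `updateMany` call that uses the filtered positional operator. The helper name `buildProductRenameUpdate` and the string ID `"prod001"` are hypothetical; a real call would pass an ObjectId.

```javascript
// Hypothetical helper: builds the arguments for db.orders.updateMany(...)
// that rename one product inside every order's embedded items array.
function buildProductRenameUpdate(productId, newName) {
  return {
    // Match only orders that contain the product at least once
    filter: { "items.productId": productId },
    // $[item] is the filtered positional operator, bound by arrayFilters below
    update: { $set: { "items.$[item].name": newName } },
    options: { arrayFilters: [{ "item.productId": productId }] },
  };
}

// "prod001" is a stand-in string; real data would use an ObjectId value
const rename = buildProductRenameUpdate("prod001", "Widget Pro");
console.log(JSON.stringify(rename, null, 2));

// In mongosh the pieces would be passed as:
// db.orders.updateMany(rename.filter, rename.update, rename.options)
```

Every order containing the product is rewritten, which is why this kind of update gets expensive as the number of parent documents grows.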
Referencing
Referencing stores related data in separate documents and links them by ID.
// Referenced model: order document
// IDs such as "order001" are shortened for readability;
// a real ObjectId is a 24-character hex string
{
  _id: ObjectId("order001"),
  customerId: ObjectId("cust001"),
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  total: 44.97
}
// Separate items collection
{
  _id: ObjectId("item001"),
  orderId: ObjectId("order001"),
  productId: ObjectId("prod001"),
  name: "Widget",
  qty: 2,
  price: 9.99
}
Advantages of referencing:
- No document size limit concerns — data grows independently
- No data duplication — each fact exists in one place
- Independent updates: a product name changed in one place is reflected in every order that references it
- Better for data with complex many-to-many relationships
Trade-offs:
- Requires $lookup to read related data, which adds latency
- No atomicity across documents unless you use multi-document transactions
- More queries needed to assemble complete entities
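As a rough illustration of the read-side cost, the sketch below builds an aggregation pipeline that reassembles one order with its line items from the referenced model above. It assumes the line items live in a collection named `items` and that each item carries an `orderId` field; the helper name and the string ID are hypothetical.

```javascript
// Hypothetical helper: builds a pipeline for db.orders.aggregate(...)
// that joins one order with its line items from a separate collection.
function buildOrderWithItemsPipeline(orderId) {
  return [
    // Select the single order first so the join runs on one document
    { $match: { _id: orderId } },
    {
      $lookup: {
        from: "items",           // assumed name of the line items collection
        localField: "_id",       // order._id
        foreignField: "orderId", // item.orderId
        as: "items",             // joined items land in this array field
      },
    },
  ];
}

// "order001" is a stand-in string; real data would use an ObjectId value
const pipeline = buildOrderWithItemsPipeline("order001");
console.log(JSON.stringify(pipeline, null, 2));

// In mongosh: db.orders.aggregate(pipeline)
```

An index on `orderId` in the items collection would normally accompany this query so that the join does not scan the whole collection.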
Decision Guide: Embed or Reference?
| Scenario | Recommended Approach |
|---|---|
| Data accessed together (contains relationship) | Embed |
| One-to-few relationship (< 100 related items) | Embed |
| Independent lifecycle (related data changes independently) | Reference |
| Many-to-many relationship | Reference |
| Data grows unbounded (comments on a popular post) | Reference |
| Frequently read together, infrequently updated separately | Embed |
| 1:1 relationship with tight coupling | Embed |
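The table can be read as a simple rule of thumb: any signal that points toward referencing wins, and embedding is the default otherwise. The sketch below encodes that reading. The function name, the property names, and the precedence order are assumptions made for illustration; real decisions also weigh the workload and document size.

```javascript
// Hypothetical encoding of the decision guide. Each flag describes the
// relationship between a parent document and its related data.
function recommendApproach(relationship) {
  const { unbounded, manyToMany, independentLifecycle } = relationship;
  // Any of these signals makes embedding risky or awkward
  if (unbounded || manyToMany || independentLifecycle) {
    return "reference";
  }
  // Otherwise embedding is the default, per the guide
  return "embed";
}

// Order line items: bounded, owned by the order, read together
console.log(recommendApproach({ unbounded: false, manyToMany: false, independentLifecycle: false }));
// Comments on a popular post: the array could grow without limit
console.log(recommendApproach({ unbounded: true, manyToMany: false, independentLifecycle: false }));
```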
Schema Design Principles
Principle 1: Model for Application Needs
Design your schema around how your application queries data, not around abstract entity relationships. Start by listing all application queries and their frequency.
// If the app always loads a user with their recent orders,
// embed recent orders in the user document
{
  _id: ObjectId("..."),
  name: "Jane Smith",
  recentOrders: [
    { orderId: ObjectId("..."), total: 44.97, date: ISODate("2026-03-15") },
    { orderId: ObjectId("..."), total: 19.99, date: ISODate("2026-03-10") }
  ]
}
Principle 2: Favor Embedding Unless There Is a Reason Not To
Embedding is the default choice. Only use references when embedding causes problems such as 16 MB document size violations, excessive data duplication, or complex update propagation.
Principle 3: Structure Data to Match Query Patterns
Design documents to match the shape of query results. If the application displays order summaries on a dashboard, pre-join the data so each document represents exactly one dashboard entry.
Principle 4: Use Arrays for One-to-Many With Moderate Growth
Arrays are ideal for one-to-many relationships where the array size stays manageable (up to a few hundred items).
{
  _id: ObjectId("..."),
  name: "Team Alpha",
  members: [
    { userId: ObjectId("..."), role: "lead" },
    { userId: ObjectId("..."), role: "member" }
  ]
}
Principle 5: Index According to Query Patterns
Every schema design must consider indexing. Without proper indexes, even the best schema design will perform poorly.
// Create indexes that support common queries
db.users.createIndex({ email: 1 }, { unique: true });
db.orders.createIndex({ customerId: 1, orderDate: -1 });
db.products.createIndex({ category: 1, price: 1 });
Real-World Data Modeling Examples
E-Commerce Application
An e-commerce platform needs to model products, customers, orders, and reviews.
// Products collection: embedded variants and categories
{
  _id: ObjectId("..."),
  name: "Wireless Headphones",
  description: "Noise-canceling Bluetooth headphones",
  price: 79.99,
  category: "Electronics",
  tags: ["audio", "wireless", "bluetooth"],
  variants: [
    { color: "Black", sku: "WH-BLK-001", stock: 50 },
    { color: "White", sku: "WH-WHT-001", stock: 30 }
  ],
  reviews: [
    { userId: ObjectId("..."), rating: 4, text: "Great sound", date: ISODate("2026-02-10") }
  ],
  averageRating: 4.2,
  reviewCount: 127,
  createdAt: ISODate("2025-06-01")
}
// Orders collection: embedded line items
{
  _id: ObjectId("..."),
  customerId: ObjectId("..."),
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  items: [
    { productId: ObjectId("..."), name: "Wireless Headphones", qty: 1, price: 79.99 },
    { productId: ObjectId("..."), name: "USB Cable", qty: 2, price: 5.99 }
  ],
  shipping: {
    address: "456 Oak Ave",
    city: "Portland",
    state: "OR",
    zip: "97201"
  },
  payment: {
    method: "credit_card",
    transactionId: "txn_abc123"
  },
  total: 91.97
}
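The products collection embeds variants and reviews because they are always displayed together. Because reviews can grow without bound (reviewCount is 127 in this example), a production model would typically embed only a recent subset and keep the full set in a separate collection. The orders collection embeds line items because an order always displays its items, and the number of items per order is bounded.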
Social Media Platform
A social media application handles users, posts, comments, and likes.
// Users collection
{
  _id: ObjectId("..."),
  username: "jdoe",
  displayName: "Jane Doe",
  profile: {
    bio: "Software engineer and photographer",
    avatar: "https://cdn.example.com/avatars/jdoe.jpg"
  },
  followersCount: 1240,
  followingCount: 342,
  postCount: 89
}
// Posts collection: references for high-write data
{
  _id: ObjectId("..."),
  userId: ObjectId("..."),
  content: "Just deployed my first microservice!",
  images: ["https://cdn.example.com/posts/img1.jpg"],
  likesCount: 42,
  commentsCount: 7,
  createdAt: ISODate("2026-03-15T08:30:00Z"),
  // last few comments are embedded for fast display
  recentComments: [
    {
      userId: ObjectId("..."),
      username: "msmith",
      text: "Congratulations!",
      createdAt: ISODate("2026-03-15T09:00:00Z")
    }
  ]
}
// Likes collection: referenced for scalability
{
  _id: ObjectId("..."),
  postId: ObjectId("..."),
  userId: ObjectId("..."),
  createdAt: ISODate("2026-03-15T08:35:00Z")
}
The social media model uses a hybrid approach. Recent comments are embedded for instant display while all comments and likes are stored in separate collections for scalable writes. Count fields (likesCount, commentsCount) are pre-computed to avoid counting on every read.
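One way the pre-computed counts and the embedded recent comments might be kept current is sketched below. The helper names are hypothetical, the string IDs stand in for ObjectId values, and the cap of three recent comments is an assumption, since the text only says "last few".

```javascript
// Assumed cap for the embedded recentComments array ("last few" in the text)
const RECENT_COMMENTS_LIMIT = 3;

// Hypothetical helper: a like is one insert plus one counter increment
function buildLikeOperations(postId, userId, now = new Date()) {
  return {
    likeDocument: { postId, userId, createdAt: now },
    postFilter: { _id: postId },
    postUpdate: { $inc: { likesCount: 1 } },
  };
}

// Hypothetical helper: a comment bumps the counter and keeps only the
// newest entries in recentComments via $push with $each and $slice
function buildRecentCommentUpdate(comment) {
  return {
    $inc: { commentsCount: 1 },
    $push: {
      recentComments: { $each: [comment], $slice: -RECENT_COMMENTS_LIMIT },
    },
  };
}

const likeOps = buildLikeOperations("post001", "user001");
const commentUpdate = buildRecentCommentUpdate({
  userId: "user002",
  username: "msmith",
  text: "Congratulations!",
  createdAt: new Date("2026-03-15T09:00:00Z"),
});
console.log(JSON.stringify({ likeOps, commentUpdate }, null, 2));

// In mongosh:
// db.likes.insertOne(likeOps.likeDocument)
// db.posts.updateOne(likeOps.postFilter, likeOps.postUpdate)
```

The insert and the increment are two separate writes, so the counter can drift from the true count unless both run inside a transaction or are reconciled periodically.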
IoT Sensor Data
An IoT platform collects temperature and humidity readings from thousands of sensors.
// Bucket pattern: group readings by time window
{
  _id: ObjectId("..."),
  sensorId: "sensor-abc-001",
  location: {
    building: "Building A",
    floor: 3,
    room: "Conference Room"
  },
  startDate: ISODate("2026-03-15T10:00:00Z"),
  endDate: ISODate("2026-03-15T11:00:00Z"),
  readings: [
    { timestamp: ISODate("2026-03-15T10:00:00Z"), temperature: 22.5, humidity: 45 },
    { timestamp: ISODate("2026-03-15T10:05:00Z"), temperature: 22.7, humidity: 44 },
    // ... up to 12 readings per hour at 5-minute intervals
  ],
  readingCount: 12,
  metadata: {
    model: "TempHumidityPro v2",
    firmware: "3.1.0"
  }
}
The bucket pattern groups sensor readings into hour-long documents. At 5-minute intervals each bucket holds 12 readings, so the collection has roughly one twelfth as many documents as a one-document-per-reading design, and each document stays far below the 16 MB limit. It also makes time-range queries efficient because all readings for a sensor-hour are in one document.
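A write path for hourly buckets could look like the sketch below: each reading is routed to the bucket for its sensor and hour with an upsert, so the first reading of the hour creates the document and later ones append to it. The helper name is hypothetical, and the sketch leaves out the `location` and `metadata` fields shown in the example document.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: builds the arguments for an updateOne upsert that
// appends one reading to the hourly bucket it belongs to.
function buildBucketUpsert(sensorId, reading) {
  // Truncate the reading's timestamp to the start of its UTC hour
  const startMs = Math.floor(reading.timestamp.getTime() / HOUR_MS) * HOUR_MS;
  return {
    // On insert, the equality fields in the filter are copied into the new document
    filter: { sensorId, startDate: new Date(startMs) },
    update: {
      $push: { readings: reading },
      $inc: { readingCount: 1 },
      // Written only when the upsert creates the bucket
      $setOnInsert: { endDate: new Date(startMs + HOUR_MS) },
    },
    options: { upsert: true },
  };
}

const op = buildBucketUpsert("sensor-abc-001", {
  timestamp: new Date("2026-03-15T10:05:00Z"),
  temperature: 22.7,
  humidity: 44,
});
console.log(JSON.stringify(op, null, 2));

// In mongosh, with an assumed collection name:
// db.sensorBuckets.updateOne(op.filter, op.update, op.options)
```

A compound index on `sensorId` and `startDate` would normally back the filter so that each upsert locates its bucket without a scan.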
Constraints in Data Modeling
Hardware and operational constraints shape data modeling decisions:
- RAM — Keep frequently accessed documents and indexes in RAM for fast access
- Storage — Prefer solid state drives (SSD) over hard disk drives (HDD)
- Network — Latency between application and database affects query design
- Document size — 16 MB maximum document size limits embedded data volume
- Replication — Write concern and read preference affect consistency and performance
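// Compare current WiredTiger cache usage with its configured maximum to judge whether the working set fits in RAM
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]
db.serverStatus().wiredTiger.cache["maximum bytes configured"]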
| Constraint | Impact on Modeling |
|---|---|
| RAM size | Limits how much hot data can live in memory |
| Disk speed | Affects read latency for infrequently accessed data |
| Network bandwidth | Encourages embedding to reduce round trips |
| Document size (16 MB) | Limits array and embedded document growth |
| Index size | Must fit in RAM for optimal performance |
Recap:
- The nature of your dataset and hardware defines the need to model your data
- Identify constraints and their impact to create a better model
- As your software and technological landscape change, re-evaluate and update your model accordingly
When working with MongoDB, security features, network performance, disk drive speed, and amount of RAM are all aspects you need to keep in mind. The operating system abstracts hardware differences from the database, but it does not hide their performance impact.
The Data Modeling Methodology
The MongoDB data modeling methodology follows a flexible, iterative process:
- Identify the workload — Document application queries, reads, writes, and performance requirements
- Create a conceptual model — Map entities and relationships based on the workload
- Apply patterns — Use proven patterns (embedding, referencing, bucket, computed, etc.) to optimize the model
- Test and refine — Validate against real-world performance and refactor as needed
Model for Simplicity or Performance
Modeling for Simplicity
The simplest model embeds all related data into a single document. This is easy to develop and maintain but may not perform optimally for all workloads. Use this approach for prototypes, internal tools, and applications with modest traffic.
Modeling for Performance
Performance-optimized models use a combination of embedding, referencing, pre-computation, and strategic indexing. This approach requires more design effort but delivers lower latency and higher throughput. Use this for customer-facing applications, real-time systems, and high-traffic services.
Modeling for a Mix of Simplicity and Performance
Most production systems use a mixed approach: embed data that is always accessed together, reference data that changes independently, and pre-compute values that are read frequently. This balances developer productivity with operational performance.
Summary of Modeling Approaches:
| Approach | Effort | Performance | Use Case |
|---|---|---|---|
| Simplicity | Low | Adequate | Prototypes, internal tools |
| Performance | High | Optimal | Customer-facing, real-time |
| Mixed | Medium | High | Most production applications |
Identifying the Workload
Case Study: IoT Data Platform
- Organization manages 100 million weather sensors
- Requirements:
- Collect data from all devices at 5-minute intervals
- Analyze data trends with a team of 10 data scientists
- Support real-time dashboard with sub-second query latency
- Archive historical data for long-term analysis
Workload Characteristics
| Operation | Frequency | Latency Requirement |
|---|---|---|
| Sensor data write | 100M writes / 5 min | < 10 ms per write |
| Latest reading query | 50,000 reads / sec | < 50 ms |
| Historical trend analysis | 100 queries / day | < 30 seconds |
| Alert generation | 1,000 alerts / min | < 1 second |
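Turning the write row into per-second and per-day figures makes the scale easier to reason about. The arithmetic below assumes writes are spread evenly across each 5-minute window and, for the last figure, that readings are grouped into one document per sensor per hour; the variable names are illustrative.

```javascript
// Figures taken from the case study
const SENSORS = 100 * 1000 * 1000;   // 100 million sensors
const INTERVAL_SECONDS = 5 * 60;     // one reading every 5 minutes

// Sustained write rate if writes are spread evenly over the interval
const writesPerSecond = SENSORS / INTERVAL_SECONDS; // about 333,333

// Volume per day
const readingsPerSensorPerDay = (24 * 60 * 60) / INTERVAL_SECONDS; // 288
const readingsPerDay = SENSORS * readingsPerSensorPerDay;          // 28.8 billion

// Documents per day if readings are grouped into hourly buckets
const hourlyBucketsPerDay = SENSORS * 24;                          // 2.4 billion

console.log({
  writesPerSecond: Math.round(writesPerSecond),
  readingsPerSensorPerDay,
  readingsPerDay,
  hourlyBucketsPerDay,
});
```

At roughly a third of a million writes per second, the write path dominates this workload, which is why the sensor write is the operation that drives the schema.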
Data Durability Requirements
| Data Type | Durability | Retention |
|---|---|---|
| Raw sensor readings | w: majority | 90 days |
| Aggregated hourly stats | w: majority | 2 years |
| Alert events | w: majority | 1 year |
| System logs | w: 1 | 30 days |
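Durability and retention can be expressed as a write concern on each write plus a TTL index per collection. The sketch below derives `expireAfterSeconds` from two rows of the table. The collection names and the `createdAt` field are hypothetical, and it assumes every document has a top-level date field for the TTL index to use.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Two rows from the table above; collection and field names are assumed
const retentionPolicies = [
  { collection: "sensorReadings", dateField: "createdAt", retentionDays: 90, w: "majority" },
  { collection: "systemLogs", dateField: "createdAt", retentionDays: 30, w: 1 },
];

// Builds the arguments for createIndex(keys, options) as a TTL index
function ttlIndexSpec(policy) {
  return {
    keys: { [policy.dateField]: 1 },
    options: { expireAfterSeconds: policy.retentionDays * SECONDS_PER_DAY },
  };
}

// Builds the options argument for insertOne / insertMany
function writeOptions(policy) {
  return { writeConcern: { w: policy.w } };
}

const specs = retentionPolicies.map((policy) => ({
  collection: policy.collection,
  index: ttlIndexSpec(policy),
  write: writeOptions(policy),
}));
console.log(JSON.stringify(specs, null, 2));

// In mongosh, for the first policy:
// db.sensorReadings.createIndex(specs[0].index.keys, specs[0].index.options)
// db.sensorReadings.insertOne(doc, specs[0].write)
```

TTL deletion runs as a background task, so documents may outlive their retention period by a short time.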
- Quantify and qualify your queries as much as possible
- A small set of CRUD operations typically drives the entire schema design
- Document every query with its frequency, acceptable latency, and data volume