M320: Chapter 1: Introduction

Introduction to Data Modeling

Data modeling in MongoDB is the process of designing how data is structured, stored, and accessed. A well-designed data model leads to:

Good performance for read and write operations
Higher developer productivity
Lower overall infrastructure and operational costs

The data modeling workflow follows four phases:

Gather the requirements — Understand application data needs, query patterns, and performance goals
Create a conceptual model — Map the requirements into an abstract representation of entities and relationships
Apply transformation patterns — Use proven patterns to optimize the model for MongoDB’s document model
Evolve the model — Refactor as application requirements and usage patterns change over time

Course Prerequisites

Here are some terms and references for your benefit:

MongoDB Concepts and Vocabulary

Relational Database Concepts and Vocabulary

General Database Concepts and Definitions

MongoDB Compass and Atlas

Data Modeling in MongoDB

MongoDB has a flexible schema, which is not the same as having no schema. Schema refers to the logical structure of data. While MongoDB does not enforce a schema by default, every application implicitly defines one through its read and write patterns.

ERD and UML tooling can help visualize relationships, but MongoDB data modeling starts from a different question: instead of “what entities exist?”, ask “how is the data accessed?”.

Key factors that drive data modeling decisions:

Usage pattern — how your application accesses data
Query access — which queries are critical to performance
Read-to-write ratio — the proportion of read versus write operations
Data growth — how the dataset scales over time

Document validation enforces rules at the collection level:

db.createCollection("orders", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["customerId", "items", "total"],
      properties: {
        customerId: { bsonType: "objectId" },
        total: { bsonType: "number", minimum: 0 },
        status: { enum: ["pending", "shipped", "delivered"] },
      },
    },
  },
});
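
To see what these rules accept and reject, the sketch below re-implements the same checks by hand in plain JavaScript. It is only an approximation for illustration, not MongoDB's validator: the function name checkOrder is hypothetical, and an ObjectId is approximated as a 24-character hex string.

```javascript
// Hand-written approximation of the $jsonSchema rules above (illustration only).
// Assumption: customerId is represented as a 24-character hex string.
function checkOrder(doc) {
  const errors = [];
  for (const field of ["customerId", "items", "total"]) {
    if (!(field in doc)) errors.push(`missing required field: ${field}`);
  }
  if ("customerId" in doc && !/^[0-9a-f]{24}$/i.test(String(doc.customerId))) {
    errors.push("customerId must look like an ObjectId");
  }
  if ("total" in doc && (typeof doc.total !== "number" || doc.total < 0)) {
    errors.push("total must be a number >= 0");
  }
  if ("status" in doc && !["pending", "shipped", "delivered"].includes(doc.status)) {
    errors.push("status must be pending, shipped, or delivered");
  }
  return errors;
}

const good = { customerId: "507f1f77bcf86cd799439011", items: [], total: 44.97, status: "shipped" };
const bad = { customerId: "507f1f77bcf86cd799439011", total: -5, status: "lost" };

console.log(checkOrder(good)); // no errors
console.log(checkOrder(bad));  // missing items, negative total, unknown status
```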

To join collections, use $lookup in the aggregation pipeline.

The Document Model in MongoDB

BSON (Binary JSON) is the binary representation of JSON-like documents used to store data in MongoDB.

MongoDB stores data as documents — self-contained records with a flexible structure
Document fields can be values, embedded documents, or arrays of values and documents
MongoDB is a flexible schema database that supports polymorphism

// Example document — a user profile with embedded data
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  name: {
    first: "Jane",
    last: "Smith"
  },
  email: "jane.smith@example.com",
  roles: ["admin", "editor"],
  address: {
    street: "123 Main St",
    city: "San Francisco",
    state: "CA",
    zip: "94105"
  },
  createdAt: ISODate("2026-01-15T10:30:00Z")
}

Document Model vs. Relational Model

Aspect	MongoDB Document Model	Relational Model
Data unit	Document (JSON/BSON)	Row in a table
Schema	Flexible, per-document	Rigid, defined per table
Relationships	Embedded docs or references	Foreign key joins
Query language	JSON-like query API	SQL
Scaling	Horizontal (sharding)	Vertical (scale up)
Joins	$lookup (limited)	Native JOIN (powerful)
Transactions	Multi-document (since 4.0)	ACID from inception
Indexing	B-tree indexes on any field	B-tree indexes on columns
Atomicity	Per-document by default; multi-document with transactions	Per-statement or per-transaction

When to Choose the Document Model

The document model excels when:

Data has a natural aggregate structure — an order with line items, a blog post with comments
The application reads and writes complete entities rather than normalized fragments
Schema evolution is frequent — adding fields does not require migrations
The system needs horizontal scaling for high write throughput

When the Relational Model Is Better

Complex relationships with many-to-many joins across diverse entities
Strict referential integrity requirements
Highly normalized data with complex reporting queries
Existing team expertise and tooling investment

Embedding vs. Referencing

One of the most critical decisions in MongoDB data modeling is whether to embed related data or store it in a separate collection with references.

Embedding

Embedding stores related data inside a parent document. This is the preferred approach for most MongoDB applications.

// Embedded model — order with line items included
{
  _id: ObjectId("..."),
  customerId: ObjectId("..."),
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  items: [
    { productId: ObjectId("..."), name: "Widget", qty: 2, price: 9.99 },
    { productId: ObjectId("..."), name: "Gadget", qty: 1, price: 24.99 }
  ],
  total: 44.97,
  shipping: {
    address: "123 Main St",
    carrier: "UPS",
    tracking: "1Z999AA10123456784"
  }
}

Advantages of embedding:

Single read retrieves all related data — no joins needed
Atomic writes for the entire aggregate — consistent updates
Lower read latency for data that is always accessed together
Reduced number of round trips to the database

Trade-offs:

The 16 MB document size limit caps how much data can be embedded
Duplicating data across documents (denormalization) requires careful management
Updating data that is duplicated across many parent documents can be expensive
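
A rough back-of-the-envelope check can show how close an embedded array might come to the 16 MB limit. The sketch below uses the UTF-8 length of the JSON text as a stand-in for BSON size, which is an assumption (actual BSON sizes differ), and the helper names approxBytes and estimateMaxEmbedded are hypothetical.

```javascript
const MAX_DOC_BYTES = 16 * 1024 * 1024; // 16 MB document size limit

// Approximate a value's size by the UTF-8 length of its JSON text.
// Assumption: this is only a proxy; real BSON encoding differs.
function approxBytes(value) {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Estimate how many copies of a sample item fit in one document.
function estimateMaxEmbedded(baseDoc, sampleItem) {
  const remaining = MAX_DOC_BYTES - approxBytes(baseDoc);
  return Math.floor(remaining / approxBytes(sampleItem));
}

const order = { status: "shipped", total: 44.97, items: [] };
const lineItem = { name: "Widget", qty: 2, price: 9.99 };

console.log(approxBytes(lineItem));
console.log(estimateMaxEmbedded(order, lineItem));
```

An estimate like this is most useful as an early warning: if the number is only a few thousand and the array can keep growing, referencing is probably the safer choice.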

Referencing

Referencing stores related data in separate documents and links them by ID.

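// Referenced model: order document
// (IDs such as "order001" are shortened for readability; real ObjectId values are 24 hex characters)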
{
  _id: ObjectId("order001"),
  customerId: ObjectId("cust001"),
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  total: 44.97
}

// Separate items collection
{
  _id: ObjectId("item001"),
  orderId: ObjectId("order001"),
  productId: ObjectId("prod001"),
  name: "Widget",
  qty: 2,
  price: 9.99
}

Advantages of referencing:

No document size limit concerns — data grows independently
No data duplication — each fact exists in one place
Independent updates (a product name is changed in one place, and every order that references it sees the change)
Better for data with complex many-to-many relationships

Trade-offs:

Requires $lookup or an extra query to read related data, which adds latency
Writes that span several documents are not atomic unless wrapped in a multi-document transaction
More queries needed to assemble complete entities
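
As a sketch of the read path for the referenced model, the pipeline below joins an order to its line items. The collection name items follows the comment in the example above; the output field lineItems and the helper orderWithItemsPipeline are assumptions made for illustration.

```javascript
// Build an aggregation pipeline that loads one order with its items.
// Assumption: items live in a collection named "items" and point back
// to their order through an orderId field, as in the example above.
function orderWithItemsPipeline(orderId) {
  return [
    { $match: { _id: orderId } },
    {
      $lookup: {
        from: "items",           // collection to join
        localField: "_id",       // field on the order document
        foreignField: "orderId", // field on the item documents
        as: "lineItems"          // output array field (hypothetical name)
      }
    }
  ];
}

// A string stands in for a real ObjectId here.
const pipeline = orderWithItemsPipeline("order001");
console.log(JSON.stringify(pipeline, null, 2));
// In mongosh this would be passed to db.orders.aggregate(pipeline).
```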

Decision Guide: Embed or Reference?

Scenario	Recommended Approach
Data accessed together (contains relationship)	Embed
One-to-few relationship (< 100 related items)	Embed
Independent lifecycle (related data changes independently)	Reference
Many-to-many relationship	Reference
Data grows unbounded (comments on a popular post)	Reference
Frequently read together, infrequently updated separately	Embed
1:1 relationship with tight coupling	Embed
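
The table can be read as a small decision procedure. The sketch below is one possible encoding, not an official rule set: the property names and the order in which the checks run are assumptions.

```javascript
// One possible encoding of the decision guide (illustrative only).
// Reasons to reference are checked first; otherwise embedding is the default.
function embedOrReference(rel) {
  if (rel.unbounded) return "reference";            // data grows without bound
  if (rel.manyToMany) return "reference";           // many-to-many relationship
  if (rel.independentLifecycle) return "reference"; // changes on its own schedule
  return "embed";                                   // accessed together, one-to-few, 1:1
}

console.log(embedOrReference({ unbounded: true }));             // reference
console.log(embedOrReference({ manyToMany: true }));            // reference
console.log(embedOrReference({ independentLifecycle: false })); // embed
```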

Schema Design Principles

Principle 1: Model for Application Needs

Design your schema around how your application queries data, not around abstract entity relationships. Start by listing all application queries and their frequency.

// If the app always loads a user with their recent orders,
// embed recent orders in the user document
{
  _id: ObjectId("..."),
  name: "Jane Smith",
  recentOrders: [
    { orderId: ObjectId("..."), total: 44.97, date: ISODate("2026-03-15") },
    { orderId: ObjectId("..."), total: 19.99, date: ISODate("2026-03-10") }
  ]
}
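
One way to keep an embedded array such as recentOrders short is to cap it on every write. The sketch below builds an update document with the $push modifiers $each, $sort, and $slice; the helper name pushRecentOrder and the cap of five entries are assumptions.

```javascript
// Build an update that adds an order summary and keeps only the newest N.
// Assumption: five recent orders are enough for the application's user view.
function pushRecentOrder(summary, keep = 5) {
  return {
    $push: {
      recentOrders: {
        $each: [summary],    // element(s) to append
        $sort: { date: -1 }, // newest first
        $slice: keep         // keep the first `keep` elements after sorting
      }
    }
  };
}

// A string stands in for a real ObjectId here.
const update = pushRecentOrder({ orderId: "order003", total: 12.5, date: new Date("2026-03-20") });
console.log(JSON.stringify(update, null, 2));
// In mongosh: db.users.updateOne({ _id: userId }, update)
```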

Principle 2: Favor Embedding Unless There Is a Reason Not To

Embedding is the default choice. Only use references when embedding causes problems such as 16 MB document size violations, excessive data duplication, or complex update propagation.

Principle 3: Structure Data to Match Query Patterns

Design documents to match the shape of query results. If the application displays order summaries on a dashboard, pre-join the data so each document represents exactly one dashboard entry.

Principle 4: Use Arrays for One-to-Many With Moderate Growth

Arrays are ideal for one-to-many relationships where the array size stays manageable (up to a few hundred items).

{
  _id: ObjectId("..."),
  name: "Team Alpha",
  members: [
    { userId: ObjectId("..."), role: "lead" },
    { userId: ObjectId("..."), role: "member" }
  ]
}

Principle 5: Index According to Query Patterns

Every schema design must consider indexing. Without proper indexes, even the best schema design will perform poorly.

// Create indexes that support common queries
db.users.createIndex({ email: 1 }, { unique: true });
db.orders.createIndex({ customerId: 1, orderDate: -1 });
db.products.createIndex({ category: 1, price: 1 });

Real-World Data Modeling Examples

E-Commerce Application

An e-commerce platform needs to model products, customers, orders, and reviews.

// Products collection — embedded variants and categories
{
  _id: ObjectId("..."),
  name: "Wireless Headphones",
  description: "Noise-canceling Bluetooth headphones",
  price: 79.99,
  category: "Electronics",
  tags: ["audio", "wireless", "bluetooth"],
  variants: [
    { color: "Black", sku: "WH-BLK-001", stock: 50 },
    { color: "White", sku: "WH-WHT-001", stock: 30 }
  ],
  reviews: [
    { userId: ObjectId("..."), rating: 4, text: "Great sound", date: ISODate("2026-02-10") }
  ],
  averageRating: 4.2,
  reviewCount: 127,
  createdAt: ISODate("2025-06-01")
}

// Orders collection — embedded line items
{
  _id: ObjectId("..."),
  customerId: ObjectId("..."),
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  items: [
    { productId: ObjectId("..."), name: "Wireless Headphones", qty: 1, price: 79.99 },
    { productId: ObjectId("..."), name: "USB Cable", qty: 2, price: 5.99 }
  ],
  shipping: {
    address: "456 Oak Ave",
    city: "Portland",
    state: "OR",
    zip: "97201"
  },
  payment: {
    method: "credit_card",
    transactionId: "txn_abc123"
  },
  total: 91.97
}

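The products collection embeds variants and reviews because they are displayed together. Embedding reviews is safe only while their number stays bounded; for a product whose reviews keep growing, a model would typically embed a recent subset and keep the full set in a separate collection. The orders collection embeds line items because an order always displays its items, and the number of items per order is bounded.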

Social Media Platform

A social media application handles users, posts, comments, and likes.

// Users collection
{
  _id: ObjectId("..."),
  username: "jdoe",
  displayName: "Jane Doe",
  profile: {
    bio: "Software engineer and photographer",
    avatar: "https://cdn.example.com/avatars/jdoe.jpg"
  },
  followersCount: 1240,
  followingCount: 342,
  postCount: 89
}

// Posts collection — references for high-write data
{
  _id: ObjectId("..."),
  userId: ObjectId("..."),
  content: "Just deployed my first microservice!",
  images: ["https://cdn.example.com/posts/img1.jpg"],
  likesCount: 42,
  commentsCount: 7,
  createdAt: ISODate("2026-03-15T08:30:00Z"),
  // last few comments are embedded for fast display
  recentComments: [
    {
      userId: ObjectId("..."),
      username: "msmith",
      text: "Congratulations!",
      createdAt: ISODate("2026-03-15T09:00:00Z")
    }
  ]
}

// Likes collection — referenced for scalability
{
  _id: ObjectId("..."),
  postId: ObjectId("..."),
  userId: ObjectId("..."),
  createdAt: ISODate("2026-03-15T08:35:00Z")
}

The social media model uses a hybrid approach. Recent comments are embedded for instant display while all comments and likes are stored in separate collections for scalable writes. Count fields (likesCount, commentsCount) are pre-computed to avoid counting on every read.
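
A sketch of how a new comment might update a post in this hybrid model: the counter is incremented and the embedded recentComments array is trimmed in the same single-document update. The helper name addCommentUpdate and the choice to keep three comments are assumptions; the full comment would also be inserted into its own collection.

```javascript
// Build the update applied to a post when a comment is added.
// Assumption: the post keeps only the last three comments embedded.
function addCommentUpdate(comment, keep = 3) {
  return {
    $inc: { commentsCount: 1 }, // pre-computed counter
    $push: {
      recentComments: {
        $each: [comment],
        $slice: -keep // negative slice keeps the last `keep` elements
      }
    }
  };
}

const commentUpdate = addCommentUpdate({
  username: "msmith",
  text: "Congratulations!",
  createdAt: new Date("2026-03-15T09:00:00Z")
});
console.log(JSON.stringify(commentUpdate, null, 2));
```

Because both changes target one document, they are applied atomically, so the counter and the embedded array do not drift apart.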

IoT Sensor Data

An IoT platform collects temperature and humidity readings from thousands of sensors.

// Bucket pattern — group readings by time window
{
  _id: ObjectId("..."),
  sensorId: "sensor-abc-001",
  location: {
    building: "Building A",
    floor: 3,
    room: "Conference Room"
  },
  startDate: ISODate("2026-03-15T10:00:00Z"),
  endDate: ISODate("2026-03-15T11:00:00Z"),
  readings: [
    { timestamp: ISODate("2026-03-15T10:00:00Z"), temperature: 22.5, humidity: 45 },
    { timestamp: ISODate("2026-03-15T10:05:00Z"), temperature: 22.7, humidity: 44 },
    // ... up to 12 readings per hour at 5-minute intervals
  ],
  readingCount: 12,
  metadata: {
    model: "TempHumidityPro v2",
    firmware: "3.1.0"
  }
}

The bucket pattern groups sensor readings into hour-long documents. At 5-minute intervals each bucket holds up to 12 readings, so the document count drops by roughly a factor of 12 compared with one document per reading, and each document stays far below the 16 MB limit. It also makes time-range queries efficient because all readings for a sensor-hour are in one document.
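
The write path for hourly buckets can be sketched as an upsert: each reading is pushed into the document for its sensor and hour, and the document is created on first use. The helper names and the collection name readings are hypothetical; the field names follow the example document above.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Start of the UTC hour that contains the given timestamp.
function hourStart(timestamp) {
  return new Date(Math.floor(timestamp.getTime() / HOUR_MS) * HOUR_MS);
}

// Build the filter and update for an upsert into the hourly bucket.
function bucketUpsert(sensorId, reading) {
  const start = hourStart(reading.timestamp);
  return {
    filter: { sensorId, startDate: start },
    update: {
      $push: { readings: reading },
      $inc: { readingCount: 1 },
      $setOnInsert: { endDate: new Date(start.getTime() + HOUR_MS) }
    },
    options: { upsert: true }
  };
}

const op = bucketUpsert("sensor-abc-001", {
  timestamp: new Date("2026-03-15T10:05:00Z"),
  temperature: 22.7,
  humidity: 44
});
console.log(op.filter.startDate.toISOString()); // 2026-03-15T10:00:00.000Z
// In mongosh (hypothetical collection name):
// db.readings.updateOne(op.filter, op.update, op.options)
```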

Constraints in Data Modeling

Hardware and operational constraints shape data modeling decisions:

RAM — Keep frequently accessed documents and indexes in RAM for fast access
Storage — Prefer solid state drives (SSD) over hard disk drives (HDD)
Network — Latency between application and database affects query design
Document size — 16 MB maximum document size limits embedded data volume
Replication — Write concern and read preference affect consistency and performance

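// Check how much data the WiredTiger cache currently holds (a rough proxy for the working set)
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]
// Compare it with the configured cache size
db.serverStatus().wiredTiger.cache["maximum bytes configured"]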

Constraint	Impact on Modeling
RAM size	Limits how much hot data can live in memory
Disk speed	Affects read latency for infrequently accessed data
Network bandwidth	Encourages embedding to reduce round trips
Document size (16 MB)	Limits array and embedded document growth
Index size	Must fit in RAM for optimal performance

Recap:

The nature of your dataset and your hardware determine how carefully you need to model your data
Identify constraints and their impact to create a better model
As your software and technological landscape change, re-evaluate and update your model accordingly

When working with MongoDB, security features, network performance, disk drive speed, and amount of RAM are all aspects you need to keep in mind. The operating system abstracts many hardware details from the database, but it does not remove these limits.

The Data Modeling Methodology

The MongoDB data modeling methodology follows a flexible, iterative process:

Identify the workload — Document application queries, reads, writes, and performance requirements
Create a conceptual model — Map entities and relationships based on the workload
Apply patterns — Use proven patterns (embedding, referencing, bucket, computed, etc.) to optimize the model
Test and refine — Validate against real-world performance and refactor as needed

Model for Simplicity or Performance

Modeling for Simplicity

The simplest model embeds all related data into a single document. This is easy to develop and maintain but may not perform optimally for all workloads. Use this approach for prototypes, internal tools, and applications with modest traffic.

Modeling for Performance

Performance-optimized models use a combination of embedding, referencing, pre-computation, and strategic indexing. This approach requires more design effort but delivers lower latency and higher throughput. Use this for customer-facing applications, real-time systems, and high-traffic services.

Modeling for a Mix of Simplicity and Performance

Most production systems use a mixed approach: embed data that is always accessed together, reference data that changes independently, and pre-compute values that are read frequently. This balances developer productivity with operational performance.

Summary of Modeling Approaches:

Approach	Effort	Performance	Use Case
Simplicity	Low	Adequate	Prototypes, internal tools
Performance	High	Optimal	Customer-facing, real-time
Mixed	Medium	High	Most production applications

Identifying the Workload

Case Study: IoT Data Platform

Organization manages 100 million weather sensors
Requirements:
- Collect data from all devices at 5-minute intervals
- Analyze data trends with a team of 10 data scientists
- Support real-time dashboard with sub-second query latency
- Archive historical data for long-term analysis

Workload Characteristics

Operation	Frequency	Latency Requirement
Sensor data write	100M writes / 5 min	< 10 ms per write
Latest reading query	50,000 reads / sec	< 50 ms
Historical trend analysis	100 queries / day	< 30 seconds
Alert generation	1,000 alerts / min	< 1 second
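
The table's first row implies a sustained write rate that is worth computing explicitly. The arithmetic below uses only the figures given above (100 million sensors, one reading every 5 minutes); the hourly bucketing of 12 readings per document is borrowed from the bucket pattern example and is an assumption for this case study.

```javascript
const SENSORS = 100_000_000;     // devices in the case study
const INTERVAL_SECONDS = 5 * 60; // one reading every 5 minutes

const writesPerSecond = SENSORS / INTERVAL_SECONDS;
const readingsPerDay = SENSORS * (24 * 60 * 60 / INTERVAL_SECONDS);

// Assumption: hourly buckets holding 12 readings each.
const READINGS_PER_BUCKET = 12;
const bucketsPerDay = readingsPerDay / READINGS_PER_BUCKET;

console.log(Math.round(writesPerSecond)); // 333333
console.log(readingsPerDay);              // 28800000000
console.log(bucketsPerDay);               // 2400000000
```

Roughly 333,000 writes per second is the figure the write path has to sustain, which is why the number of documents touched per reading matters so much in this workload.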

Data Durability Requirements

Data Type	Durability	Retention
Raw sensor readings	w: majority	90 days
Aggregated hourly stats	w: majority	2 years
Alert events	w: majority	1 year
System logs	w: 1	30 days
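
Retention periods like these are often implemented with TTL indexes, and durability with a write concern on each write. The sketch below turns the table into those settings; the collection names, the createdAt field, and the helper name settingsFor are assumptions, and a year is taken as 365 days.

```javascript
const DAY_SECONDS = 24 * 60 * 60;

// Durability and retention from the table (collection names are hypothetical).
const policies = [
  { collection: "readings",     w: "majority", retentionDays: 90 },
  { collection: "hourly_stats", w: "majority", retentionDays: 2 * 365 },
  { collection: "alerts",       w: "majority", retentionDays: 365 },
  { collection: "system_logs",  w: 1,          retentionDays: 30 }
];

// Derive the write concern and TTL index options for one policy.
// Assumption: each document carries a createdAt date field.
function settingsFor(policy) {
  return {
    writeConcern: { w: policy.w },
    ttlIndex: {
      keys: { createdAt: 1 },
      options: { expireAfterSeconds: policy.retentionDays * DAY_SECONDS }
    }
  };
}

for (const p of policies) {
  console.log(p.collection, JSON.stringify(settingsFor(p)));
}
// In mongosh: db.readings.createIndex(keys, options) and
// db.readings.insertOne(doc, { writeConcern: { w: "majority" } })
```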

Quantify and qualify your queries as much as possible
A small set of CRUD operations typically drives the entire schema design
Document every query with its frequency, acceptable latency, and data volume
