M320: Chapter 1: Introduction

MongoDB Data Modeling

Published: June 12, 2021 Updated: May 24, 2026 Larry Qu 11 min read

Introduction to Data Modeling

Data modeling in MongoDB is the process of designing how data is structured, stored, and accessed. A well-designed data model leads to:

  • Good performance for read and write operations
  • Higher developer productivity
  • Lower infrastructure and operational costs

The data modeling workflow follows four phases:

  1. Gather the requirements — Understand application data needs, query patterns, and performance goals
  2. Create a conceptual model — Map the requirements into an abstract representation of entities and relationships
  3. Apply transformation patterns — Use proven patterns to optimize the model for MongoDB’s document model
  4. Evolve the model — Refactor as application requirements and usage patterns change over time

Course Prerequisites

Here are some terms and references for your benefit:

MongoDB Concepts and Vocabulary

Relational Database Concepts and Vocabulary

General Database Concepts and Definitions

MongoDB Compass and Atlas

Data Modeling in MongoDB

MongoDB has a flexible schema, which is not the same as having no schema. Schema refers to the logical structure of data. MongoDB does not enforce a schema by default, but every application implicitly defines one through its read and write patterns, and a collection can opt in to enforcement through document validation.

ERD and UML tooling can help visualize relationships, but MongoDB data modeling starts from a different question: instead of “what entities exist?”, ask “how is the data accessed?”.

Key factors that drive data modeling decisions:

  • Usage pattern — how your application accesses data
  • Query access — which queries are critical to performance
  • Read-to-write ratio — the proportion of read versus write operations
  • Data growth — how the dataset scales over time

Document validation enforces rules at the collection level:

db.createCollection("orders", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["customerId", "items", "total"],
      properties: {
        customerId: { bsonType: "objectId" },
        total: { bsonType: "number", minimum: 0 },
        status: { enum: ["pending", "shipped", "delivered"] },
      },
    },
  },
});
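
The validator above rejects bad writes on the server. As a rough sketch, the same rules can also be checked in application code before a write is attempted, which gives earlier and friendlier error messages. The helper name `findOrderProblems` is hypothetical, and the check is a convenience only, not a replacement for the server-side validator.

```javascript
// Hypothetical application-side pre-check that mirrors the validator rules above.
// It does not replace server-side validation; it only reports problems earlier.
const ALLOWED_STATUSES = ["pending", "shipped", "delivered"];
const REQUIRED_FIELDS = ["customerId", "items", "total"];

function findOrderProblems(order) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (order[field] === undefined) {
      problems.push(`missing required field: ${field}`);
    }
  }
  if (order.total !== undefined &&
      (typeof order.total !== "number" || order.total < 0)) {
    problems.push("total must be a number >= 0");
  }
  if (order.status !== undefined && !ALLOWED_STATUSES.includes(order.status)) {
    problems.push(`status must be one of: ${ALLOWED_STATUSES.join(", ")}`);
  }
  // The ObjectId type check on customerId is left to the server in this sketch.
  return problems;
}
```

An empty array means the order passes the mirrored rules; any other result lists what the server would likely reject.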

To join collections, use $lookup in the aggregation pipeline.
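
As an illustration, the function below builds a pipeline that joins each order to its customer. It assumes a `customers` collection whose `_id` matches `orders.customerId`; that collection name and the function name are assumptions for this sketch, not part of the original example.

```javascript
// Build an aggregation pipeline that joins each order to its customer.
// Assumption: a "customers" collection exists whose _id matches orders.customerId.
function ordersWithCustomerPipeline(status) {
  return [
    { $match: { status } },
    {
      $lookup: {
        from: "customers",
        localField: "customerId",
        foreignField: "_id",
        as: "customer",
      },
    },
    // $lookup always produces an array; $unwind turns it into one object per order.
    { $unwind: "$customer" },
  ];
}

const shippedPipeline = ordersWithCustomerPipeline("shipped");
// In mongosh this would run as: db.orders.aggregate(shippedPipeline)
```

Note that `$unwind` in this form drops orders that have no matching customer. Placing `$match` first keeps the join limited to the orders that are actually needed.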

The Document Model in MongoDB

BSON (Binary JSON) is the binary representation of JSON-like documents used to store data in MongoDB.

  • MongoDB stores data as documents — self-contained records with a flexible structure
  • Document fields can be values, embedded documents, or arrays of values and documents
  • MongoDB is a flexible schema database that supports polymorphism

// Example document — a user profile with embedded data
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  name: {
    first: "Jane",
    last: "Smith"
  },
  email: "jane.smith@example.com",
  roles: ["admin", "editor"],
  address: {
    street: "123 Main St",
    city: "San Francisco",
    state: "CA",
    zip: "94105"
  },
  createdAt: ISODate("2026-01-15T10:30:00Z")
}

Document Model vs. Relational Model

| Aspect | MongoDB Document Model | Relational Model |
| --- | --- | --- |
| Data unit | Document (JSON/BSON) | Row in a table |
| Schema | Flexible, per-document | Rigid, defined per table |
| Relationships | Embedded docs or references | Foreign key joins |
| Query language | JSON-like query API | SQL |
| Scaling | Horizontal (built-in sharding) | Primarily vertical (scale up) |
| Joins | $lookup aggregation stage (left outer join) | Native JOIN (inner, outer, cross) |
| Transactions | Multi-document ACID (since 4.0) | ACID from inception |
| Indexing | B-tree indexes on any field | B-tree indexes on columns |
| Atomicity | Single document by default; multi-document with transactions | Per statement or transaction, across rows and tables |

When to Choose the Document Model

The document model excels when:

  • Data has a natural aggregate structure — an order with line items, a blog post with comments
  • The application reads and writes complete entities rather than normalized fragments
  • Schema evolution is frequent — adding fields does not require migrations
  • The system needs horizontal scaling for high write throughput

When the Relational Model Is Better

  • Complex relationships with many-to-many joins across diverse entities
  • Strict referential integrity requirements
  • Highly normalized data with complex reporting queries
  • Existing team expertise and tooling investment

Embedding vs. Referencing

One of the most critical decisions in MongoDB data modeling is whether to embed related data or store it in a separate collection with references.

Embedding

Embedding stores related data inside a parent document. This is the preferred approach for most MongoDB applications.

// Embedded model — order with line items included
{
  _id: ObjectId("..."),
  customerId: ObjectId("..."),
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  items: [
    { productId: ObjectId("..."), name: "Widget", qty: 2, price: 9.99 },
    { productId: ObjectId("..."), name: "Gadget", qty: 1, price: 24.99 }
  ],
  total: 44.97,
  shipping: {
    address: "123 Main St",
    carrier: "UPS",
    tracking: "1Z999AA10123456784"
  }
}

Advantages of embedding:

  • Single read retrieves all related data — no joins needed
  • Atomic writes for the entire aggregate — consistent updates
  • Lower read latency for data that is always accessed together
  • Reduced number of round trips to the database

Trade-offs:

  • Document size limit of 16 MB limits how much data can be embedded
  • Duplicating data across documents (denormalization) requires careful management
  • Updating embedded data in all parent documents can be expensive
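
To make the size trade-off concrete, here is a minimal sketch that estimates how close a document is to the 16 MB limit. It uses serialized JSON length as a stand-in for BSON size, which is only a coarse approximation, and it relies on Node's `Buffer`. The function names and the 50% warning threshold are assumptions.

```javascript
const MAX_DOCUMENT_BYTES = 16 * 1024 * 1024; // MongoDB's 16 MB document limit

// Coarse estimate only: serialized JSON length is not the same as BSON size,
// but it helps spot documents that are growing toward the limit.
function approximateSizeBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Report the estimated size and whether it crosses a warning threshold.
function embeddingHeadroom(doc, warnRatio = 0.5) {
  const bytes = approximateSizeBytes(doc);
  return {
    bytes,
    ratio: bytes / MAX_DOCUMENT_BYTES,
    nearLimit: bytes >= MAX_DOCUMENT_BYTES * warnRatio,
  };
}
```

A document that reports `nearLimit: true` is a candidate for moving its largest embedded array into a separate collection.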

Referencing

Referencing stores related data in separate documents and links them by ID.

// Referenced model — order document
// Readable string IDs are used here for clarity; ObjectId values link documents the same way
{
  _id: "order001",
  customerId: "cust001",
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  total: 44.97
}

// Separate items collection — each item points back to its order through orderId
{
  _id: "item001",
  orderId: "order001",
  productId: "prod001",
  name: "Widget",
  qty: 2,
  price: 9.99
}

Advantages of referencing:

  • No document size limit concerns — data grows independently
  • No data duplication — each fact exists in one place
  • Independent updates — a product name changed in one place is seen by every order that references it
  • Better for data with complex many-to-many relationships

Trade-offs:

  • Requires $lookup to read related data — additional latency
  • No atomicity across documents unless you use multi-document transactions
  • More queries needed to assemble complete entities
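
The last trade-off can be sketched as an application-side join. The function below assumes the orders and items were already fetched with two separate queries and stitches them together by `orderId`. The function name is hypothetical.

```javascript
// Application-side join: attach items to their parent orders by orderId.
// Assumes both arrays were already fetched, for example with two find() calls.
function attachItems(orders, items) {
  const itemsByOrder = new Map();
  for (const item of items) {
    const key = String(item.orderId);
    if (!itemsByOrder.has(key)) itemsByOrder.set(key, []);
    itemsByOrder.get(key).push(item);
  }
  // Orders without items get an empty array rather than undefined.
  return orders.map((order) => ({
    ...order,
    items: itemsByOrder.get(String(order._id)) || [],
  }));
}
```

Grouping items in a `Map` first keeps the join linear in the number of items instead of scanning the items array once per order.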

Decision Guide: Embed or Reference?

| Scenario | Recommended Approach |
| --- | --- |
| Data accessed together (contains relationship) | Embed |
| One-to-few relationship (< 100 related items) | Embed |
| Independent lifecycle (related data changes independently) | Reference |
| Many-to-many relationship | Reference |
| Data grows unbounded (comments on a popular post) | Reference |
| Frequently read together, infrequently updated separately | Embed |
| 1:1 relationship with tight coupling | Embed |
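
The decision guide can be expressed as a small function. This is a sketch: the input field names are hypothetical labels for the scenarios in the guide, and real decisions usually weigh several factors at once.

```javascript
// Encodes the decision guide above as a function. The input field names
// are hypothetical; they only label the scenarios from the guide.
function recommendModel(relationship) {
  // Any scenario the guide maps to "Reference" wins over the default.
  if (relationship.unboundedGrowth) return "reference";
  if (relationship.manyToMany) return "reference";
  if (relationship.independentLifecycle) return "reference";
  // Every remaining scenario in the guide maps to "Embed".
  return "embed";
}

const lineItems = recommendModel({ accessedTogether: true });
const postComments = recommendModel({ accessedTogether: true, unboundedGrowth: true });
```

Here `lineItems` resolves to embedding and `postComments` to referencing, since unbounded growth overrides the fact that the data is read together.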

Schema Design Principles

Principle 1: Model for Application Needs

Design your schema around how your application queries data, not around abstract entity relationships. Start by listing all application queries and their frequency.

// If the app always loads a user with their recent orders,
// embed recent orders in the user document
{
  _id: ObjectId("..."),
  name: "Jane Smith",
  recentOrders: [
    { orderId: ObjectId("..."), total: 44.97, date: ISODate("2026-03-15") },
    { orderId: ObjectId("..."), total: 19.99, date: ISODate("2026-03-10") }
  ]
}
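
An embedded list of recent orders needs a cap, or it grows without bound. One way to sketch this is an update that adds the new order, sorts newest first, and trims the array in the same operation, using the `$push` modifiers `$each`, `$sort`, and `$slice`. The function name and the default limit of 5 are assumptions.

```javascript
// Build an update that adds a new order, sorts newest first,
// and keeps only the newest `limit` entries in recentOrders.
function recentOrdersUpdate(order, limit = 5) {
  return {
    $push: {
      recentOrders: {
        $each: [order],
        $sort: { date: -1 }, // newest first
        $slice: limit,       // keep the first `limit` elements after sorting
      },
    },
  };
}

const recentUpdate = recentOrdersUpdate({ orderId: "order003", total: 12.5, date: new Date("2026-03-20") });
// In mongosh this would run as: db.users.updateOne({ _id: userId }, recentUpdate)
```

Because the trim happens inside the same update, the array never exceeds the limit, even under concurrent writes to the same user document.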

Principle 2: Favor Embedding Unless There Is a Reason Not To

Embedding is the default choice. Only use references when embedding causes problems such as 16 MB document size violations, excessive data duplication, or complex update propagation.

Principle 3: Structure Data to Match Query Patterns

Design documents to match the shape of query results. If the application displays order summaries on a dashboard, pre-join the data so each document represents exactly one dashboard entry.

Principle 4: Use Arrays for One-to-Many With Moderate Growth

Arrays are ideal for one-to-many relationships where the array size stays manageable (up to a few hundred items).

{
  _id: ObjectId("..."),
  name: "Team Alpha",
  members: [
    { userId: ObjectId("..."), role: "lead" },
    { userId: ObjectId("..."), role: "member" }
  ]
}

Principle 5: Index According to Query Patterns

Every schema design must consider indexing. Without proper indexes, even the best schema design will perform poorly.

// Create indexes that support common queries
db.users.createIndex({ email: 1 }, { unique: true });
db.orders.createIndex({ customerId: 1, orderDate: -1 });
db.products.createIndex({ category: 1, price: 1 });

Real-World Data Modeling Examples

E-Commerce Application

An e-commerce platform needs to model products, customers, orders, and reviews.

// Products collection — embedded variants and categories
{
  _id: ObjectId("..."),
  name: "Wireless Headphones",
  description: "Noise-canceling Bluetooth headphones",
  price: 79.99,
  category: "Electronics",
  tags: ["audio", "wireless", "bluetooth"],
  variants: [
    { color: "Black", sku: "WH-BLK-001", stock: 50 },
    { color: "White", sku: "WH-WHT-001", stock: 30 }
  ],
  reviews: [
    { userId: ObjectId("..."), rating: 4, text: "Great sound", date: ISODate("2026-02-10") }
  ],
  averageRating: 4.2,
  reviewCount: 127,
  createdAt: ISODate("2025-06-01")
}

// Orders collection — embedded line items
{
  _id: ObjectId("..."),
  customerId: ObjectId("..."),
  orderDate: ISODate("2026-03-15"),
  status: "shipped",
  items: [
    { productId: ObjectId("..."), name: "Wireless Headphones", qty: 1, price: 79.99 },
    { productId: ObjectId("..."), name: "USB Cable", qty: 2, price: 5.99 }
  ],
  shipping: {
    address: "456 Oak Ave",
    city: "Portland",
    state: "OR",
    zip: "97201"
  },
  payment: {
    method: "credit_card",
    transactionId: "txn_abc123"
  },
  total: 91.97
}

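The products collection embeds variants and reviews because they are displayed together. Embedding reviews is only safe while the number of reviews per product stays bounded; for a popular product, keep the most recent reviews embedded and move the rest to a separate collection. The orders collection embeds line items because an order always displays its items, and the number of items per order is bounded.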
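
The stored `total` duplicates information that can be derived from the line items, so the two must agree. A minimal sketch of the calculation follows, summing in integer cents to avoid floating-point drift. The function name is hypothetical.

```javascript
// Compute an order total from its line items.
// Sums in integer cents so that repeated additions do not accumulate float error.
function orderTotal(items) {
  const cents = items.reduce(
    (sum, item) => sum + Math.round(item.price * 100) * item.qty,
    0
  );
  return cents / 100;
}

const headphonesOrderTotal = orderTotal([
  { name: "Wireless Headphones", qty: 1, price: 79.99 },
  { name: "USB Cable", qty: 2, price: 5.99 },
]);
```

For the order shown above this yields 91.97, which matches the stored `total` field.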

Social Media Platform

A social media application handles users, posts, comments, and likes.

// Users collection
{
  _id: ObjectId("..."),
  username: "jdoe",
  displayName: "Jane Doe",
  profile: {
    bio: "Software engineer and photographer",
    avatar: "https://cdn.example.com/avatars/jdoe.jpg"
  },
  followersCount: 1240,
  followingCount: 342,
  postCount: 89
}

// Posts collection — references for high-write data
{
  _id: ObjectId("..."),
  userId: ObjectId("..."),
  content: "Just deployed my first microservice!",
  images: ["https://cdn.example.com/posts/img1.jpg"],
  likesCount: 42,
  commentsCount: 7,
  createdAt: ISODate("2026-03-15T08:30:00Z"),
  // last few comments are embedded for fast display
  recentComments: [
    {
      userId: ObjectId("..."),
      username: "msmith",
      text: "Congratulations!",
      createdAt: ISODate("2026-03-15T09:00:00Z")
    }
  ]
}

// Likes collection — referenced for scalability
{
  _id: ObjectId("..."),
  postId: ObjectId("..."),
  userId: ObjectId("..."),
  createdAt: ISODate("2026-03-15T08:35:00Z")
}

The social media model uses a hybrid approach. Recent comments are embedded for instant display while all comments and likes are stored in separate collections for scalable writes. Count fields (likesCount, commentsCount) are pre-computed to avoid counting on every read.
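
The pre-computed counters only stay accurate if every like updates them. As a sketch, recording a like involves two writes: one insert into the likes collection and one `$inc` on the post. The function name and the shape of the returned object are assumptions.

```javascript
// Describe the two writes needed to record a like:
// a new document in the likes collection and a counter increment on the post.
function likeWrites(postId, userId, now = new Date()) {
  return {
    insertLike: { postId, userId, createdAt: now },
    updatePost: {
      filter: { _id: postId },
      update: { $inc: { likesCount: 1 } },
    },
  };
}

const writes = likeWrites("post001", "user042", new Date("2026-03-15T08:35:00Z"));
// In mongosh these would run as:
//   db.likes.insertOne(writes.insertLike)
//   db.posts.updateOne(writes.updatePost.filter, writes.updatePost.update)
```

The two writes touch different documents, so they are not atomic together unless they run inside a multi-document transaction. Many applications accept a briefly stale counter in exchange for cheaper writes.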

IoT Sensor Data

An IoT platform collects temperature and humidity readings from thousands of sensors.

// Bucket pattern — group readings by time window
{
  _id: ObjectId("..."),
  sensorId: "sensor-abc-001",
  location: {
    building: "Building A",
    floor: 3,
    room: "Conference Room"
  },
  startDate: ISODate("2026-03-15T10:00:00Z"),
  endDate: ISODate("2026-03-15T11:00:00Z"),
  readings: [
    { timestamp: ISODate("2026-03-15T10:00:00Z"), temperature: 22.5, humidity: 45 },
    { timestamp: ISODate("2026-03-15T10:05:00Z"), temperature: 22.7, humidity: 44 },
    // ... up to 12 readings per hour at 5-minute intervals
  ],
  readingCount: 12,
  metadata: {
    model: "TempHumidityPro v2",
    firmware: "3.1.0"
  }
}

The bucket pattern groups sensor readings into hour-long documents. At 5-minute intervals each bucket holds 12 readings, so the document count drops by a factor of 12 while each document stays far below the 16 MB limit. It also makes time-range queries efficient because all readings for a sensor-hour are in one document.
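
The grouping step can be sketched in plain JavaScript. The function below takes readings for one sensor and produces hour-long bucket documents shaped like the example above, without the `location` and `metadata` fields. The function name is hypothetical, and hours are aligned to UTC.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Group one sensor's readings into hour-long bucket documents (UTC-aligned).
function bucketReadings(sensorId, readings) {
  const buckets = new Map();
  for (const reading of readings) {
    // Round the timestamp down to the start of its hour.
    const startMs = Math.floor(reading.timestamp.getTime() / HOUR_MS) * HOUR_MS;
    if (!buckets.has(startMs)) {
      buckets.set(startMs, {
        sensorId,
        startDate: new Date(startMs),
        endDate: new Date(startMs + HOUR_MS),
        readings: [],
        readingCount: 0,
      });
    }
    const bucket = buckets.get(startMs);
    bucket.readings.push(reading);
    bucket.readingCount += 1;
  }
  return [...buckets.values()];
}
```

In a live system the same effect is usually achieved with an upsert per reading rather than by batching in memory, but the resulting document shape is the same.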

Constraints in Data Modeling

Hardware and operational constraints shape data modeling decisions:

  • RAM — Keep frequently accessed documents and indexes in RAM for fast access
  • Storage — Prefer solid state drives (SSD) over hard disk drives (HDD)
  • Network — Latency between application and database affects query design
  • Document size — 16 MB maximum document size limits embedded data volume
  • Replication — Write concern and read preference affect consistency and performance

// Compare the data currently held in the WiredTiger cache with the configured cache size
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]
db.serverStatus().wiredTiger.cache["maximum bytes configured"]

| Constraint | Impact on Modeling |
| --- | --- |
| RAM size | Limits how much hot data can live in memory |
| Disk speed | Affects read latency for infrequently accessed data |
| Network bandwidth | Encourages embedding to reduce round trips |
| Document size (16 MB) | Limits array and embedded document growth |
| Index size | Must fit in RAM for optimal performance |

Recap:

  1. The nature of your dataset and hardware defines the need to model your data
  2. Identify constraints and their impact to create a better model
  3. As your software and technological landscape change, re-evaluate and update your model accordingly

When working with MongoDB, security features, network performance, disk drive speed, and amount of RAM are all aspects you need to keep in mind. The operating system hides hardware interfaces from the database, but not hardware performance, so RAM size, disk speed, and network latency still shape the model.

The Data Modeling Methodology

The MongoDB data modeling methodology follows a flexible, iterative process:

  1. Identify the workload — Document application queries, reads, writes, and performance requirements
  2. Create a conceptual model — Map entities and relationships based on the workload
  3. Apply patterns — Use proven patterns (embedding, referencing, bucket, computed, etc.) to optimize the model
  4. Test and refine — Validate against real-world performance and refactor as needed

Model for Simplicity or Performance

Modeling for Simplicity

The simplest model embeds all related data into a single document. This is easy to develop and maintain but may not perform optimally for all workloads. Use this approach for prototypes, internal tools, and applications with modest traffic.

Modeling for Performance

Performance-optimized models use a combination of embedding, referencing, pre-computation, and strategic indexing. This approach requires more design effort but delivers lower latency and higher throughput. Use this for customer-facing applications, real-time systems, and high-traffic services.

Modeling for a Mix of Simplicity and Performance

Most production systems use a mixed approach: embed data that is always accessed together, reference data that changes independently, and pre-compute values that are read frequently. This balances developer productivity with operational performance.

Summary of Modeling Approaches:

| Approach | Effort | Performance | Use Case |
| --- | --- | --- | --- |
| Simplicity | Low | Adequate | Prototypes, internal tools |
| Performance | High | Optimal | Customer-facing, real-time |
| Mixed | Medium | High | Most production applications |

Identifying the Workload

Case Study: IoT Data Platform

  • Organization manages 100 million weather sensors
  • Requirements:
    • Collect data from all devices at 5-minute intervals
    • Analyze data trends with a team of 10 data scientists
    • Support real-time dashboard with sub-second query latency
    • Archive historical data for long-term analysis

Workload Characteristics

| Operation | Frequency | Latency Requirement |
| --- | --- | --- |
| Sensor data write | 100M writes / 5 min | < 10 ms per write |
| Latest reading query | 50,000 reads / sec | < 50 ms |
| Historical trend analysis | 100 queries / day | < 30 seconds |
| Alert generation | 1,000 alerts / min | < 1 second |
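
The write figure is easier to reason about as a per-second rate. The arithmetic below is a back-of-the-envelope sketch that assumes writes are spread evenly across each 5-minute window, which real sensor fleets may not do. The variable names are illustrative.

```javascript
const SENSORS = 100e6;               // 100 million sensors
const INTERVAL_SECONDS = 5 * 60;     // one reading every 5 minutes
const READINGS_PER_HOUR = 3600 / INTERVAL_SECONDS;

// Sustained write rate if readings arrive evenly across the interval.
const writesPerSecond = SENSORS / INTERVAL_SECONDS;

// Daily document counts with one document per reading vs. one per sensor-hour.
const readingsPerDay = SENSORS * READINGS_PER_HOUR * 24;
const hourlyBucketsPerDay = SENSORS * 24;
```

This works out to roughly 333,000 writes per second, 28.8 billion readings per day, and 2.4 billion documents per day if readings are grouped into hourly buckets. Numbers of this size are what push the design toward bucketing and sharding.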

Data Durability Requirements

| Data Type | Durability | Retention |
| --- | --- | --- |
| Raw sensor readings | w: majority | 90 days |
| Aggregated hourly stats | w: majority | 2 years |
| Alert events | w: majority | 1 year |
| System logs | w: 1 | 30 days |

  • Quantify and qualify your queries as much as possible
  • A small set of CRUD operations typically drives the entire schema design
  • Document every query with its frequency, acceptable latency, and data volume
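
Retention periods like those in the durability requirements are often enforced with TTL indexes, which take an `expireAfterSeconds` option. The sketch below converts the retention periods into that option. The key names, the collection name, and the `createdAt` field in the usage comment are assumptions, and years are approximated as 365 days.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Retention periods from the durability requirements, expressed in days.
// "2 years" and "1 year" are approximated as 730 and 365 days.
const RETENTION_DAYS = {
  rawReadings: 90,
  hourlyStats: 730,
  alertEvents: 365,
  systemLogs: 30,
};

// Build the options object for a TTL index with the given retention.
function ttlIndexOptions(days) {
  return { expireAfterSeconds: days * SECONDS_PER_DAY };
}

const rawReadingsTtl = ttlIndexOptions(RETENTION_DAYS.rawReadings);
// In mongosh this would run as:
//   db.rawReadings.createIndex({ createdAt: 1 }, rawReadingsTtl)
```

A TTL index deletes expired documents, so any data that must be archived for long-term analysis needs to be copied elsewhere before it expires.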
