Skip to main content

MongoDB Aggregation Framework: Complete Introduction

The MongoDB Aggregation Framework(Chapter 0 of 6)

Created: June 12, 2021 16 min read

What is the Aggregation Framework?

The MongoDB Aggregation Framework is a declarative data processing pipeline. Documents enter the pipeline from a collection, pass through one or more stages that transform them, and the results emerge at the end. Each stage applies a specific operation — filtering, grouping, sorting, projecting fields, joining collections, or computing new values — and feeds its output into the next stage.

Unlike find() queries which return a flat set of matching documents, the aggregation framework can reshape data across dimensions: compute sums and averages, reshape document structures, unwind arrays into individual documents, join data from multiple collections, and build result sets that would otherwise require multiple queries and application-side logic.

When to Use Aggregation

Use the aggregation framework when you need:

  • Computed statistics: totals, averages, minimums, maximums across groups
  • Data reshaping: transforming nested documents, renaming fields, computing derived values
  • Multi-collection joins: combining data from related collections without multiple queries
  • Time-series analysis: bucketing documents by date ranges and computing rolling aggregates
  • Faceted search: computing multiple aggregations over the same input in a single pass
  • Conditional logic: applying if-then-else transformations within the pipeline
  • Large result processing: streaming transformed results to disk or another collection

Prerequisites

  • A running MongoDB instance (local or Atlas)
  • The mongo shell or mongosh
  • A collection with sample data to experiment on
  • Basic knowledge of MongoDB CRUD operations and query syntax

The Pipeline Concept

A pipeline is a sequence of stages. Each stage is an object that describes a transformation. Documents flow through the pipeline one stage at a time, and each stage can reduce, increase, or reshape the document stream.

Think of an assembly line: raw materials (documents) enter at one end, pass through workstations (stages), and emerge as finished products (aggregated results). If a stage removes documents (like $match), downstream stages see fewer documents. If a stage produces more documents (like $unwind), downstream stages see more.

db.collection.aggregate([
  { $match:   { status: "active" } },
  { $group:   { _id: "$category", total: { $sum: "$amount" } } },
  { $sort:    { total: -1 } },
  { $limit:   10 }
])

The pipeline array is passed to db.collection.aggregate(). Stages execute in the order they appear, and each stage can use any number of expressions — functions that operate on field values, literals, or the results of other expressions.

Stage Ordering Matters

Rearranging stages changes both results and performance. Place $match and $limit as early as possible to reduce the number of documents flowing through downstream stages. A $match at the beginning of a pipeline can use indexes, just like a find() query, while a $match in the middle of a pipeline cannot.

// Efficient: filter first, then transform
db.orders.aggregate([
  { $match:   { date: { $gte: ISODate("2026-01-01") } } },
  { $group:   { _id: "$product", total: { $sum: "$quantity" } } }
])

// Less efficient: transform all documents, then filter
db.orders.aggregate([
  { $group:   { _id: "$product", total: { $sum: "$quantity" } } },
  { $match:   { total: { $gte: 100 } } }
])

Core Stages

MongoDB provides over 30 aggregation stages. The most commonly used stages form the foundation of almost every pipeline.

$match

Filters documents, keeping only those that match the specified condition. Place $match as early as possible in the pipeline. It uses indexes and reduces the number of documents flowing into subsequent stages.

db.sales.aggregate([
  { $match: { region: "EMEA", amount: { $gte: 1000 } } }
])

$project

Passes along documents with only the specified fields. You can include fields, exclude fields, or create new computed fields.

db.sales.aggregate([
  { $project: {
    _id: 0,
    orderId: 1,
    customerName: 1,
    total: { $multiply: ["$price", "$quantity"] },
    tax: { $multiply: ["$price", "$quantity", 0.08] }
  }}
])

$group

Groups documents by a specified expression and applies accumulator expressions to each group. The _id field defines the group key.

db.sales.aggregate([
  { $group: {
    _id: "$region",
    totalSales:  { $sum: "$amount" },
    avgOrder:    { $avg: "$amount" },
    minOrder:    { $min: "$amount" },
    maxOrder:    { $max: "$amount" },
    orderCount:  { $sum: 1 }
  }}
])

$sort

Sorts documents by specified fields. Use 1 for ascending, -1 for descending. When $sort appears early in the pipeline and the sort key is indexed, MongoDB can avoid an in-memory sort.

db.sales.aggregate([
  { $sort: { total: -1, customerName: 1 } }
])

$limit

Passes only the first N documents to the next stage. Combine with $sort to implement top-N queries.

db.sales.aggregate([
  { $sort: { amount: -1 } },
  { $limit: 5 }
])

$skip

Skips the first N documents and passes the rest. Often used with $sort and $limit for pagination.

db.sales.aggregate([
  { $sort: { orderDate: -1 } },
  { $skip: 20 },
  { $limit: 10 }
])

$unwind

Deconstructs an array field, outputting one document per array element. This is one of the most powerful stages because it lets you treat array elements as individual documents for grouping and filtering.

db.orders.aggregate([
  { $unwind: "$items" },
  { $group: {
    _id: "$items.productId",
    totalSold: { $sum: "$items.quantity" }
  }}
])

$lookup

Performs a left outer join with another collection. For each input document, $lookup adds an array field whose elements are matching documents from the “from” collection.

db.orders.aggregate([
  { $lookup: {
    from: "products",
    localField: "productId",
    foreignField: "_id",
    as: "productDetails"
  }},
  { $unwind: "$productDetails" },
  { $project: {
    orderDate: 1,
    productName: "$productDetails.name",
    price: "$productDetails.price"
  }}
])

In MongoDB 5.0 and later, $lookup supports a pipeline syntax for more flexible joins:

db.orders.aggregate([
  { $lookup: {
    from: "products",
    let: { prodId: "$productId" },
    pipeline: [
      { $match: { $expr: { $eq: ["$_id", "$$prodId"] } } },
      { $project: { name: 1, price: 1, category: 1 } }
    ],
    as: "productDetails"
  }}
])

$addFields

Adds new fields to documents without affecting existing fields. Useful for computed values, string manipulations, or date calculations that subsequent stages need.

db.sales.aggregate([
  { $addFields: {
    totalWithTax: { $multiply: ["$amount", 1.08] },
    orderYear: { $year: "$orderDate" },
    orderMonth: { $month: "$orderDate" }
  }}
])

$count

Returns a count of the remaining documents in the pipeline. Useful for quickly checking how many documents survive a series of stages.

db.sales.aggregate([
  { $match: { region: "APAC", amount: { $gte: 500 } } },
  { $count: "highValueApacOrders" }
])
// Result: { "highValueApacOrders": 1247 }

$out

Writes the pipeline results to a collection. If the collection already exists, $out replaces it atomically.

db.sales.aggregate([
  { $match: { year: 2026 } },
  { $group: { _id: "$region", total: { $sum: "$amount" } } },
  { $out: "regional_totals_2026" }
])

$merge

A more flexible alternative to $out. $merge can insert, merge, replace, keep existing, or fail when writing results to a collection. It can also output to a different database.

db.sales.aggregate([
  { $group: { _id: "$region", total: { $sum: "$amount" } } },
  { $merge: {
    into: "region_totals",
    on: "_id",
    whenMatched: "merge",
    whenNotMatched: "insert"
  }}
])

Stage Reference Table

Stage Purpose Syntax Example Memory Limit
$match Filter documents { $match: { status: "active" } } 100MB default per stage
$project Reshape documents { $project: { name: 1, total: 1 } } 100MB default per stage
$group Group by key, compute aggregates { $group: { _id: "$city", avg: { $avg: "$age" } } } 100MB default per stage
$sort Order documents { $sort: { price: -1 } } 100MB default per stage
$limit Pass first N documents { $limit: 10 } None
$skip Skip N documents { $skip: 20 } None
$unwind Deconstruct array { $unwind: "$tags" } None
$lookup Join with another collection { $lookup: { from: "orders", ... } } None (uses indexes)
$addFields Add computed fields { $addFields: { total: { $add: ["$a", "$b"] } } } None
$count Count remaining docs { $count: "total" } None
$out Write results to collection { $out: "summary" } None
$merge Merge into collection { $merge: { into: "target" } } None
$bucket Bucket documents into ranges { $bucket: { groupBy: "$price", boundaries: [0,50,100] } } 100MB default
$facet Multi-faceted aggregation { $facet: { stats: [...], top: [...] } } 100MB per facet
$replaceWith Replace doc with sub-doc { $replaceWith: "$address" } None

Stages marked with a 100MB memory limit throw an error if they exceed that limit. Use { allowDiskUse: true } to bypass the limit by writing temporary data to disk.


Expressions and Operators

Expressions are the building blocks of aggregation stages. They evaluate to a value and can reference field paths, literals, or other expressions.

Field Paths and Variables

  • $fieldName — references the value of fieldName in the current document
  • $$CURRENT.fieldName — same as $fieldName, explicit syntax
  • $$ROOT — references the root document
  • $$NOW — the current datetime
  • $$CLUSTER_TIME — the current cluster time

Accumulators (used in $group)

Operator Description Example
$sum Sum of values { $sum: "$amount" }
$avg Average of values { $avg: "$score" }
$min Minimum value { $min: "$price" }
$max Maximum value { $max: "$price" }
$push Array of all values { $push: "$name" }
$addToSet Array of unique values { $addToSet: "$category" }
$first First value in group { $first: "$createdAt" }
$last Last value in group { $last: "$updatedAt" }
$stdDevPop Population standard deviation { $stdDevPop: "$score" }
$stdDevSamp Sample standard deviation { $stdDevSamp: "$score" }

Arithmetic Expression Operators

{ $add: ["$price", "$tax"] }
{ $subtract: ["$total", "$discount"] }
{ $multiply: ["$quantity", "$unitPrice"] }
{ $divide: ["$total", "$count"] }
{ $mod: ["$amount", 100] }
{ $abs: "$netValue" }
{ $ceil: "$average" }
{ $floor: "$average" }
{ $round: ["$amount", 2] }

String Expression Operators

{ $concat: ["$firstName", " ", "$lastName"] }
{ $toUpper: "$name" }
{ $toLower: "$email" }
{ $substr: ["$phone", 0, 3] }
{ $trim: { input: "$username" } }
{ $split: ["$tags", ","] }
{ $regexMatch: { input: "$email", regex: /@company\.com$/ } }

Date Expression Operators

{ $year: "$orderDate" }
{ $month: "$orderDate" }
{ $dayOfMonth: "$orderDate" }
{ $dayOfWeek: "$orderDate" }
{ $hour: "$timestamp" }
{ $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } }
{ $dateFromParts: { year: "$year", month: "$month", day: "$day" } }

Conditional Expression Operators

$cond evaluates a boolean expression and returns one of two values:

db.sales.aggregate([
  { $project: {
    orderId: 1,
    status: {
      $cond: {
        if:   { $gte: ["$amount", 1000] },
        then: "High Value",
        else: "Standard"
      }
    }
  }}
])

$switch handles multiple conditions:

db.sales.aggregate([
  { $project: {
    orderId: 1,
    tier: {
      $switch: {
        branches: [
          { case: { $gte: ["$amount", 10000] }, then: "Platinum" },
          { case: { $gte: ["$amount", 5000] },  then: "Gold" },
          { case: { $gte: ["$amount", 1000] },  then: "Silver" }
        ],
        default: "Bronze"
      }
    }
  }}
])

$ifNull returns a substitute value when a field is null or missing:

{ $ifNull: ["$discount", 0] }

Complete Pipeline Examples

Basic Pipeline: Top Products by Revenue

db.orders.aggregate([
  { $match:   { status: "completed" } },
  { $unwind:  "$items" },
  { $group:   {
    _id: "$items.productName",
    revenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } },
    unitsSold: { $sum: "$items.quantity" }
  }},
  { $sort:    { revenue: -1 } },
  { $limit:   10 },
  { $project: {
    _id: 0,
    product: "$_id",
    revenue: { $round: ["$revenue", 2] },
    unitsSold: 1
  }}
])

Pipeline with $lookup and $unwind

db.orders.aggregate([
  { $match:   { status: "shipped" } },
  { $lookup:  {
    from: "customers",
    localField: "customerId",
    foreignField: "_id",
    as: "customer"
  }},
  { $unwind:  "$customer" },
  { $group:   {
    _id: "$customer.region",
    totalRevenue: { $sum: "$total" },
    topCustomer: { $max: "$customer.name" },
    orderCount:  { $sum: 1 }
  }},
  { $sort:    { totalRevenue: -1 } },
  { $out:     "regional_revenue" }
])

Conditional Aggregation with $cond

db.sales.aggregate([
  { $group: {
    _id: "$salesperson",
    total:      { $sum: "$amount" },
    highValue:  { $sum: { $cond: [{ $gte: ["$amount", 1000] }, 1, 0] } },
    lowValue:   { $sum: { $cond: [{ $lt: ["$amount", 1000] }, 1, 0] } }
  }},
  { $project: {
    salesperson: "$_id",
    total: 1,
    highValue: 1,
    lowValue: 1,
    highValueRate: {
      $cond: [
        { $gt: ["$total", 0] },
        { $round: [{ $divide: ["$highValue", { $add: ["$highValue", "$lowValue"] }] }, 2] },
        0
      ]
    }
  }}
])

Bucket Stage for Histograms

$bucket categorizes documents into ranges. This example creates a price histogram:

db.products.aggregate([
  { $bucket: {
    groupBy: "$price",
    boundaries: [0, 10, 25, 50, 100, 250, 500, 1000],
    default: "Other",
    output: {
      count: { $sum: 1 },
      avgPrice: { $avg: "$price" },
      products: { $push: "$name" }
    }
  }}
])

Facet Stage for Multi-Faceted Analysis

$facet runs multiple sub-pipelines on the same set of input documents. Each sub-pipeline produces its own output array:

db.orders.aggregate([
  { $match: { status: "completed", date: { $gte: ISODate("2026-01-01") } } },
  { $facet: {
    revenueByRegion: [
      { $group: { _id: "$region", revenue: { $sum: "$total" } } },
      { $sort: { revenue: -1 } }
    ],
    topProducts: [
      { $unwind: "$items" },
      { $group: { _id: "$items.productName", sold: { $sum: "$items.quantity" } } },
      { $sort: { sold: -1 } },
      { $limit: 5 }
    ],
    summary: [
      { $group: {
        _id: null,
        totalOrders: { $sum: 1 },
        totalRevenue: { $sum: "$total" },
        avgOrderValue: { $avg: "$total" }
      }}
    ]
  }}
])

The result is a single document with three keys, each containing an array:

{
  "revenueByRegion": [ { "_id": "EMEA", "revenue": 452000 }, ... ],
  "topProducts": [ { "_id": "Widget Pro", "sold": 1420 }, ... ],
  "summary": [ { "_id": null, "totalOrders": 8453, "totalRevenue": 1892000, "avgOrderValue": 223.81 } ]
}

Aggregation Options

The aggregate() method accepts an optional second parameter — an options document:

db.collection.aggregate(pipeline, {
  allowDiskUse: true,
  explain: true,
  collation: { locale: "en", strength: 2 },
  hint: { status: 1, date: -1 },
  bypassDocumentValidation: true,
  comment: "monthly-report-q1-2026"
})
Option Type Description
allowDiskUse boolean Enable disk-based sorting and grouping for stages exceeding 100MB memory
explain boolean Return query execution plan instead of results
collation document Specify locale rules for string comparison
hint document Force use of a specific index
bypassDocumentValidation boolean Skip document validation on write operations
comment string Attach a comment visible in logs and profiler

allowDiskUse: true

Stages like $sort and $group have a 100MB memory limit. When this limit is reached, MongoDB returns an error unless allowDiskUse: true is specified, which spills temporary data to disk:

db.largeCollection.aggregate([
  { $group: { _id: "$field", total: { $sum: "$value" } } },
  { $sort: { total: -1 } }
], { allowDiskUse: true })

explain: true

Use explain to understand how MongoDB executes your pipeline, which indexes it uses, and how many documents each stage processes:

const explanation = db.orders.explain("executionStats").aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
])

Collation

Collation specifies language-specific rules for string comparison — case sensitivity, accent sensitivity, and locale:

db.products.aggregate([
  { $match: { name: { $gte: "apple", $lte: "banana" } } },
  { $group: { _id: "$category", count: { $sum: 1 } } }
], { collation: { locale: "en", strength: 1 } })
// strength: 1 ignores case and diacritics

Hint

Force the query optimizer to use a specific index:

db.orders.aggregate([
  { $match: { status: "pending" } },
  { $group: { _id: "$region", count: { $sum: 1 } } }
], { hint: { status: 1 } })

Pipeline Optimization

Projection Optimization

If a later stage only needs a subset of fields, project early to reduce the document size flowing through the pipeline:

// Without optimization: all fields pass through $match
db.users.aggregate([
  { $match: { status: "active" } },
  { $group: { _id: "$department", avg: { $avg: "$salary" } } }
])

// With optimization: discard unused fields before $group
db.users.aggregate([
  { $match: { status: "active" } },
  { $project: { department: 1, salary: 1, _id: 0 } },
  { $group: { _id: "$department", avg: { $avg: "$salary" } } }
])

Match Optimization

Place $match as early as possible. MongoDB can use indexes for $match only when it is the first stage in the pipeline:

// Optimal: $match uses index on status + date
db.orders.aggregate([
  { $match: { status: "completed", date: { $gte: ISODate("2026-01-01") } } },
  { $group: { _id: "$region", total: { $sum: "$amount" } } }
])

// Suboptimal: $sort executes before $match, no index benefit
db.orders.aggregate([
  { $sort: { amount: -1 } },
  { $match: { status: "completed" } }
])

Sort Optimization

When $sort immediately follows $match or appears early in the pipeline, MongoDB can use an index for the sort. If not, MongoDB performs an in-memory sort limited to 100MB by default:

// Can use index on { date: -1 }
db.events.aggregate([
  { $match: { type: "purchase" } },
  { $sort: { date: -1 } },
  { $limit: 50 }
])

// Cannot use index for sort (after $group)
db.events.aggregate([
  { $group: { _id: "$type", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

Limit and Sort Together

When $sort is followed by $limit, MongoDB optimizes this into a top-k sort that only keeps the K best results in memory:

// MongoDB's top-k optimization: only 10 results kept in memory during sort
db.products.aggregate([
  { $sort: { rating: -1 } },
  { $limit: 10 }
])

Comparison: Aggregation vs find()

Feature find() aggregate()
Simple filtering Yes Yes
Field projection Yes (limited) Yes (full)
Sorting Yes Yes
Grouping / aggregation No Yes
Joins No ($lookup) Yes
Computed fields No Yes
Array unwinding No Yes
Conditional logic No Yes
Multi-stage transformations No Yes
Pipeline limit N/A 100MB per stage (default)
Results as cursor Yes Yes (v3.2+)
Results to collection No Yes ($out, $merge)

When find() is Better

Use find() when you need a simple filter with a projection and sort. It is easier to write, supports cursor-based iteration efficiently, and integrates with the Query Planner for optimal index selection:

db.products.find(
  { category: "Electronics", price: { $lte: 100 } },
  { name: 1, price: 1, _id: 0 }
).sort({ price: 1 }).limit(20)

When Aggregation is Better

Use aggregate() when you need to group, compute, join, or reshape data. The pipeline model lets you compose complex transformations that would require multiple queries and application-side logic with find():

// This single pipeline replaces: a find, a group + reduce in app code,
// a join query, and a sort
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $lookup: { from: "customers", localField: "customerId", foreignField: "_id", as: "customer" }},
  { $unwind: "$customer" },
  { $group: { _id: "$customer.region", total: { $sum: "$total" }, orders: { $push: "$_id" } }},
  { $sort: { total: -1 } }
])

Performance Tips

  1. Position $match and $limit first: Reduces the number of documents entering subsequent stages. A first-stage $match can use indexes.

  2. Use indexes for early stages: Create indexes that support your $match conditions and any early $sort operations.

  3. Project early, project often: Remove unnecessary fields with $project as soon as possible to reduce memory pressure.

  4. Prefer $match over $project for filtering: $project cannot use indexes. Filter with $match first, then transform with $project.

  5. Use allowDiskUse for large datasets: Without this flag, stages like $sort and $group fail when they exceed 100MB of RAM.

  6. Prefer $addFields over $project for adding fields: $addFields preserves existing fields, so you don’t need to explicitly list every field you want to keep.

  7. **Use $lookup with indexes**: Ensure the foreign field in the joined collection is indexed. In MongoDB 5.0+, use the pipeline syntax with `$match` on indexed fields.

  8. Avoid $unwind followed by $group when possible: If you only need aggregated values from array elements, consider using $reduce or array expression operators instead, which avoid document multiplication.

  9. Monitor stage memory with explain(): Use db.collection.explain("executionStats").aggregate(pipeline) to see how many documents each stage processes and whether indexes are being used.

  10. **Use $facet sparingly**: While powerful, `$facet` runs all sub-pipelines on the same input set, so expensive operations inside sub-pipelines multiply their cost. Each sub-pipeline also has its own 100MB memory limit.


Resources

Comments

Share this article

Scan to read on mobile