MongoDB Aggregation Framework: Complete Introduction

What is the Aggregation Framework?

The MongoDB Aggregation Framework is a declarative data processing pipeline. Documents enter the pipeline from a collection, pass through one or more stages that transform them, and the results emerge at the end. Each stage applies a specific operation — filtering, grouping, sorting, projecting fields, joining collections, or computing new values — and feeds its output into the next stage.

Unlike find() queries which return a flat set of matching documents, the aggregation framework can reshape data across dimensions: compute sums and averages, reshape document structures, unwind arrays into individual documents, join data from multiple collections, and build result sets that would otherwise require multiple queries and application-side logic.

When to Use Aggregation

Use the aggregation framework when you need:

Computed statistics: totals, averages, minimums, maximums across groups
Data reshaping: transforming nested documents, renaming fields, computing derived values
Multi-collection joins: combining data from related collections without multiple queries
Time-series analysis: bucketing documents by date ranges and computing rolling aggregates
Faceted search: computing multiple aggregations over the same input in a single pass
Conditional logic: applying if-then-else transformations within the pipeline
Large result processing: streaming transformed results to disk or another collection

Prerequisites

A running MongoDB instance (local or Atlas)
The mongo shell or mongosh
A collection with sample data to experiment on
Basic knowledge of MongoDB CRUD operations and query syntax

The Pipeline Concept

A pipeline is a sequence of stages. Each stage is an object that describes a transformation. Documents flow through the pipeline one stage at a time, and each stage can reduce, increase, or reshape the document stream.

Think of an assembly line: raw materials (documents) enter at one end, pass through workstations (stages), and emerge as finished products (aggregated results). If a stage removes documents (like $match), downstream stages see fewer documents. If a stage produces more documents (like $unwind), downstream stages see more.

db.collection.aggregate([
  { $match:   { status: "active" } },
  { $group:   { _id: "$category", total: { $sum: "$amount" } } },
  { $sort:    { total: -1 } },
  { $limit:   10 }
])

The pipeline array is passed to db.collection.aggregate(). Stages execute in the order they appear, and each stage can use any number of expressions — functions that operate on field values, literals, or the results of other expressions.

Stage Ordering Matters

Rearranging stages changes both results and performance. Place $match and $limit as early as possible to reduce the number of documents flowing through downstream stages. A $match at the beginning of a pipeline can use indexes, just like a find() query, while a $match in the middle of a pipeline cannot.

// Efficient: filter first, then transform
db.orders.aggregate([
  { $match:   { date: { $gte: ISODate("2026-01-01") } } },
  { $group:   { _id: "$product", total: { $sum: "$quantity" } } }
])

// Less efficient: transform all documents, then filter
db.orders.aggregate([
  { $group:   { _id: "$product", total: { $sum: "$quantity" } } },
  { $match:   { total: { $gte: 100 } } }
])

Core Stages

MongoDB provides over 30 aggregation stages. The most commonly used stages form the foundation of almost every pipeline.

$match

Filters documents, keeping only those that match the specified condition. Place $match as early as possible in the pipeline. It uses indexes and reduces the number of documents flowing into subsequent stages.

db.sales.aggregate([
  { $match: { region: "EMEA", amount: { $gte: 1000 } } }
])

$project

Passes along documents with only the specified fields. You can include fields, exclude fields, or create new computed fields.

db.sales.aggregate([
  { $project: {
    _id: 0,
    orderId: 1,
    customerName: 1,
    total: { $multiply: ["$price", "$quantity"] },
    tax: { $multiply: ["$price", "$quantity", 0.08] }
  }}
])

$group

Groups documents by a specified expression and applies accumulator expressions to each group. The _id field defines the group key.

db.sales.aggregate([
  { $group: {
    _id: "$region",
    totalSales:  { $sum: "$amount" },
    avgOrder:    { $avg: "$amount" },
    minOrder:    { $min: "$amount" },
    maxOrder:    { $max: "$amount" },
    orderCount:  { $sum: 1 }
  }}
])

$sort

Sorts documents by specified fields. Use 1 for ascending, -1 for descending. When $sort appears early in the pipeline and the sort key is indexed, MongoDB can avoid an in-memory sort.

db.sales.aggregate([
  { $sort: { total: -1, customerName: 1 } }
])

$limit

Passes only the first N documents to the next stage. Combine with $sort to implement top-N queries.

db.sales.aggregate([
  { $sort: { amount: -1 } },
  { $limit: 5 }
])

$skip

Skips the first N documents and passes the rest. Often used with $sort and $limit for pagination.

db.sales.aggregate([
  { $sort: { orderDate: -1 } },
  { $skip: 20 },
  { $limit: 10 }
])

$unwind

Deconstructs an array field, outputting one document per array element. This is one of the most powerful stages because it lets you treat array elements as individual documents for grouping and filtering.

db.orders.aggregate([
  { $unwind: "$items" },
  { $group: {
    _id: "$items.productId",
    totalSold: { $sum: "$items.quantity" }
  }}
])

$lookup

Performs a left outer join with another collection. For each input document, $lookup adds an array field whose elements are matching documents from the “from” collection.

db.orders.aggregate([
  { $lookup: {
    from: "products",
    localField: "productId",
    foreignField: "_id",
    as: "productDetails"
  }},
  { $unwind: "$productDetails" },
  { $project: {
    orderDate: 1,
    productName: "$productDetails.name",
    price: "$productDetails.price"
  }}
])

In MongoDB 5.0 and later, $lookup supports a pipeline syntax for more flexible joins:

db.orders.aggregate([
  { $lookup: {
    from: "products",
    let: { prodId: "$productId" },
    pipeline: [
      { $match: { $expr: { $eq: ["$_id", "$$prodId"] } } },
      { $project: { name: 1, price: 1, category: 1 } }
    ],
    as: "productDetails"
  }}
])

$addFields

Adds new fields to documents without affecting existing fields. Useful for computed values, string manipulations, or date calculations that subsequent stages need.

db.sales.aggregate([
  { $addFields: {
    totalWithTax: { $multiply: ["$amount", 1.08] },
    orderYear: { $year: "$orderDate" },
    orderMonth: { $month: "$orderDate" }
  }}
])

$count

Returns a count of the remaining documents in the pipeline. Useful for quickly checking how many documents survive a series of stages.

db.sales.aggregate([
  { $match: { region: "APAC", amount: { $gte: 500 } } },
  { $count: "highValueApacOrders" }
])
// Result: { "highValueApacOrders": 1247 }

$out

Writes the pipeline results to a collection. If the collection already exists, $out replaces it atomically.

db.sales.aggregate([
  { $match: { year: 2026 } },
  { $group: { _id: "$region", total: { $sum: "$amount" } } },
  { $out: "regional_totals_2026" }
])

$merge

A more flexible alternative to $out. $merge can insert, merge, replace, keep existing, or fail when writing results to a collection. It can also output to a different database.

db.sales.aggregate([
  { $group: { _id: "$region", total: { $sum: "$amount" } } },
  { $merge: {
    into: "region_totals",
    on: "_id",
    whenMatched: "merge",
    whenNotMatched: "insert"
  }}
])

Stage Reference Table

Stage	Purpose	Syntax Example	Memory Limit
`$match`	Filter documents	`{ $match: { status: "active" } }`	100MB default per stage
`$project`	Reshape documents	`{ $project: { name: 1, total: 1 } }`	100MB default per stage
`$group`	Group by key, compute aggregates	`{ $group: { _id: "$city", avg: { $avg: "$age" } } }`	100MB default per stage
`$sort`	Order documents	`{ $sort: { price: -1 } }`	100MB default per stage
`$limit`	Pass first N documents	`{ $limit: 10 }`	None
`$skip`	Skip N documents	`{ $skip: 20 }`	None
`$unwind`	Deconstruct array	`{ $unwind: "$tags" }`	None
`$lookup`	Join with another collection	`{ $lookup: { from: "orders", ... } }`	None (uses indexes)
`$addFields`	Add computed fields	`{ $addFields: { total: { $add: ["$a", "$b"] } } }`	None
`$count`	Count remaining docs	`{ $count: "total" }`	None
`$out`	Write results to collection	`{ $out: "summary" }`	None
`$merge`	Merge into collection	`{ $merge: { into: "target" } }`	None
`$bucket`	Bucket documents into ranges	`{ $bucket: { groupBy: "$price", boundaries: [0,50,100] } }`	100MB default
`$facet`	Multi-faceted aggregation	`{ $facet: { stats: [...], top: [...] } }`	100MB per facet
`$replaceWith`	Replace doc with sub-doc	`{ $replaceWith: "$address" }`	None

Stages marked with a 100MB memory limit throw an error if they exceed that limit. Use { allowDiskUse: true } to bypass the limit by writing temporary data to disk.

Expressions and Operators

Expressions are the building blocks of aggregation stages. They evaluate to a value and can reference field paths, literals, or other expressions.

Field Paths and Variables

$fieldName — references the value of fieldName in the current document
$$CURRENT.fieldName — same as $fieldName, explicit syntax
$$ROOT — references the root document
$$NOW — the current datetime
$$CLUSTER_TIME — the current cluster time

Accumulators (used in $group)

Operator	Description	Example
`$sum`	Sum of values	`{ $sum: "$amount" }`
`$avg`	Average of values	`{ $avg: "$score" }`
`$min`	Minimum value	`{ $min: "$price" }`
`$max`	Maximum value	`{ $max: "$price" }`
`$push`	Array of all values	`{ $push: "$name" }`
`$addToSet`	Array of unique values	`{ $addToSet: "$category" }`
`$first`	First value in group	`{ $first: "$createdAt" }`
`$last`	Last value in group	`{ $last: "$updatedAt" }`
`$stdDevPop`	Population standard deviation	`{ $stdDevPop: "$score" }`
`$stdDevSamp`	Sample standard deviation	`{ $stdDevSamp: "$score" }`

Arithmetic Expression Operators

{ $add: ["$price", "$tax"] }
{ $subtract: ["$total", "$discount"] }
{ $multiply: ["$quantity", "$unitPrice"] }
{ $divide: ["$total", "$count"] }
{ $mod: ["$amount", 100] }
{ $abs: "$netValue" }
{ $ceil: "$average" }
{ $floor: "$average" }
{ $round: ["$amount", 2] }

String Expression Operators

{ $concat: ["$firstName", " ", "$lastName"] }
{ $toUpper: "$name" }
{ $toLower: "$email" }
{ $substr: ["$phone", 0, 3] }
{ $trim: { input: "$username" } }
{ $split: ["$tags", ","] }
{ $regexMatch: { input: "$email", regex: /@company\.com$/ } }

Date Expression Operators

{ $year: "$orderDate" }
{ $month: "$orderDate" }
{ $dayOfMonth: "$orderDate" }
{ $dayOfWeek: "$orderDate" }
{ $hour: "$timestamp" }
{ $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } }
{ $dateFromParts: { year: "$year", month: "$month", day: "$day" } }

Conditional Expression Operators

$cond evaluates a boolean expression and returns one of two values:

db.sales.aggregate([
  { $project: {
    orderId: 1,
    status: {
      $cond: {
        if:   { $gte: ["$amount", 1000] },
        then: "High Value",
        else: "Standard"
      }
    }
  }}
])

$switch handles multiple conditions:

db.sales.aggregate([
  { $project: {
    orderId: 1,
    tier: {
      $switch: {
        branches: [
          { case: { $gte: ["$amount", 10000] }, then: "Platinum" },
          { case: { $gte: ["$amount", 5000] },  then: "Gold" },
          { case: { $gte: ["$amount", 1000] },  then: "Silver" }
        ],
        default: "Bronze"
      }
    }
  }}
])

$ifNull returns a substitute value when a field is null or missing:

{ $ifNull: ["$discount", 0] }

Complete Pipeline Examples

Basic Pipeline: Top Products by Revenue

db.orders.aggregate([
  { $match:   { status: "completed" } },
  { $unwind:  "$items" },
  { $group:   {
    _id: "$items.productName",
    revenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } },
    unitsSold: { $sum: "$items.quantity" }
  }},
  { $sort:    { revenue: -1 } },
  { $limit:   10 },
  { $project: {
    _id: 0,
    product: "$_id",
    revenue: { $round: ["$revenue", 2] },
    unitsSold: 1
  }}
])

Pipeline with $lookup and $unwind

db.orders.aggregate([
  { $match:   { status: "shipped" } },
  { $lookup:  {
    from: "customers",
    localField: "customerId",
    foreignField: "_id",
    as: "customer"
  }},
  { $unwind:  "$customer" },
  { $group:   {
    _id: "$customer.region",
    totalRevenue: { $sum: "$total" },
    topCustomer: { $max: "$customer.name" },
    orderCount:  { $sum: 1 }
  }},
  { $sort:    { totalRevenue: -1 } },
  { $out:     "regional_revenue" }
])

Conditional Aggregation with $cond

db.sales.aggregate([
  { $group: {
    _id: "$salesperson",
    total:      { $sum: "$amount" },
    highValue:  { $sum: { $cond: [{ $gte: ["$amount", 1000] }, 1, 0] } },
    lowValue:   { $sum: { $cond: [{ $lt: ["$amount", 1000] }, 1, 0] } }
  }},
  { $project: {
    salesperson: "$_id",
    total: 1,
    highValue: 1,
    lowValue: 1,
    highValueRate: {
      $cond: [
        { $gt: ["$total", 0] },
        { $round: [{ $divide: ["$highValue", { $add: ["$highValue", "$lowValue"] }] }, 2] },
        0
      ]
    }
  }}
])

Bucket Stage for Histograms

$bucket categorizes documents into ranges. This example creates a price histogram:

db.products.aggregate([
  { $bucket: {
    groupBy: "$price",
    boundaries: [0, 10, 25, 50, 100, 250, 500, 1000],
    default: "Other",
    output: {
      count: { $sum: 1 },
      avgPrice: { $avg: "$price" },
      products: { $push: "$name" }
    }
  }}
])

$facet runs multiple sub-pipelines on the same set of input documents. Each sub-pipeline produces its own output array:

db.orders.aggregate([
  { $match: { status: "completed", date: { $gte: ISODate("2026-01-01") } } },
  { $facet: {
    revenueByRegion: [
      { $group: { _id: "$region", revenue: { $sum: "$total" } } },
      { $sort: { revenue: -1 } }
    ],
    topProducts: [
      { $unwind: "$items" },
      { $group: { _id: "$items.productName", sold: { $sum: "$items.quantity" } } },
      { $sort: { sold: -1 } },
      { $limit: 5 }
    ],
    summary: [
      { $group: {
        _id: null,
        totalOrders: { $sum: 1 },
        totalRevenue: { $sum: "$total" },
        avgOrderValue: { $avg: "$total" }
      }}
    ]
  }}
])

The result is a single document with three keys, each containing an array:

{
  "revenueByRegion": [ { "_id": "EMEA", "revenue": 452000 }, ... ],
  "topProducts": [ { "_id": "Widget Pro", "sold": 1420 }, ... ],
  "summary": [ { "_id": null, "totalOrders": 8453, "totalRevenue": 1892000, "avgOrderValue": 223.81 } ]
}

Aggregation Options

The aggregate() method accepts an optional second parameter — an options document:

db.collection.aggregate(pipeline, {
  allowDiskUse: true,
  explain: true,
  collation: { locale: "en", strength: 2 },
  hint: { status: 1, date: -1 },
  bypassDocumentValidation: true,
  comment: "monthly-report-q1-2026"
})

Option	Type	Description
`allowDiskUse`	boolean	Enable disk-based sorting and grouping for stages exceeding 100MB memory
`explain`	boolean	Return query execution plan instead of results
`collation`	document	Specify locale rules for string comparison
`hint`	document	Force use of a specific index
`bypassDocumentValidation`	boolean	Skip document validation on write operations
`comment`	string	Attach a comment visible in logs and profiler

allowDiskUse: true

Stages like $sort and $group have a 100MB memory limit. When this limit is reached, MongoDB returns an error unless allowDiskUse: true is specified, which spills temporary data to disk:

db.largeCollection.aggregate([
  { $group: { _id: "$field", total: { $sum: "$value" } } },
  { $sort: { total: -1 } }
], { allowDiskUse: true })

explain: true

Use explain to understand how MongoDB executes your pipeline, which indexes it uses, and how many documents each stage processes:

const explanation = db.orders.explain("executionStats").aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
])

Collation

Collation specifies language-specific rules for string comparison — case sensitivity, accent sensitivity, and locale:

db.products.aggregate([
  { $match: { name: { $gte: "apple", $lte: "banana" } } },
  { $group: { _id: "$category", count: { $sum: 1 } } }
], { collation: { locale: "en", strength: 1 } })
// strength: 1 ignores case and diacritics

Hint

Force the query optimizer to use a specific index:

db.orders.aggregate([
  { $match: { status: "pending" } },
  { $group: { _id: "$region", count: { $sum: 1 } } }
], { hint: { status: 1 } })

Pipeline Optimization

Projection Optimization

If a later stage only needs a subset of fields, project early to reduce the document size flowing through the pipeline:

// Without optimization: all fields pass through $match
db.users.aggregate([
  { $match: { status: "active" } },
  { $group: { _id: "$department", avg: { $avg: "$salary" } } }
])

// With optimization: discard unused fields before $group
db.users.aggregate([
  { $match: { status: "active" } },
  { $project: { department: 1, salary: 1, _id: 0 } },
  { $group: { _id: "$department", avg: { $avg: "$salary" } } }
])

Match Optimization

Place $match as early as possible. MongoDB can use indexes for $match only when it is the first stage in the pipeline:

// Optimal: $match uses index on status + date
db.orders.aggregate([
  { $match: { status: "completed", date: { $gte: ISODate("2026-01-01") } } },
  { $group: { _id: "$region", total: { $sum: "$amount" } } }
])

// Suboptimal: $sort executes before $match, no index benefit
db.orders.aggregate([
  { $sort: { amount: -1 } },
  { $match: { status: "completed" } }
])

Sort Optimization

When $sort immediately follows $match or appears early in the pipeline, MongoDB can use an index for the sort. If not, MongoDB performs an in-memory sort limited to 100MB by default:

// Can use index on { date: -1 }
db.events.aggregate([
  { $match: { type: "purchase" } },
  { $sort: { date: -1 } },
  { $limit: 50 }
])

// Cannot use index for sort (after $group)
db.events.aggregate([
  { $group: { _id: "$type", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

Limit and Sort Together

When $sort is followed by $limit, MongoDB optimizes this into a top-k sort that only keeps the K best results in memory:

// MongoDB's top-k optimization: only 10 results kept in memory during sort
db.products.aggregate([
  { $sort: { rating: -1 } },
  { $limit: 10 }
])

Comparison: Aggregation vs find()

Feature	`find()`	`aggregate()`
Simple filtering	Yes	Yes
Field projection	Yes (limited)	Yes (full)
Sorting	Yes	Yes
Grouping / aggregation	No	Yes
Joins	No (`$lookup`)	Yes
Computed fields	No	Yes
Array unwinding	No	Yes
Conditional logic	No	Yes
Multi-stage transformations	No	Yes
Pipeline limit	N/A	100MB per stage (default)
Results as cursor	Yes	Yes (v3.2+)
Results to collection	No	Yes (`$out`, `$merge`)

When find() is Better

Use find() when you need a simple filter with a projection and sort. It is easier to write, supports cursor-based iteration efficiently, and integrates with the Query Planner for optimal index selection:

db.products.find(
  { category: "Electronics", price: { $lte: 100 } },
  { name: 1, price: 1, _id: 0 }
).sort({ price: 1 }).limit(20)

When Aggregation is Better

Use aggregate() when you need to group, compute, join, or reshape data. The pipeline model lets you compose complex transformations that would require multiple queries and application-side logic with find():

// This single pipeline replaces: a find, a group + reduce in app code,
// a join query, and a sort
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $lookup: { from: "customers", localField: "customerId", foreignField: "_id", as: "customer" }},
  { $unwind: "$customer" },
  { $group: { _id: "$customer.region", total: { $sum: "$total" }, orders: { $push: "$_id" } }},
  { $sort: { total: -1 } }
])

Performance Tips

Position $match and $limit first: Reduces the number of documents entering subsequent stages. A first-stage $match can use indexes.
Use indexes for early stages: Create indexes that support your $match conditions and any early $sort operations.
Project early, project often: Remove unnecessary fields with $project as soon as possible to reduce memory pressure.
Prefer $match over $project for filtering: $project cannot use indexes. Filter with $match first, then transform with $project.
Use allowDiskUse for large datasets: Without this flag, stages like $sort and $group fail when they exceed 100MB of RAM.
Prefer $addFields over $project for adding fields: $addFields preserves existing fields, so you don’t need to explicitly list every field you want to keep.
**Use $lookup with indexes**: Ensure the foreign field in the joined collection is indexed. In MongoDB 5.0+, use the pipeline syntax with `$match` on indexed fields.
Avoid $unwind followed by $group when possible: If you only need aggregated values from array elements, consider using $reduce or array expression operators instead, which avoid document multiplication.
Monitor stage memory with explain(): Use db.collection.explain("executionStats").aggregate(pipeline) to see how many documents each stage processes and whether indexes are being used.
**Use $facet sparingly**: While powerful, `$facet` runs all sub-pipelines on the same input set, so expensive operations inside sub-pipelines multiply their cost. Each sub-pipeline also has its own 100MB memory limit.

Resources

Aggregation Pipeline Quick Reference — official stage and operator reference
Aggregation Pipeline Stages — complete stage documentation
Aggregation Pipeline Optimization — MongoDB’s automatic optimizations
Aggregation Pipeline Limits — memory limits and restrictions
SQL to Aggregation Mapping Chart — translate SQL queries to aggregation pipelines
M121: MongoDB Aggregation Framework — the original MongoDB University course

What is the Aggregation Framework?

When to Use Aggregation

Prerequisites

The Pipeline Concept

Stage Ordering Matters

Core Stages

$match

$project

$group

$sort

$limit

$skip

$unwind

$lookup

$addFields

$count

$out

$merge

Stage Reference Table

Expressions and Operators

Field Paths and Variables

Accumulators (used in $group)

Arithmetic Expression Operators

String Expression Operators

Date Expression Operators

Conditional Expression Operators

Complete Pipeline Examples

Basic Pipeline: Top Products by Revenue

Pipeline with $lookup and $unwind

Conditional Aggregation with $cond

Bucket Stage for Histograms

Facet Stage for Multi-Faceted Analysis

Aggregation Options

allowDiskUse: true

explain: true

Collation

Hint

Pipeline Optimization

Projection Optimization

Match Optimization

Sort Optimization

Limit and Sort Together

Comparison: Aggregation vs find()

When find() is Better

When Aggregation is Better

Performance Tips

Resources

Comments

Share this article

👍 Was this article helpful?