What is the Aggregation Framework?
The MongoDB Aggregation Framework is a declarative data processing pipeline. Documents enter the pipeline from a collection, pass through one or more stages that transform them, and the results emerge at the end. Each stage applies a specific operation — filtering, grouping, sorting, projecting fields, joining collections, or computing new values — and feeds its output into the next stage.
Unlike find() queries which return a flat set of matching documents, the aggregation framework can reshape data across dimensions: compute sums and averages, reshape document structures, unwind arrays into individual documents, join data from multiple collections, and build result sets that would otherwise require multiple queries and application-side logic.
When to Use Aggregation
Use the aggregation framework when you need:
- Computed statistics: totals, averages, minimums, maximums across groups
- Data reshaping: transforming nested documents, renaming fields, computing derived values
- Multi-collection joins: combining data from related collections without multiple queries
- Time-series analysis: bucketing documents by date ranges and computing rolling aggregates
- Faceted search: computing multiple aggregations over the same input in a single pass
- Conditional logic: applying if-then-else transformations within the pipeline
- Large result processing: streaming transformed results to disk or another collection
Prerequisites
- A running MongoDB instance (local or Atlas)
- The mongo shell or mongosh
- A collection with sample data to experiment on
- Basic knowledge of MongoDB CRUD operations and query syntax
The Pipeline Concept
A pipeline is a sequence of stages. Each stage is an object that describes a transformation. Documents flow through the pipeline one stage at a time, and each stage can reduce, increase, or reshape the document stream.
Think of an assembly line: raw materials (documents) enter at one end, pass through workstations (stages), and emerge as finished products (aggregated results). If a stage removes documents (like $match), downstream stages see fewer documents. If a stage produces more documents (like $unwind), downstream stages see more.
db.collection.aggregate([
{ $match: { status: "active" } },
{ $group: { _id: "$category", total: { $sum: "$amount" } } },
{ $sort: { total: -1 } },
{ $limit: 10 }
])
The pipeline array is passed to db.collection.aggregate(). Stages execute in the order they appear, and each stage can use any number of expressions — functions that operate on field values, literals, or the results of other expressions.
Stage Ordering Matters
Rearranging stages changes both results and performance. Place $match and $limit as early as possible to reduce the number of documents flowing through downstream stages. A $match at the beginning of a pipeline can use indexes, just like a find() query, while a $match in the middle of a pipeline cannot.
// Efficient: filter first, then transform
db.orders.aggregate([
{ $match: { date: { $gte: ISODate("2026-01-01") } } },
{ $group: { _id: "$product", total: { $sum: "$quantity" } } }
])
// Less efficient: transform all documents, then filter
db.orders.aggregate([
{ $group: { _id: "$product", total: { $sum: "$quantity" } } },
{ $match: { total: { $gte: 100 } } }
])
Core Stages
MongoDB provides over 30 aggregation stages. The most commonly used stages form the foundation of almost every pipeline.
$match
Filters documents, keeping only those that match the specified condition. Place $match as early as possible in the pipeline. It uses indexes and reduces the number of documents flowing into subsequent stages.
db.sales.aggregate([
{ $match: { region: "EMEA", amount: { $gte: 1000 } } }
])
$project
Passes along documents with only the specified fields. You can include fields, exclude fields, or create new computed fields.
db.sales.aggregate([
{ $project: {
_id: 0,
orderId: 1,
customerName: 1,
total: { $multiply: ["$price", "$quantity"] },
tax: { $multiply: ["$price", "$quantity", 0.08] }
}}
])
$group
Groups documents by a specified expression and applies accumulator expressions to each group. The _id field defines the group key.
db.sales.aggregate([
{ $group: {
_id: "$region",
totalSales: { $sum: "$amount" },
avgOrder: { $avg: "$amount" },
minOrder: { $min: "$amount" },
maxOrder: { $max: "$amount" },
orderCount: { $sum: 1 }
}}
])
$sort
Sorts documents by specified fields. Use 1 for ascending, -1 for descending. When $sort appears early in the pipeline and the sort key is indexed, MongoDB can avoid an in-memory sort.
db.sales.aggregate([
{ $sort: { total: -1, customerName: 1 } }
])
$limit
Passes only the first N documents to the next stage. Combine with $sort to implement top-N queries.
db.sales.aggregate([
{ $sort: { amount: -1 } },
{ $limit: 5 }
])
$skip
Skips the first N documents and passes the rest. Often used with $sort and $limit for pagination.
db.sales.aggregate([
{ $sort: { orderDate: -1 } },
{ $skip: 20 },
{ $limit: 10 }
])
$unwind
Deconstructs an array field, outputting one document per array element. This is one of the most powerful stages because it lets you treat array elements as individual documents for grouping and filtering.
db.orders.aggregate([
{ $unwind: "$items" },
{ $group: {
_id: "$items.productId",
totalSold: { $sum: "$items.quantity" }
}}
])
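Besides the simple string form shown above, `$unwind` accepts a documented object form. A minimal sketch, reusing the field names from the example above: `preserveNullAndEmptyArrays` keeps documents whose array field is missing, null, or empty (the string form silently drops them), and `includeArrayIndex` records each element's position.

```javascript
// Object form of $unwind: keep orders with no items instead of dropping them.
const pipeline = [
  { $unwind: {
      path: "$items",
      includeArrayIndex: "itemIndex",       // position of the element in the original array
      preserveNullAndEmptyArrays: true      // emit orders whose items field is missing/empty
  }}
];
// Run in mongosh: db.orders.aggregate(pipeline)
```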
$lookup
Performs a left outer join with another collection. For each input document, $lookup adds an array field whose elements are matching documents from the “from” collection.
db.orders.aggregate([
{ $lookup: {
from: "products",
localField: "productId",
foreignField: "_id",
as: "productDetails"
}},
{ $unwind: "$productDetails" },
{ $project: {
orderDate: 1,
productName: "$productDetails.name",
price: "$productDetails.price"
}}
])
In MongoDB 3.6 and later, $lookup supports a pipeline syntax for more flexible joins:
db.orders.aggregate([
{ $lookup: {
from: "products",
let: { prodId: "$productId" },
pipeline: [
{ $match: { $expr: { $eq: ["$_id", "$$prodId"] } } },
{ $project: { name: 1, price: 1, category: 1 } }
],
as: "productDetails"
}}
])
$addFields
Adds new fields to documents without affecting existing fields. Useful for computed values, string manipulations, or date calculations that subsequent stages need.
db.sales.aggregate([
{ $addFields: {
totalWithTax: { $multiply: ["$amount", 1.08] },
orderYear: { $year: "$orderDate" },
orderMonth: { $month: "$orderDate" }
}}
])
$count
Returns a count of the remaining documents in the pipeline. Useful for quickly checking how many documents survive a series of stages.
db.sales.aggregate([
{ $match: { region: "APAC", amount: { $gte: 500 } } },
{ $count: "highValueApacOrders" }
])
// Result: { "highValueApacOrders": 1247 }
$out
Writes the pipeline results to a collection. If the collection already exists, $out replaces it atomically.
db.sales.aggregate([
{ $match: { year: 2026 } },
{ $group: { _id: "$region", total: { $sum: "$amount" } } },
{ $out: "regional_totals_2026" }
])
$merge
A more flexible alternative to $out. $merge can insert, merge, replace, keep existing, or fail when writing results to a collection. It can also output to a different database.
db.sales.aggregate([
{ $group: { _id: "$region", total: { $sum: "$amount" } } },
{ $merge: {
into: "region_totals",
on: "_id",
whenMatched: "merge",
whenNotMatched: "insert"
}}
])
Stage Reference Table
| Stage | Purpose | Syntax Example | Memory Limit |
|---|---|---|---|
| `$match` | Filter documents | `{ $match: { status: "active" } }` | 100MB default per stage |
| `$project` | Reshape documents | `{ $project: { name: 1, total: 1 } }` | 100MB default per stage |
| `$group` | Group by key, compute aggregates | `{ $group: { _id: "$city", avg: { $avg: "$age" } } }` | 100MB default per stage |
| `$sort` | Order documents | `{ $sort: { price: -1 } }` | 100MB default per stage |
| `$limit` | Pass first N documents | `{ $limit: 10 }` | None |
| `$skip` | Skip N documents | `{ $skip: 20 }` | None |
| `$unwind` | Deconstruct array | `{ $unwind: "$tags" }` | None |
| `$lookup` | Join with another collection | `{ $lookup: { from: "orders", ... } }` | None (uses indexes) |
| `$addFields` | Add computed fields | `{ $addFields: { total: { $add: ["$a", "$b"] } } }` | None |
| `$count` | Count remaining docs | `{ $count: "total" }` | None |
| `$out` | Write results to collection | `{ $out: "summary" }` | None |
| `$merge` | Merge into collection | `{ $merge: { into: "target" } }` | None |
| `$bucket` | Bucket documents into ranges | `{ $bucket: { groupBy: "$price", boundaries: [0, 50, 100] } }` | 100MB default |
| `$facet` | Multi-faceted aggregation | `{ $facet: { stats: [...], top: [...] } }` | 100MB per facet |
| `$replaceWith` | Replace doc with sub-doc | `{ $replaceWith: "$address" }` | None |
Stages marked with a 100MB memory limit throw an error if they exceed that limit. Use { allowDiskUse: true } to bypass the limit by writing temporary data to disk.
Expressions and Operators
Expressions are the building blocks of aggregation stages. They evaluate to a value and can reference field paths, literals, or other expressions.
Field Paths and Variables
- `$fieldName` — references the value of `fieldName` in the current document
- `$$CURRENT.fieldName` — same as `$fieldName`, explicit syntax
- `$$ROOT` — references the root document
- `$$NOW` — the current datetime
- `$$CLUSTER_TIME` — the current cluster time
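These system variables can appear anywhere an expression is accepted. A hedged sketch — the `orders` collection and output field names are assumptions, not from the source data:

```javascript
// Snapshot-style projection: embed the untouched input document under
// "original" and stamp each output with the pipeline's current datetime.
// $$NOW resolves to the same value for every document in one run.
const pipeline = [
  { $addFields: {
      original: "$$ROOT",   // the root document as it entered the pipeline
      readAt: "$$NOW"       // current datetime system variable
  }}
];
// Run in mongosh: db.orders.aggregate(pipeline)
```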
Accumulators (used in $group)
| Operator | Description | Example |
|---|---|---|
| `$sum` | Sum of values | `{ $sum: "$amount" }` |
| `$avg` | Average of values | `{ $avg: "$score" }` |
| `$min` | Minimum value | `{ $min: "$price" }` |
| `$max` | Maximum value | `{ $max: "$price" }` |
| `$push` | Array of all values | `{ $push: "$name" }` |
| `$addToSet` | Array of unique values | `{ $addToSet: "$category" }` |
| `$first` | First value in group | `{ $first: "$createdAt" }` |
| `$last` | Last value in group | `{ $last: "$updatedAt" }` |
| `$stdDevPop` | Population standard deviation | `{ $stdDevPop: "$score" }` |
| `$stdDevSamp` | Sample standard deviation | `{ $stdDevSamp: "$score" }` |
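Note that `$first` and `$last` depend on the order in which documents reach `$group`, so they are usually paired with a preceding `$sort`. A sketch of the classic "latest document per group" pattern — collection and field names are illustrative:

```javascript
// Latest order per customer: sort newest-first, then take $first per group.
const pipeline = [
  { $sort: { orderDate: -1 } },
  { $group: {
      _id: "$customerId",
      latestOrderDate: { $first: "$orderDate" },  // newest, thanks to the $sort
      latestTotal: { $first: "$total" }
  }}
];
// Run in mongosh: db.orders.aggregate(pipeline)
```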
Arithmetic Expression Operators
{ $add: ["$price", "$tax"] }
{ $subtract: ["$total", "$discount"] }
{ $multiply: ["$quantity", "$unitPrice"] }
{ $divide: ["$total", "$count"] }
{ $mod: ["$amount", 100] }
{ $abs: "$netValue" }
{ $ceil: "$average" }
{ $floor: "$average" }
{ $round: ["$amount", 2] }
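These operators nest freely. A sketch composing several of them — `price` and `cost` are assumed field names:

```javascript
// Margin percentage per product, rounded to one decimal place:
// (price - cost) / price * 100
const pipeline = [
  { $project: {
      name: 1,
      marginPct: {
        $round: [
          { $multiply: [
              { $divide: [ { $subtract: ["$price", "$cost"] }, "$price" ] },
              100
          ]},
          1
        ]
      }
  }}
];
// Run in mongosh: db.products.aggregate(pipeline)
```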
String Expression Operators
{ $concat: ["$firstName", " ", "$lastName"] }
{ $toUpper: "$name" }
{ $toLower: "$email" }
{ $substr: ["$phone", 0, 3] }
{ $trim: { input: "$username" } }
{ $split: ["$tags", ","] }
{ $regexMatch: { input: "$email", regex: /@company\.com$/ } }
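String operators also compose. A small sketch, with assumed field names, building a normalized handle from two name fields:

```javascript
// Trim whitespace from each part, join with a dot, and lowercase the result.
const pipeline = [
  { $project: {
      handle: {
        $toLower: {
          $concat: [
            { $trim: { input: "$firstName" } },
            ".",
            { $trim: { input: "$lastName" } }
          ]
        }
      }
  }}
];
// Run in mongosh: db.users.aggregate(pipeline)
```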
Date Expression Operators
{ $year: "$orderDate" }
{ $month: "$orderDate" }
{ $dayOfMonth: "$orderDate" }
{ $dayOfWeek: "$orderDate" }
{ $hour: "$timestamp" }
{ $dateToString: { format: "%Y-%m-%d", date: "$orderDate" } }
{ $dateFromParts: { year: "$year", month: "$month", day: "$day" } }
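A common use of `$dateToString` is bucketing by calendar period inside `$group`. A sketch — the `sales` collection and field names are assumptions:

```javascript
// Monthly revenue keyed by a "YYYY-MM" string.
const pipeline = [
  { $group: {
      _id: { $dateToString: { format: "%Y-%m", date: "$orderDate" } },
      revenue: { $sum: "$amount" }
  }},
  { $sort: { _id: 1 } }   // chronological, since "YYYY-MM" sorts lexically
];
// Run in mongosh: db.sales.aggregate(pipeline)
```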
Conditional Expression Operators
$cond evaluates a boolean expression and returns one of two values:
db.sales.aggregate([
{ $project: {
orderId: 1,
status: {
$cond: {
if: { $gte: ["$amount", 1000] },
then: "High Value",
else: "Standard"
}
}
}}
])
$switch handles multiple conditions:
db.sales.aggregate([
{ $project: {
orderId: 1,
tier: {
$switch: {
branches: [
{ case: { $gte: ["$amount", 10000] }, then: "Platinum" },
{ case: { $gte: ["$amount", 5000] }, then: "Gold" },
{ case: { $gte: ["$amount", 1000] }, then: "Silver" }
],
default: "Bronze"
}
}
}}
])
$ifNull returns a substitute value when a field is null or missing:
{ $ifNull: ["$discount", 0] }
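`$ifNull` typically appears inside a larger expression so that missing data does not poison the computation. A sketch, with assumed field names:

```javascript
// Treat a missing or null discount as 0 when computing the net amount.
const pipeline = [
  { $project: {
      orderId: 1,
      net: { $subtract: ["$amount", { $ifNull: ["$discount", 0] }] }
  }}
];
// Run in mongosh: db.sales.aggregate(pipeline)
```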
Complete Pipeline Examples
Basic Pipeline: Top Products by Revenue
db.orders.aggregate([
{ $match: { status: "completed" } },
{ $unwind: "$items" },
{ $group: {
_id: "$items.productName",
revenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } },
unitsSold: { $sum: "$items.quantity" }
}},
{ $sort: { revenue: -1 } },
{ $limit: 10 },
{ $project: {
_id: 0,
product: "$_id",
revenue: { $round: ["$revenue", 2] },
unitsSold: 1
}}
])
Pipeline with $lookup and $unwind
db.orders.aggregate([
{ $match: { status: "shipped" } },
{ $lookup: {
from: "customers",
localField: "customerId",
foreignField: "_id",
as: "customer"
}},
{ $unwind: "$customer" },
{ $group: {
_id: "$customer.region",
totalRevenue: { $sum: "$total" },
topCustomer: { $max: "$customer.name" },
orderCount: { $sum: 1 }
}},
{ $sort: { totalRevenue: -1 } },
{ $out: "regional_revenue" }
])
Conditional Aggregation with $cond
db.sales.aggregate([
{ $group: {
_id: "$salesperson",
total: { $sum: "$amount" },
highValue: { $sum: { $cond: [{ $gte: ["$amount", 1000] }, 1, 0] } },
lowValue: { $sum: { $cond: [{ $lt: ["$amount", 1000] }, 1, 0] } }
}},
{ $project: {
salesperson: "$_id",
total: 1,
highValue: 1,
lowValue: 1,
highValueRate: {
$cond: [
{ $gt: ["$total", 0] },
{ $round: [{ $divide: ["$highValue", { $add: ["$highValue", "$lowValue"] }] }, 2] },
0
]
}
}}
])
Bucket Stage for Histograms
$bucket categorizes documents into ranges. This example creates a price histogram:
db.products.aggregate([
{ $bucket: {
groupBy: "$price",
boundaries: [0, 10, 25, 50, 100, 250, 500, 1000],
default: "Other",
output: {
count: { $sum: 1 },
avgPrice: { $avg: "$price" },
products: { $push: "$name" }
}
}}
])
Facet Stage for Multi-Faceted Analysis
$facet runs multiple sub-pipelines on the same set of input documents. Each sub-pipeline produces its own output array:
db.orders.aggregate([
{ $match: { status: "completed", date: { $gte: ISODate("2026-01-01") } } },
{ $facet: {
revenueByRegion: [
{ $group: { _id: "$region", revenue: { $sum: "$total" } } },
{ $sort: { revenue: -1 } }
],
topProducts: [
{ $unwind: "$items" },
{ $group: { _id: "$items.productName", sold: { $sum: "$items.quantity" } } },
{ $sort: { sold: -1 } },
{ $limit: 5 }
],
summary: [
{ $group: {
_id: null,
totalOrders: { $sum: 1 },
totalRevenue: { $sum: "$total" },
avgOrderValue: { $avg: "$total" }
}}
]
}}
])
The result is a single document with three keys, each containing an array:
{
"revenueByRegion": [ { "_id": "EMEA", "revenue": 452000 }, ... ],
"topProducts": [ { "_id": "Widget Pro", "sold": 1420 }, ... ],
"summary": [ { "_id": null, "totalOrders": 8453, "totalRevenue": 1892000, "avgOrderValue": 223.81 } ]
}
Aggregation Options
The aggregate() method accepts an optional second parameter — an options document:
db.collection.aggregate(pipeline, {
allowDiskUse: true,
explain: true,
collation: { locale: "en", strength: 2 },
hint: { status: 1, date: -1 },
bypassDocumentValidation: true,
comment: "monthly-report-q1-2026"
})
| Option | Type | Description |
|---|---|---|
| `allowDiskUse` | boolean | Enable disk-based sorting and grouping for stages exceeding 100MB memory |
| `explain` | boolean | Return query execution plan instead of results |
| `collation` | document | Specify locale rules for string comparison |
| `hint` | document | Force use of a specific index |
| `bypassDocumentValidation` | boolean | Skip document validation on write operations |
| `comment` | string | Attach a comment visible in logs and profiler |
allowDiskUse: true
Stages like $sort and $group have a 100MB memory limit. When this limit is reached, MongoDB returns an error unless allowDiskUse: true is specified, which spills temporary data to disk:
db.largeCollection.aggregate([
{ $group: { _id: "$field", total: { $sum: "$value" } } },
{ $sort: { total: -1 } }
], { allowDiskUse: true })
explain: true
Use explain to understand how MongoDB executes your pipeline, which indexes it uses, and how many documents each stage processes:
const explanation = db.orders.explain("executionStats").aggregate([
{ $match: { status: "completed" } },
{ $group: { _id: "$customerId", total: { $sum: "$amount" } } }
])
Collation
Collation specifies language-specific rules for string comparison — case sensitivity, accent sensitivity, and locale:
db.products.aggregate([
{ $match: { name: { $gte: "apple", $lte: "banana" } } },
{ $group: { _id: "$category", count: { $sum: 1 } } }
], { collation: { locale: "en", strength: 1 } })
// strength: 1 ignores case and diacritics
Hint
Force the query optimizer to use a specific index:
db.orders.aggregate([
{ $match: { status: "pending" } },
{ $group: { _id: "$region", count: { $sum: 1 } } }
], { hint: { status: 1 } })
Pipeline Optimization
Projection Optimization
If a later stage only needs a subset of fields, project early to reduce the document size flowing through the pipeline:
// Without optimization: all fields pass through $match
db.users.aggregate([
{ $match: { status: "active" } },
{ $group: { _id: "$department", avg: { $avg: "$salary" } } }
])
// With optimization: discard unused fields before $group
db.users.aggregate([
{ $match: { status: "active" } },
{ $project: { department: 1, salary: 1, _id: 0 } },
{ $group: { _id: "$department", avg: { $avg: "$salary" } } }
])
Match Optimization
Place $match as early as possible. MongoDB can use indexes for $match only when it is the first stage in the pipeline:
// Optimal: $match uses index on status + date
db.orders.aggregate([
{ $match: { status: "completed", date: { $gte: ISODate("2026-01-01") } } },
{ $group: { _id: "$region", total: { $sum: "$amount" } } }
])
// Suboptimal: $sort executes before $match, no index benefit
db.orders.aggregate([
{ $sort: { amount: -1 } },
{ $match: { status: "completed" } }
])
Sort Optimization
When $sort immediately follows $match or appears early in the pipeline, MongoDB can use an index for the sort. If not, MongoDB performs an in-memory sort limited to 100MB by default:
// Can use index on { date: -1 }
db.events.aggregate([
{ $match: { type: "purchase" } },
{ $sort: { date: -1 } },
{ $limit: 50 }
])
// Cannot use index for sort (after $group)
db.events.aggregate([
{ $group: { _id: "$type", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
])
Limit and Sort Together
When $sort is followed by $limit, MongoDB optimizes this into a top-k sort that only keeps the K best results in memory:
// MongoDB's top-k optimization: only 10 results kept in memory during sort
db.products.aggregate([
{ $sort: { rating: -1 } },
{ $limit: 10 }
])
Comparison: Aggregation vs find()
| Feature | find() | aggregate() |
|---|---|---|
| Simple filtering | Yes | Yes |
| Field projection | Yes (limited) | Yes (full) |
| Sorting | Yes | Yes |
| Grouping / aggregation | No | Yes |
| Joins | No | Yes (`$lookup`) |
| Computed fields | No | Yes |
| Array unwinding | No | Yes |
| Conditional logic | No | Yes |
| Multi-stage transformations | No | Yes |
| Pipeline limit | N/A | 100MB per stage (default) |
| Results as cursor | Yes | Yes |
| Results to collection | No | Yes (`$out`, `$merge`) |
When find() is Better
Use find() when you need a simple filter with a projection and sort. It is easier to write, supports cursor-based iteration efficiently, and integrates with the Query Planner for optimal index selection:
db.products.find(
{ category: "Electronics", price: { $lte: 100 } },
{ name: 1, price: 1, _id: 0 }
).sort({ price: 1 }).limit(20)
When Aggregation is Better
Use aggregate() when you need to group, compute, join, or reshape data. The pipeline model lets you compose complex transformations that would require multiple queries and application-side logic with find():
// This single pipeline replaces: a find, a group + reduce in app code,
// a join query, and a sort
db.orders.aggregate([
{ $match: { status: "completed" } },
{ $lookup: { from: "customers", localField: "customerId", foreignField: "_id", as: "customer" }},
{ $unwind: "$customer" },
{ $group: { _id: "$customer.region", total: { $sum: "$total" }, orders: { $push: "$_id" } }},
{ $sort: { total: -1 } }
])
Performance Tips
- **Position $match and $limit first**: Reduces the number of documents entering subsequent stages. A first-stage `$match` can use indexes.
- **Use indexes for early stages**: Create indexes that support your `$match` conditions and any early `$sort` operations.
- **Project early, project often**: Remove unnecessary fields with `$project` as soon as possible to reduce memory pressure.
- **Prefer $match over $project for filtering**: `$project` cannot use indexes. Filter with `$match` first, then transform with `$project`.
- **Use allowDiskUse for large datasets**: Without this flag, stages like `$sort` and `$group` fail when they exceed 100MB of RAM.
- **Prefer $addFields over $project for adding fields**: `$addFields` preserves existing fields, so you don't need to explicitly list every field you want to keep.
- **Use $lookup with indexes**: Ensure the foreign field in the joined collection is indexed. In MongoDB 5.0+, use the pipeline syntax with `$match` on indexed fields.
- **Avoid $unwind followed by $group when possible**: If you only need aggregated values from array elements, consider using `$reduce` or array expression operators instead, which avoid document multiplication.
- **Monitor stage memory with explain()**: Use `db.collection.explain("executionStats").aggregate(pipeline)` to see how many documents each stage processes and whether indexes are being used.
- **Use $facet sparingly**: While powerful, `$facet` runs all sub-pipelines on the same input set, so expensive operations inside sub-pipelines multiply their cost. Each sub-pipeline also has its own 100MB memory limit.
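To illustrate the $unwind-avoidance tip: when you only need a per-document total over an array, `$sum` accepts an array-valued expression directly in `$project` or `$addFields`. A sketch with assumed field names:

```javascript
// Per-order units without $unwind + $group: "$items.quantity" resolves to
// the array of quantity values, and $sum adds them up within each document.
const pipeline = [
  { $project: { totalUnits: { $sum: "$items.quantity" } } }
];
// Run in mongosh: db.orders.aggregate(pipeline)
```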
Resources
- Aggregation Pipeline Quick Reference — official stage and operator reference
- Aggregation Pipeline Stages — complete stage documentation
- Aggregation Pipeline Optimization — MongoDB’s automatic optimizations
- Aggregation Pipeline Limits — memory limits and restrictions
- SQL to Aggregation Mapping Chart — translate SQL queries to aggregation pipelines
- M121: MongoDB Aggregation Framework — the original MongoDB University course