Aggregation Pipeline Optimization

Aggregation pipeline operations have an optimization phase that attempts to reshape the pipeline for improved performance.

To see how the optimizer transforms a particular aggregation pipeline, include the explain option in the db.collection.aggregate() method.

Optimizations are subject to change between releases.

In addition to the optimizations performed during the optimization phase, this page shows how to improve aggregation pipeline performance using indexes and document filters.

You can run aggregation pipelines in the UI for deployments hosted in MongoDB Atlas.

Projection Optimization

The aggregation pipeline can determine whether it requires only a subset of the fields in the documents to obtain the results. If so, the pipeline uses only those fields, reducing the amount of data passing through the pipeline.

$project Stage Placement

When you use a $project stage, it should typically be the last stage in your pipeline, used to specify which fields to return to the client.

Using a $project stage at the beginning or middle of a pipeline to reduce the number of fields passed to subsequent pipeline stages is unlikely to improve performance, because the database performs this optimization automatically.

Pipeline Sequence Optimization

($project or $unset or $addFields or $set) + $match Sequence Optimization

For an aggregation pipeline that contains a projection stage ($addFields, $project, $set, or $unset) followed by a $match stage, MongoDB moves any filters in the $match stage that do not require values computed in the projection stage to a new $match stage before the projection.

If an aggregation pipeline contains multiple projection or $match stages, MongoDB performs this optimization for each $match stage, moving each $match filter before all projection stages that the filter does not depend on.

Consider a pipeline with the following stages:

{
  $addFields: {
    maxTime: { $max: "$times" },
    minTime: { $min: "$times" }
  }
},
{
  $project: {
    _id: 1,
    name: 1,
    times: 1,
    maxTime: 1,
    minTime: 1,
    avgTime: { $avg: ["$maxTime", "$minTime"] }
  }
},
{
  $match: {
    name: "Joe Schmoe",
    maxTime: { $lt: 20 },
    minTime: { $gt: 5 },
    avgTime: { $gt: 7 }
  }
}

The optimizer breaks up the $match stage into four individual filters, one for each key in the $match query document. The optimizer then moves each filter before as many projection stages as possible, creating new $match stages as needed.

Given this example, the optimizer automatically produces the following optimized pipeline:

{ $match: { name: "Joe Schmoe" } },
{ $addFields: {
    maxTime: { $max: "$times" },
    minTime: { $min: "$times" }
} },
{ $match: { maxTime: { $lt: 20 }, minTime: { $gt: 5 } } },
{ $project: {
    _id: 1, name: 1, times: 1, maxTime: 1, minTime: 1,
    avgTime: { $avg: ["$maxTime", "$minTime"] }
} },
{ $match: { avgTime: { $gt: 7 } } }

Note

The optimized pipeline is not intended to be run manually. The original and optimized pipelines return the same results.

You can see the optimized pipeline in the explain plan.

The $match filter { avgTime: { $gt: 7 } } depends on the $project stage to compute the avgTime field. The $project stage is the last projection stage in this pipeline, so the $match filter on avgTime could not be moved.

The maxTime and minTime fields are computed in the $addFields stage but have no dependency on the $project stage. The optimizer created a new $match stage for the filters on these fields and placed it before the $project stage.

The $match filter { name: "Joe Schmoe" } does not use any values computed in either the $project or $addFields stages, so it was moved to a new $match stage before both of the projection stages.

After optimization, the filter { name: "Joe Schmoe" } is in a $match stage at the beginning of the pipeline. This has the added benefit of allowing the aggregation to use an index on the name field when initially querying the collection.
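
The placement rule described in this section can be modeled in a few lines of plain JavaScript. This is only a simplified sketch, not MongoDB's optimizer code: the helper names computedFields and pushDownMatch are hypothetical, and the model assumes the pipeline is a run of $addFields, $set, or $project stages with object specifications followed by one $match whose keys are plain field names. It treats anything other than a plain inclusion (1 or true) as a computed field, which errs on the side of not moving a filter.

```javascript
// Simplified model of $match pushdown; NOT MongoDB's actual optimizer.
// Assumption: every stage before the final $match is $addFields, $set,
// or $project with an object specification.

// Return the names of the fields a projection stage computes.
// Plain inclusions such as { name: 1 } pass a field through unchanged,
// so they are not treated as computed.
function computedFields(stage) {
  const op = Object.keys(stage)[0];
  return Object.entries(stage[op])
    .filter(([, spec]) => spec !== 1 && spec !== true)
    .map(([field]) => field);
}

// Rewrite [projection, ..., projection, $match] so that each filter sits
// directly after the last projection stage that computes its field.
function pushDownMatch(pipeline) {
  const projections = pipeline.slice(0, -1);
  const filters = pipeline[pipeline.length - 1].$match;

  // slots[0] holds filters that can run before every projection stage;
  // slots[i + 1] holds filters that must run after projections[i].
  const slots = Array.from({ length: projections.length + 1 }, () => ({}));
  for (const [field, condition] of Object.entries(filters)) {
    let slot = 0;
    projections.forEach((stage, i) => {
      if (computedFields(stage).includes(field)) slot = i + 1;
    });
    slots[slot][field] = condition;
  }

  // Interleave the non-empty filter groups with the projection stages.
  const result = [];
  slots.forEach((group, i) => {
    if (Object.keys(group).length > 0) result.push({ $match: group });
    if (i < projections.length) result.push(projections[i]);
  });
  return result;
}

// The example pipeline from this section.
const optimized = pushDownMatch([
  { $addFields: { maxTime: { $max: "$times" }, minTime: { $min: "$times" } } },
  { $project: { _id: 1, name: 1, times: 1, maxTime: 1, minTime: 1,
                avgTime: { $avg: ["$maxTime", "$minTime"] } } },
  { $match: { name: "Joe Schmoe", maxTime: { $lt: 20 },
              minTime: { $gt: 5 }, avgTime: { $gt: 7 } } }
]);
console.log(JSON.stringify(optimized, null, 2));
```

In this model the name filter lands before both projection stages, the maxTime and minTime filters land between $addFields and $project, and the avgTime filter stays last. The real optimizer also has to account for renamed and excluded fields, which this sketch ignores.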

$sort + $match Sequence Optimization

When a $sort is followed by a $match, the $match moves before the $sort to minimize the number of objects to sort. For example, if the pipeline consists of the following stages:

{ $sort: { age : -1 } },
{ $match: { status: 'A' } }

During the optimization phase, the optimizer transforms the sequence to the following:

{ $match: { status: 'A' } },
{ $sort: { age : -1 } }
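
To see why this reordering is safe, the two stage orders can be imitated on an in-memory array. The sketch below uses made-up documents and plain JavaScript array methods in place of the $sort and $match stages; it illustrates the idea only and says nothing about how the server executes the stages.

```javascript
// Illustrative data; only the age and status fields matter here.
const docs = [
  { name: "a", age: 30, status: "A" },
  { name: "b", age: 45, status: "B" },
  { name: "c", age: 25, status: "A" },
  { name: "d", age: 52, status: "A" }
];

const byAgeDesc = (x, y) => y.age - x.age;      // stands in for { $sort: { age: -1 } }
const isStatusA = (doc) => doc.status === "A";  // stands in for { $match: { status: 'A' } }

// Original order: sort all four documents, then filter.
const sortThenMatch = [...docs].sort(byAgeDesc).filter(isStatusA);

// Optimized order: filter first, so only three documents are sorted.
const matchThenSort = docs.filter(isStatusA).sort(byAgeDesc);

console.log(sortThenMatch.map((d) => d.name));
console.log(matchThenSort.map((d) => d.name));
```

Both orders yield the same documents in the same order, but the second one sorts a smaller input.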

$redact + $match Sequence Optimization

When the pipeline has a $redact stage immediately followed by a $match stage, the aggregation can sometimes add a portion of the $match stage before the $redact stage. If the added $match stage is at the start of a pipeline, the aggregation can use an index as well as query the collection to limit the number of documents that enter the pipeline. See Improve Performance with Indexes and Document Filters for more information.

For example, if the pipeline consists of the following stages:

{ $redact: { $cond: { if: { $eq: [ "$level", 5 ] }, then: "$$PRUNE", else: "$$DESCEND" } } },
{ $match: { year: 2014, category: { $ne: "Z" } } }

The optimizer can add a $match stage containing a portion of the original filter (here, the year condition) before the $redact stage:

{ $match: { year: 2014 } },
{ $redact: { $cond: { if: { $eq: [ "$level", 5 ] }, then: "$$PRUNE", else: "$$DESCEND" } } },
{ $match: { year: 2014, category: { $ne: "Z" } } }

$project/$unset + $skip Sequence Optimization

When a $project or $unset is followed by a $skip, the $skip moves before the $project or $unset. For example, if the pipeline consists of the following stages:

{ $sort: { age : -1 } },
{ $project: { status: 1, name: 1 } },
{ $skip: 5 }

During the optimization phase, the optimizer transforms the sequence to the following:

{ $sort: { age : -1 } },
{ $skip: 5 },
{ $project: { status: 1, name: 1 } }

Pipeline Coalescence Optimization

When possible, the optimization phase coalesces a pipeline stage into its predecessor. Generally, coalescence occurs after any sequence reordering optimization.

$sort + $limit Coalescence

When a $sort precedes a $limit, the optimizer can coalesce the $limit into the $sort if no intervening stages modify the number of documents (for example, $unwind or $group). If any stage between the $sort and $limit changes the number of documents, MongoDB does not coalesce the $limit into the $sort.

For example, if the pipeline consists of the following stages:

{ $sort : { age : -1 } },
{ $project : { age : 1, status : 1, name : 1 } },
{ $limit: 5 }

During the optimization phase, the optimizer coalesces the sequence to the following:

{
  "$sort" : {
    "sortKey" : {
      "age" : -1
    },
    "limit" : Long(5)
  }
},
{
  "$project" : {
    "age" : 1,
    "status" : 1,
    "name" : 1
  }
}

This allows the sort operation to maintain only the top n results as it progresses, where n is the specified limit, so MongoDB needs to store only n items in memory [1]. See $sort Operator and Memory for more information.
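
The memory saving from keeping only the top n results can be sketched with a bounded, always-sorted buffer. The function name topN and the sample data are hypothetical, and the server's sort is not claimed to be implemented this way; the sketch only shows that a limited sort never needs to hold more than n documents at once.

```javascript
// Keep only the n best documents seen so far instead of sorting everything.
// `compare` follows the Array.prototype.sort convention.
function topN(docs, n, compare) {
  const best = []; // always sorted, never longer than n
  for (const doc of docs) {
    // Walk back from the end to find the position that keeps `best` sorted.
    let i = best.length;
    while (i > 0 && compare(doc, best[i - 1]) < 0) i--;
    if (i < n) {
      best.splice(i, 0, doc);
      if (best.length > n) best.pop(); // drop the document that fell out of the top n
    }
  }
  return best;
}

// Made-up input: seven documents, of which only the three oldest are wanted.
const people = [30, 45, 25, 52, 45, 61, 19].map((age, _id) => ({ _id, age }));
const byAgeDesc = (x, y) => y.age - x.age;

const limited = topN(people, 3, byAgeDesc);
const fullSortThenLimit = [...people].sort(byAgeDesc).slice(0, 3);
console.log(limited.map((d) => d.age));
```

The bounded version returns the same documents as a full sort followed by a limit while holding at most three of them at any time.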

Note

Sequence Optimization with $skip

If there is a $skip stage between the $sort and $limit stages, MongoDB coalesces the $limit into the $sort stage and increases the $limit value by the $skip amount. See $sort + $skip + $limit Sequence for an example.

[1] The optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.

$limit + $limit Coalescence

When a $limit immediately follows another $limit, the two stages can coalesce into a single $limit where the limit amount is the smaller of the two initial limit amounts. For example, a pipeline contains the following sequence:

{ $limit: 100 },
{ $limit: 10 }

Then the second $limit stage can coalesce into the first $limit stage and result in a single $limit stage where the limit amount 10 is the minimum of the two initial limits 100 and 10.

{ $limit: 10 }

$skip + $skip Coalescence

When a $skip immediately follows another $skip, the two stages can coalesce into a single $skip where the skip amount is the sum of the two initial skip amounts. For example, a pipeline contains the following sequence:

{ $skip: 5 },
{ $skip: 2 }

Then the second $skip stage can coalesce into the first $skip stage and result in a single $skip stage where the skip amount 7 is the sum of the two initial skip amounts 5 and 2.

{ $skip: 7 }

$match + $match Coalescence

When a $match immediately follows another $match, the two stages can coalesce into a single $match combining the conditions with an $and. For example, a pipeline contains the following sequence:

{ $match: { year: 2014 } },
{ $match: { status: "A" } }

Then the second $match stage can coalesce into the first $match stage and result in a single $match stage:

{ $match: { $and: [ { "year" : 2014 }, { "status" : "A" } ] } }
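
The three rules for adjacent stages of the same type ($limit + $limit, $skip + $skip, and $match + $match) can be summarized in one small function. This is a hedged sketch in plain JavaScript: coalesce is a hypothetical helper operating on pipeline arrays, not an API of the server or the shell.

```javascript
// Merge a stage into its predecessor when both are the same coalescable type.
// Illustrative model only; the real optimizer handles many more cases.
function coalesce(pipeline) {
  const result = [];
  for (const stage of pipeline) {
    const prev = result[result.length - 1];
    if (prev && "$limit" in prev && "$limit" in stage) {
      // Two limits: the smaller one wins.
      result[result.length - 1] = { $limit: Math.min(prev.$limit, stage.$limit) };
    } else if (prev && "$skip" in prev && "$skip" in stage) {
      // Two skips: the amounts add up.
      result[result.length - 1] = { $skip: prev.$skip + stage.$skip };
    } else if (prev && "$match" in prev && "$match" in stage) {
      // Two matches: both conditions must hold.
      result[result.length - 1] = { $match: { $and: [prev.$match, stage.$match] } };
    } else {
      result.push(stage);
    }
  }
  return result;
}

console.log(JSON.stringify(coalesce([{ $limit: 100 }, { $limit: 10 }])));
console.log(JSON.stringify(coalesce([{ $skip: 5 }, { $skip: 2 }])));
console.log(JSON.stringify(coalesce([{ $match: { year: 2014 } }, { $match: { status: "A" } }])));
```

Applied to the three example sequences above, the sketch produces a single $limit of 10, a single $skip of 7, and a single $match with an $and of both conditions.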

$lookup, $unwind, and $match Coalescence

When $unwind immediately follows $lookup, and the $unwind operates on the as field of the $lookup, the optimizer coalesces the $unwind into the $lookup stage. This avoids creating large intermediate documents. Furthermore, if the $unwind is followed by a $match on any subfield of the $lookup's as field, the optimizer also coalesces the $match.

For example, a pipeline contains the following sequence:

{
  $lookup: {
    from: "otherCollection",
    as: "resultingArray",
    localField: "x",
    foreignField: "y"
  }
},
{ $unwind: "$resultingArray" },
{ $match: {
    "resultingArray.foo": "bar"
  }
}

The optimizer coalesces the $unwind and $match stages into the $lookup stage. If you run the aggregation with the explain option, the explain output shows the coalesced stages:

{
  $lookup: {
    from: "otherCollection",
    as: "resultingArray",
    localField: "x",
    foreignField: "y",
    let: {},
    pipeline: [
      {
        $match: {
          "foo": {
            "$eq": "bar"
          }
        }
      }
    ],
    unwinding: {
      "preserveNullAndEmptyArrays": false
    }
  }
}

You can see this optimized pipeline in the explain plan.

The unwinding field shown in the previous explain output differs from the $unwind stage. The unwinding field shows how the pipeline is internally optimized. The $unwind stage deconstructs an array field from the input documents and outputs a document for each element.
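
The benefit of folding $unwind and $match into $lookup can be illustrated with in-memory arrays. In the sketch below, the collections, the function names, and the join logic are all hypothetical stand-ins written in plain JavaScript; they model the shape of the data flow, not the server's execution.

```javascript
// Hypothetical in-memory collections standing in for real ones.
const localDocs = [{ _id: 1, x: "k1" }, { _id: 2, x: "k2" }];
const otherCollection = [
  { y: "k1", foo: "bar" },
  { y: "k1", foo: "baz" },
  { y: "k2", foo: "bar" }
];

// Staged form: build the full array per document, then unwind, then filter.
function lookupThenUnwindThenMatch(local, foreign, predicate) {
  return local
    .map((doc) => ({ ...doc, resultingArray: foreign.filter((o) => o.y === doc.x) })) // $lookup
    .flatMap((doc) => doc.resultingArray.map((o) => ({ ...doc, resultingArray: o }))) // $unwind
    .filter((doc) => predicate(doc.resultingArray));                                  // $match
}

// Coalesced form: emit one output document per matching foreign document,
// applying the filter during the join so no intermediate array is built.
function* lookupUnwindMatch(local, foreign, predicate) {
  for (const doc of local) {
    for (const other of foreign) {
      if (other.y === doc.x && predicate(other)) {
        yield { ...doc, resultingArray: other };
      }
    }
  }
}

const fooIsBar = (o) => o.foo === "bar";
const staged = lookupThenUnwindThenMatch(localDocs, otherCollection, fooIsBar);
const coalesced = [...lookupUnwindMatch(localDocs, otherCollection, fooIsBar)];
console.log(JSON.stringify(coalesced));
```

Both forms return the same documents; the coalesced form simply never materializes the per-document arrays that the staged form creates and then takes apart.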

Slot-Based Query Execution Engine Pipeline Optimizations

MongoDB can use the slot-based query execution engine to execute certain pipeline stages when specific conditions are met. In most cases, the slot-based execution engine provides improved performance and lower CPU and memory costs compared to the classic query engine.

To verify that the slot-based execution engine is used, run the aggregation with the explain option. This option outputs information on the aggregation's query plan. For more information on using explain with aggregations, see Return Information on Aggregation Pipeline Operation.

The following sections describe:

  • The conditions when the slot-based execution engine is used for aggregation.
  • How to verify if the slot-based execution engine was used.

$group Optimization

New in version 5.2.

Starting in version 5.2, MongoDB uses the slot-based query execution engine to execute $group stages if either:

  • $group is the first stage in the pipeline.
  • All preceding stages in the pipeline can also be executed by the slot-based execution engine.

When the slot-based query execution engine is used for $group, the explain results include queryPlanner.winningPlan.queryPlan.stage: "GROUP".

The location of the queryPlanner object depends on whether the pipeline contains stages after the $group stage that cannot be executed using the slot-based execution engine.

  • If $group is the last stage or all stages after $group can be executed using the slot-based execution engine, the queryPlanner object is in the top-level explain output object (explain.queryPlanner).
  • If the pipeline contains stages after $group that cannot be executed using the slot-based execution engine, the queryPlanner object is in explain.stages[0].$cursor.queryPlanner.

$lookup Optimization

New in version 6.0.

Starting in version 6.0, MongoDB can use the slot-based query execution engine to execute $lookup stages if all preceding stages in the pipeline can also be executed by the slot-based execution engine and none of the following conditions are true:

  • The $lookup operation executes a pipeline on a foreign collection. To see an example of this kind of operation, see Join Conditions and Subqueries on a Foreign Collection.
  • The $lookup's localField or foreignField specify numeric components. For example: { localField: "restaurant.0.review" }.
  • The from field of any $lookup in the pipeline specifies a view or sharded collection.

When the slot-based query execution engine is used for $lookup, the explain results include queryPlanner.winningPlan.queryPlan.stage: "EQ_LOOKUP". EQ_LOOKUP means "equality lookup".

The location of the queryPlanner object depends on whether the pipeline contains stages after the $lookup stage that cannot be executed using the slot-based execution engine.

  • If $lookup is the last stage or all stages after $lookup can be executed using the slot-based execution engine, the queryPlanner object is in the top-level explain output object (explain.queryPlanner).
  • If the pipeline contains stages after $lookup that cannot be executed using the slot-based execution engine, the queryPlanner object is in explain.stages[0].$cursor.queryPlanner.
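
Because the queryPlanner object can sit in either of the two locations described above, a small helper can check both. The sketch below is an assumption-laden illustration: findQueryPlanner and usesSlotBasedStage are hypothetical names, the sample explain objects are heavily trimmed mock-ups rather than real server output, and only the top-level queryPlan.stage value mentioned in this section is inspected.

```javascript
// Return the queryPlanner object from either documented location,
// or undefined if neither is present.
function findQueryPlanner(explain) {
  return explain.queryPlanner ?? explain.stages?.[0]?.$cursor?.queryPlanner;
}

// Report whether the winning plan's top-level queryPlan stage has the given
// name, for example "GROUP" or "EQ_LOOKUP".
function usesSlotBasedStage(explain, stageName) {
  return findQueryPlanner(explain)?.winningPlan?.queryPlan?.stage === stageName;
}

// Trimmed, made-up explain shapes for illustration only.
const topLevelExplain = {
  queryPlanner: { winningPlan: { queryPlan: { stage: "GROUP" } } }
};
const cursorExplain = {
  stages: [
    { $cursor: { queryPlanner: { winningPlan: { queryPlan: { stage: "EQ_LOOKUP" } } } } }
  ]
};

console.log(usesSlotBasedStage(topLevelExplain, "GROUP"));
console.log(usesSlotBasedStage(cursorExplain, "EQ_LOOKUP"));
```

In practice the argument would be the object returned by running db.collection.aggregate() with the explain option. Explain output varies by version and deployment, so treat the result of a check like this as a hint and confirm it by reading the plan.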

Improve Performance with Indexes and Document Filters

The following sections show how you can improve aggregation performance using indexes and document filters.

Indexes

An aggregation pipeline can use indexes from the input collection to improve performance. Using an index limits the number of documents a stage processes. Ideally, an index can cover the stage query. A covered query performs especially well because the index alone returns all the matching data, without fetching the documents themselves.

For example, a pipeline that consists of $match, $sort, and $group can benefit from indexes at every stage:

  • An index on the $match query field efficiently identifies the relevant data.
  • An index on the sorting field returns data in sorted order for the $sort stage.
  • An index on the grouping field that matches the $sort order returns all of the field values needed for the $group stage, making it a covered query.

To determine whether a pipeline uses indexes, review the query plan and look for IXSCAN or DISTINCT_SCAN plans.

Note

In some cases, the query planner uses a DISTINCT_SCAN index plan that returns one document per index key value. DISTINCT_SCAN executes faster than IXSCAN if there are multiple documents per key value. However, index scan parameters might affect the time comparison of DISTINCT_SCAN and IXSCAN.

For early stages in your aggregation pipeline, consider indexing the query fields. Stages that can benefit from indexes are:

$match stage
During the $match stage, the server can use an index if $match is the first stage in the pipeline, after any optimizations from the query planner.
$sort stage
During the $sort stage, the server can use an index if the stage is not preceded by a $project, $unwind, or $group stage.
$group stage

During the $group stage, the server can use an index to quickly find the $first or $last document in each group if the stage meets both of these conditions:

  • The pipeline sorts and groups by the same field.
  • The $group stage only uses the $first or $last accumulator operator.

See $group Performance Optimizations for an example.

$geoNear stage
The server always uses an index for the $geoNear stage, since it requires a geospatial index.

Additionally, stages later in the pipeline that retrieve data from other, unmodified collections can use indexes on those collections for optimization. These stages include $lookup, $graphLookup, and $unionWith.

Document Filters

If your aggregation operation requires only a subset of the documents in a collection, filter the documents first:

  • Use the $match, $limit, and $skip stages to restrict the documents that enter the pipeline.
  • When possible, put $match at the beginning of the pipeline to use indexes that scan the matching documents in a collection.
  • $match followed by $sort at the start of the pipeline is equivalent to a single query with a sort, and can use an index.

Example

$sort + $skip + $limit Sequence

A pipeline contains a sequence of $sort followed by a $skip followed by a $limit:

{ $sort: { age : -1 } },
{ $skip: 10 },
{ $limit: 5 }

The optimizer performs $sort + $limit Coalescence to transform the sequence to the following:

{
  "$sort" : {
    "sortKey" : {
      "age" : -1
    },
    "limit" : Long(15)
  }
},
{
  "$skip" : Long(10)
}

MongoDB increases the $limit amount by the $skip amount during the reordering: the coalesced limit becomes 5 + 10 = 15, and the $skip of 10 then runs after the sort.
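
The arithmetic behind the larger limit can be checked on a plain array. The sketch below uses made-up ages and JavaScript's slice in place of the $skip and $limit stages; it only demonstrates that limiting to 15 and then skipping 10 yields the same five items as skipping 10 and then limiting to 5.

```javascript
// Made-up input: ages 20 through 49, sorted in descending order.
const ages = Array.from({ length: 30 }, (_, i) => i + 20);
const sorted = [...ages].sort((a, b) => b - a); // stands in for { $sort: { age: -1 } }

// Original pipeline: skip 10, then keep 5.
const original = sorted.slice(10).slice(0, 5);

// Optimized pipeline: keep the top 10 + 5 = 15 during the sort, then skip 10.
const optimized = sorted.slice(0, 15).slice(10);

console.log(original);
console.log(optimized);
```

Both variants select the same five values, which is why the optimizer can raise the limit and still return identical results.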