$bucketAuto (aggregation)

Definition

$bucketAuto

Categorizes incoming documents into a specific number of groups, called buckets, based on a specified expression. Bucket boundaries are automatically determined in an attempt to evenly distribute the documents into the specified number of buckets.

Each bucket is represented as a document in the output. The document for each bucket contains:

  • An _id object that specifies the bounds of the bucket.

    • The _id.min field specifies the inclusive lower bound for the bucket.
    • The _id.max field specifies the upper bound for the bucket. This bound is exclusive for all buckets except the final bucket in the series, where it is inclusive.
  • A count field that contains the number of documents in the bucket. The count field is included by default when the output document is not specified.
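
As a rough illustration of these bound semantics, the following plain JavaScript sketch checks whether a value falls inside a bucket. The helper name inBucket and its isLast flag are hypothetical and are not part of MongoDB; the sketch only restates the inclusive and exclusive rules described above.

```javascript
// Hypothetical helper (not a MongoDB API): tests whether `value` belongs to a
// bucket whose bounds have the same shape as the _id field described above.
function inBucket(value, id, isLast) {
  if (value < id.min) return false;   // _id.min is inclusive
  return isLast
    ? value <= id.max                 // final bucket: _id.max is inclusive
    : value < id.max;                 // other buckets: _id.max is exclusive
}

console.log(inBucket(20, { min: 0, max: 20 }, false));  // false
console.log(inBucket(20, { min: 20, max: 40 }, false)); // true
console.log(inBucket(99, { min: 80, max: 99 }, true));  // true
```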

The $bucketAuto stage has the following form:

{
  $bucketAuto: {
      groupBy: <expression>,
      buckets: <number>,
      output: {
         <output1>: { <$accumulator expression> },
         ...
      },
      granularity: <string>
  }
}
Field: groupBy
Type: expression
Description: An expression to group documents by. To specify a field path, prefix the field name with a dollar sign $ and enclose it in quotes.

Field: buckets
Type: integer
Description: A positive 32-bit integer that specifies the number of buckets into which input documents are grouped.

Field: output
Type: document
Description: Optional. A document that specifies the fields to include in the output documents in addition to the _id field. To specify a field to include, you must use accumulator expressions:

<outputfield1>: { <accumulator>: <expression1> },
...

The default count field is not included in the output document when output is specified. To include it, explicitly specify the count expression as part of the output document:

output: {
  <outputfield1>: { <accumulator>: <expression1> },
  ...
  count: { $sum: 1 }
}

Field: granularity
Type: string
Description: Optional. A string that specifies the preferred number series to use to ensure that the calculated boundary edges end on preferred round numbers or their powers of 10.

Available only if all groupBy values are numeric and none of them are NaN.

The supported values of granularity are:

  • "R5"
  • "R10"
  • "R20"
  • "R40"
  • "R80"
  • "1-2-5"
  • "E6"
  • "E12"
  • "E24"
  • "E48"
  • "E96"
  • "E192"
  • "POWERSOF2"

Considerations

$bucketAuto and Memory Restrictions

The $bucketAuto stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $bucketAuto returns an error. To allow more space for stage processing, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
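
The following sketch shows one way the allowDiskUse option might be passed alongside a $bucketAuto pipeline. The collection name artwork and the price field are assumptions borrowed from the example later on this page. The aggregate call is guarded so that the snippet also runs outside mongosh, where no db object exists.

```javascript
// Pipeline with a single $bucketAuto stage on the (assumed) price field.
const pipeline = [
  { $bucketAuto: { groupBy: "$price", buckets: 4 } }
];

// allowDiskUse permits stages that exceed the memory limit to write
// data to temporary files instead of returning an error.
const options = { allowDiskUse: true };

// Run the aggregation only where a mongosh `db` object is available.
if (typeof db !== "undefined") {
  printjson(db.artwork.aggregate(pipeline, options).toArray());
}
```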

Behavior

There may be fewer than the specified number of buckets if:

  • The number of input documents is less than the specified number of buckets.
  • The number of unique values of the groupBy expression is less than the specified number of buckets.
  • The granularity has fewer intervals than the number of buckets.
  • The granularity is not fine enough to evenly distribute documents into the specified number of buckets.

If the groupBy expression refers to an array or document, the values are arranged using the same ordering as in $sort before determining the bucket boundaries.

The even distribution of documents across buckets depends on the cardinality, or the number of unique values, of the groupBy field. If the cardinality is not high enough, the $bucketAuto stage may not evenly distribute the results across buckets.
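
To make the effect of low cardinality concrete, here is a deliberately simplified model in plain JavaScript. It is not MongoDB's implementation: the function name sketchBucketAuto and its splitting rule (aim for equal-sized buckets, never split equal values across two buckets) are assumptions chosen only to illustrate why fewer buckets than requested can come back.

```javascript
// Simplified model, NOT MongoDB's algorithm: sort the values, aim for
// equal-sized buckets, and keep equal values together in one bucket.
function sketchBucketAuto(values, buckets) {
  const sorted = [...values].sort((a, b) => a - b);
  const result = [];
  let start = 0;
  while (start < sorted.length && result.length < buckets) {
    const remainingBuckets = buckets - result.length;
    const remaining = sorted.length - start;
    let end = remainingBuckets === 1
      ? sorted.length
      : start + Math.ceil(remaining / remainingBuckets);
    // Extend the bucket so duplicates of its last value stay together.
    while (end < sorted.length && sorted[end] === sorted[end - 1]) end++;
    result.push({ min: sorted[start], count: end - start });
    start = end;
  }
  // Each max is the next bucket's min; the last max is the largest value.
  return result.map((b, i) => ({
    _id: {
      min: b.min,
      max: i + 1 < result.length ? result[i + 1].min : sorted[sorted.length - 1]
    },
    count: b.count
  }));
}

// Eight values but only two distinct ones: four buckets are requested,
// yet the model can only produce two.
console.log(sketchBucketAuto([1, 1, 1, 1, 2, 2, 2, 2], 4));
```

In this model, a request for four buckets over values with only two distinct entries yields two buckets of four documents each, which mirrors the second condition in the list above.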

Granularity

The $bucketAuto stage accepts an optional granularity parameter which ensures that the boundaries of all buckets adhere to a specified preferred number series. Using a preferred number series provides more control over where the bucket boundaries are set among the range of values in the groupBy expression. Preferred number series may also be used to help set bucket boundaries logarithmically and evenly when the range of the groupBy expression scales exponentially.

Renard Series

The Renard number series are sets of numbers derived by taking the 5th, 10th, 20th, 40th, or 80th root of 10, then including various powers of the root that equate to values between 1.0 and 10.0 (10.3 in the case of R80).

Set granularity to R5, R10, R20, R40, or R80 to restrict bucket boundaries to values in the series. The values of the series are multiplied by a power of 10 when the groupBy values are outside of the 1.0 to 10.0 (10.3 for R80) range.

Example

The R5 series is based on the fifth root of 10, which is approximately 1.58, and includes various powers of this root (rounded) until 10 is reached. The R5 series is derived as follows:

  • 10^(0/5) = 1
  • 10^(1/5) ≈ 1.584, rounded to 1.6
  • 10^(2/5) ≈ 2.511, rounded to 2.5
  • 10^(3/5) ≈ 3.981, rounded to 4.0
  • 10^(4/5) ≈ 6.309, rounded to 6.3
  • 10^(5/5) = 10

The same approach is applied to the other Renard series to offer finer granularity, i.e., more intervals between 1.0 and 10.0 (10.3 for R80).
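
The R5 derivation can be reproduced with a few lines of plain JavaScript. This is only an arithmetic sketch: rounding to one decimal place matches the R5 values listed above, but the finer Renard series may use standardized roundings that this simple rule does not reproduce exactly.

```javascript
// Derive the R5 values as 10^(k/5) for k = 0..5, rounded to one decimal place.
const steps = 5;
const r5 = [];
for (let k = 0; k <= steps; k++) {
  const exact = Math.pow(10, k / steps);   // e.g. 10^(1/5) is about 1.5849
  r5.push(Math.round(exact * 10) / 10);    // keep one decimal place
}

console.log(r5); // values: 1, 1.6, 2.5, 4, 6.3, 10
```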

E Series

The E number series are similar to the Renard series in that they subdivide the interval from 1.0 to 10.0 by the 6th, 12th, 24th, 48th, 96th, or 192nd root of ten with a particular relative error.

Set granularity to E6, E12, E24, E48, E96, or E192 to restrict bucket boundaries to values in the series. The values of the series are multiplied by a power of 10 when the groupBy values are outside of the 1.0 to 10.0 range. To learn more about the E-series and their respective relative errors, see preferred number series.

1-2-5 Series

The 1-2-5 series behaves like a three-value Renard series, if such a series existed.

Set granularity to 1-2-5 to restrict bucket boundaries to various powers of the third root of 10, rounded to one significant digit.

Example

The following values are part of the 1-2-5 series: 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, and so on.
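
The link between the third root of 10 and the digits 1, 2, and 5 can be checked numerically. The sketch below is plain JavaScript with a hypothetical helper name, oneTwoFive. It rounds 10^(0/3), 10^(1/3), and 10^(2/3) to one significant digit and then scales the result across a few decades.

```javascript
// 10^(0/3) = 1, 10^(1/3) is about 2.15, 10^(2/3) is about 4.64;
// rounded to one significant digit these become 1, 2, and 5.
const mantissas = [0, 1, 2].map(k => Math.round(Math.pow(10, k / 3)));

// Hypothetical helper: list the series for decades 10^minExp .. 10^maxExp.
// Parsing strings such as "2e-1" avoids floating-point multiplication error.
function oneTwoFive(minExp, maxExp) {
  const series = [];
  for (let exp = minExp; exp <= maxExp; exp++) {
    for (const m of mantissas) {
      series.push(Number(`${m}e${exp}`));
    }
  }
  return series;
}

console.log(mantissas);         // values: 1, 2, 5
console.log(oneTwoFive(-1, 2)); // values: 0.1, 0.2, 0.5, 1, 2, 5, ..., 500
```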

Powers of Two Series

Set granularity to POWERSOF2 to restrict bucket boundaries to numbers that are a power of two.

Example

The following numbers belong to the powers of two series:

  • 2^0 = 1
  • 2^1 = 2
  • 2^2 = 4
  • 2^3 = 8
  • 2^4 = 16
  • 2^5 = 32
  • and so on...

A common application is computer hardware: components such as memory often come in sizes from the POWERSOF2 set of preferred numbers:

1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, and so on.
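
As a small arithmetic aid, the following plain JavaScript sketch lists the powers of two up to the first one that exceeds a given value. The helper name powersOfTwoCovering is hypothetical, and the sketch does not claim to show how $bucketAuto selects boundaries. It only enumerates the candidate values in the series.

```javascript
// Hypothetical helper: powers of two from 2^0 up to the first one above `limit`.
function powersOfTwoCovering(limit) {
  const powers = [1];
  while (powers[powers.length - 1] <= limit) {
    powers.push(powers[powers.length - 1] * 2);
  }
  return powers;
}

console.log(powersOfTwoCovering(99)); // values: 1, 2, 4, 8, 16, 32, 64, 128
```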

Comparing Different Granularities

The following operation demonstrates how specifying different values for granularity affects how $bucketAuto determines bucket boundaries. A collection named things contains documents with an _id numbered from 0 to 99:

{ _id: 0 }
{ _id: 1 }
...
{ _id: 99 }

Different values for granularity are substituted into the following operation:

db.things.aggregate( [
  {
    $bucketAuto: {
      groupBy: "$_id",
      buckets: 5,
      granularity: <granularity>
    }
  }
] )

The results in the following table demonstrate how different values for granularity yield different bucket boundaries:

Granularity: none specified
Results:
{ "_id" : { "min" : 0, "max" : 20 }, "count" : 20 }
{ "_id" : { "min" : 20, "max" : 40 }, "count" : 20 }
{ "_id" : { "min" : 40, "max" : 60 }, "count" : 20 }
{ "_id" : { "min" : 60, "max" : 80 }, "count" : 20 }
{ "_id" : { "min" : 80, "max" : 99 }, "count" : 20 }

Granularity: R20
Results:
{ "_id" : { "min" : 0, "max" : 20 }, "count" : 20 }
{ "_id" : { "min" : 20, "max" : 40 }, "count" : 20 }
{ "_id" : { "min" : 40, "max" : 63 }, "count" : 23 }
{ "_id" : { "min" : 63, "max" : 90 }, "count" : 27 }
{ "_id" : { "min" : 90, "max" : 100 }, "count" : 10 }

Granularity: E24
Results:
{ "_id" : { "min" : 0, "max" : 20 }, "count" : 20 }
{ "_id" : { "min" : 20, "max" : 43 }, "count" : 23 }
{ "_id" : { "min" : 43, "max" : 68 }, "count" : 25 }
{ "_id" : { "min" : 68, "max" : 91 }, "count" : 23 }
{ "_id" : { "min" : 91, "max" : 100 }, "count" : 9 }

Granularity: 1-2-5
Results:
{ "_id" : { "min" : 0, "max" : 20 }, "count" : 20 }
{ "_id" : { "min" : 20, "max" : 50 }, "count" : 30 }
{ "_id" : { "min" : 50, "max" : 100 }, "count" : 50 }
Notes: The specified number of buckets exceeds the number of intervals in the series.

Granularity: POWERSOF2
Results:
{ "_id" : { "min" : 0, "max" : 32 }, "count" : 32 }
{ "_id" : { "min" : 32, "max" : 64 }, "count" : 32 }
{ "_id" : { "min" : 64, "max" : 128 }, "count" : 36 }
Notes: The specified number of buckets exceeds the number of intervals in the series.
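
The counts in these results can be cross-checked in plain JavaScript. The sketch assumes _id values 0 through 99, which is what the counts imply, and takes the boundaries directly from the results rather than computing them. The helper name countPerBucket is hypothetical.

```javascript
// Assumed input: _id values 0..99.
const ids = Array.from({ length: 100 }, (_, i) => i);

// Count the values in each [min, max) interval; the last interval also
// includes its upper bound, matching the _id.max rule for the final bucket.
function countPerBucket(values, boundaries) {
  const counts = [];
  for (let i = 0; i + 1 < boundaries.length; i++) {
    const min = boundaries[i];
    const max = boundaries[i + 1];
    const isLast = i + 2 === boundaries.length;
    counts.push(
      values.filter(v => v >= min && (isLast ? v <= max : v < max)).length
    );
  }
  return counts;
}

console.log(countPerBucket(ids, [0, 20, 40, 63, 90, 100])); // R20: 20, 20, 23, 27, 10
console.log(countPerBucket(ids, [0, 20, 43, 68, 91, 100])); // E24: 20, 23, 25, 23, 9
console.log(countPerBucket(ids, [0, 20, 50, 100]));         // 1-2-5: 20, 30, 50
console.log(countPerBucket(ids, [0, 32, 64, 128]));         // POWERSOF2: 32, 32, 36
```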

Example

Consider a collection artwork with the following documents:

{ "_id" : 1, "title" : "The Pillars of Society", "artist" : "Grosz", "year" : 1926,
    "price" : NumberDecimal("199.99"),
    "dimensions" : { "height" : 39, "width" : 21, "units" : "in" } }
{ "_id" : 2, "title" : "Melancholy III", "artist" : "Munch", "year" : 1902,
    "price" : NumberDecimal("280.00"),
    "dimensions" : { "height" : 49, "width" : 32, "units" : "in" } }
{ "_id" : 3, "title" : "Dancer", "artist" : "Miro", "year" : 1925,
    "price" : NumberDecimal("76.04"),
    "dimensions" : { "height" : 25, "width" : 20, "units" : "in" } }
{ "_id" : 4, "title" : "The Great Wave off Kanagawa", "artist" : "Hokusai",
    "price" : NumberDecimal("167.30"),
    "dimensions" : { "height" : 24, "width" : 36, "units" : "in" } }
{ "_id" : 5, "title" : "The Persistence of Memory", "artist" : "Dali", "year" : 1931,
    "price" : NumberDecimal("483.00"),
    "dimensions" : { "height" : 20, "width" : 24, "units" : "in" } }
{ "_id" : 6, "title" : "Composition VII", "artist" : "Kandinsky", "year" : 1913,
    "price" : NumberDecimal("385.00"),
    "dimensions" : { "height" : 30, "width" : 46, "units" : "in" } }
{ "_id" : 7, "title" : "The Scream", "artist" : "Munch",
    "price" : NumberDecimal("159.00"),
    "dimensions" : { "height" : 24, "width" : 18, "units" : "in" } }
{ "_id" : 8, "title" : "Blue Flower", "artist" : "O'Keefe", "year" : 1918,
    "price" : NumberDecimal("118.42"),
    "dimensions" : { "height" : 24, "width" : 20, "units" : "in" } }

Single Facet Aggregation

In the following operation, input documents are grouped into four buckets according to the values in the price field:

db.artwork.aggregate( [
   {
     $bucketAuto: {
         groupBy: "$price",
         buckets: 4
     }
   }
] )

The operation returns the following documents:

{
  "_id" : {
    "min" : NumberDecimal("76.04"),
    "max" : NumberDecimal("159.00")
  },
  "count" : 2
}
{
  "_id" : {
    "min" : NumberDecimal("159.00"),
    "max" : NumberDecimal("199.99")
  },
  "count" : 2
}
{
  "_id" : {
    "min" : NumberDecimal("199.99"),
    "max" : NumberDecimal("385.00")
  },
  "count" : 2
}
{
  "_id" : {
    "min" : NumberDecimal("385.00"),
    "max" : NumberDecimal("483.00")
  },
  "count" : 2
}

Multi-Faceted Aggregation

The $bucketAuto stage can be used within the $facet stage to process multiple aggregation pipelines on the same set of input documents from artwork.

The following aggregation pipeline groups the documents from the artwork collection into buckets based on price, year, and the calculated area:

db.artwork.aggregate( [
  {
    $facet: {
      "price": [
        {
          $bucketAuto: {
            groupBy: "$price",
            buckets: 4
          }
        }
      ],
      "year": [
        {
          $bucketAuto: {
            groupBy: "$year",
            buckets: 3,
            output: {
              "count": { $sum: 1 },
              "years": { $push: "$year" }
            }
          }
        }
      ],
      "area": [
        {
          $bucketAuto: {
            groupBy: {
              $multiply: [ "$dimensions.height", "$dimensions.width" ]
            },
            buckets: 4,
            output: {
              "count": { $sum: 1 },
              "titles": { $push: "$title" }
            }
          }
        }
      ]
    }
  }
] )

The operation returns the following document:

{
  "area" : [
    {
      "_id" : { "min" : 432, "max" : 500 },
      "count" : 3,
      "titles" : [
        "The Scream",
        "The Persistence of Memory",
        "Blue Flower"
      ]
    },
    {
      "_id" : { "min" : 500, "max" : 864 },
      "count" : 2,
      "titles" : [
        "Dancer",
        "The Pillars of Society"
      ]
    },
    {
      "_id" : { "min" : 864, "max" : 1568 },
      "count" : 2,
      "titles" : [
        "The Great Wave off Kanagawa",
        "Composition VII"
      ]
    },
    {
      "_id" : { "min" : 1568, "max" : 1568 },
      "count" : 1,
      "titles" : [
        "Melancholy III"
      ]
    }
  ],
  "price" : [
    {
      "_id" : { "min" : NumberDecimal("76.04"), "max" : NumberDecimal("159.00") },
      "count" : 2
    },
    {
      "_id" : { "min" : NumberDecimal("159.00"), "max" : NumberDecimal("199.99") },
      "count" : 2
    },
    {
      "_id" : { "min" : NumberDecimal("199.99"), "max" : NumberDecimal("385.00") },
      "count" : 2
    },
    {
      "_id" : { "min" : NumberDecimal("385.00"), "max" : NumberDecimal("483.00") },
      "count" : 2
    }
  ],
  "year" : [
    { "_id" : { "min" : null, "max" : 1913 }, "count" : 3, "years" : [ 1902 ] },
    { "_id" : { "min" : 1913, "max" : 1926 }, "count" : 3, "years" : [ 1913, 1918, 1925 ] },
    { "_id" : { "min" : 1926, "max" : 1931 }, "count" : 2, "years" : [ 1926, 1931 ] }
  ]
}
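
To see where the area buckets come from, the computed areas can be reproduced outside the database. The following plain JavaScript sketch copies the heights and widths from the artwork documents above, applies the same multiplication as the $multiply expression, and assigns titles to the bucket lower bounds taken from the output. The variable names are illustrative only.

```javascript
// Heights and widths copied from the artwork documents above.
const artwork = [
  { title: "The Pillars of Society", height: 39, width: 21 },
  { title: "Melancholy III", height: 49, width: 32 },
  { title: "Dancer", height: 25, width: 20 },
  { title: "The Great Wave off Kanagawa", height: 24, width: 36 },
  { title: "The Persistence of Memory", height: 20, width: 24 },
  { title: "Composition VII", height: 30, width: 46 },
  { title: "The Scream", height: 24, width: 18 },
  { title: "Blue Flower", height: 24, width: 20 }
];

// Same arithmetic as the $multiply expression in the area facet.
const byArea = artwork
  .map(a => ({ title: a.title, area: a.height * a.width }))
  .sort((a, b) => a.area - b.area); // stable sort keeps ties in input order

// Bucket lower bounds taken from the "area" output above.
const mins = [432, 500, 864, 1568];
const titlesPerBucket = mins.map((min, i) => {
  const next = mins[i + 1]; // undefined for the final bucket
  return byArea
    .filter(a => a.area >= min && (next === undefined || a.area < next))
    .map(a => a.title);
});

console.log(byArea.map(a => a.area)); // values: 432, 480, 480, 500, 819, 864, 1380, 1568
console.log(titlesPerBucket);
```

Two works share an area of 480, which is why the first bucket holds three documents rather than two. The largest area, 1568, ends up alone in the final bucket, whose upper bound is inclusive.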