Map-Reduce and Sharded Collections

Note

Aggregation Pipeline as an Alternative to Map-Reduce

Starting in MongoDB 5.0, map-reduce is deprecated:

  • Instead of map-reduce, you should use an aggregation pipeline. Aggregation pipelines provide better performance and usability than map-reduce.
  • You can rewrite map-reduce operations using aggregation pipeline stages, such as $group, $merge, and others.
  • For map-reduce operations that require custom functionality, you can use the $accumulator and $function aggregation operators, available starting in version 4.4. You can use those operators to define custom aggregation expressions in JavaScript.
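
As a rough illustration of the rewrite described in the list above, the sketch below builds a pipeline from $group and $merge that could stand in for a map-reduce that sums one value per key. The collection names (orders, order_totals) and field names (cust_id, amount) are hypothetical and not taken from this page. The mongosh call appears only in a comment because it needs a live deployment.

```javascript
// Hypothetical map-reduce being replaced (shown for comparison only):
//   map:    function () { emit(this.cust_id, this.amount); }
//   reduce: function (key, values) { return Array.sum(values); }
//   out:    { merge: "order_totals" }

// Builds a pipeline that groups by cust_id, sums amount, and merges
// the grouped documents into the collection named by outColl.
function buildTotalsPipeline(outColl) {
  return [
    // One output document per cust_id, with the summed amount in "value".
    { $group: { _id: "$cust_id", value: { $sum: "$amount" } } },
    // Write results into outColl, replacing documents with a matching _id.
    {
      $merge: {
        into: outColl,
        on: "_id",
        whenMatched: "replace",
        whenNotMatched: "insert"
      }
    }
  ];
}

const totalsPipeline = buildTotalsPipeline("order_totals");

// In mongosh, against a live deployment (assumed collection name):
//   db.orders.aggregate(totalsPipeline)
```

Keeping the pipeline in a plain function makes it possible to inspect or reuse the stages without connecting to a server.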

For examples of aggregation pipeline alternatives to map-reduce, see:

Map-reduce supports operations on sharded collections, both as an input and as an output. This section describes the behaviors of mapReduce specific to sharded collections.

However, starting in version 4.2, MongoDB deprecates the map-reduce option to create a new sharded collection, as well as the use of the sharded option for map-reduce. To output to a sharded collection, create the sharded collection first. MongoDB 4.2 also deprecates the replacement of an existing sharded collection.

Sharded Collection as Input

When a sharded collection is the input for a map-reduce operation, mongos automatically dispatches the map-reduce job to each shard in parallel. No special option is required. mongos waits for the jobs on all shards to finish.
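
The dispatch-and-wait behavior can be pictured with a small in-memory model. This is only a conceptual sketch, not MongoDB's implementation: plain arrays stand in for shards, the jobs run one after another rather than in parallel, and every name in it (shards, mapFn, reduceFn, runOnShard, mapReduceAcrossShards) is made up for illustration.

```javascript
// Hypothetical documents, split across two stand-in "shards".
const shards = {
  shardA: [{ cust_id: "a", amount: 5 }, { cust_id: "b", amount: 7 }],
  shardB: [{ cust_id: "a", amount: 10 }, { cust_id: "c", amount: 1 }]
};

// map emits [key, value] pairs; reduce folds all values for one key.
const mapFn = (doc) => [[doc.cust_id, doc.amount]];
const reduceFn = (key, values) => values.reduce((sum, v) => sum + v, 0);

// Runs map and a local reduce over one shard's documents.
function runOnShard(docs) {
  const grouped = new Map();
  for (const doc of docs) {
    for (const [key, value] of mapFn(doc)) {
      if (!grouped.has(key)) grouped.set(key, []);
      grouped.get(key).push(value);
    }
  }
  const partial = new Map();
  for (const [key, values] of grouped) partial.set(key, reduceFn(key, values));
  return partial;
}

// Stand-in for mongos: one job per shard, collect every partial result,
// then reduce the partial values for each key once more.
function mapReduceAcrossShards(shardMap) {
  const partials = Object.values(shardMap).map(runOnShard);
  const combined = new Map();
  for (const partial of partials) {
    for (const [key, value] of partial) {
      if (!combined.has(key)) combined.set(key, []);
      combined.get(key).push(value);
    }
  }
  const result = {};
  for (const [key, values] of combined) result[key] = reduceFn(key, values);
  return result;
}

const shardTotals = mapReduceAcrossShards(shards);
console.log(shardTotals); // { a: 15, b: 7, c: 1 }
```

The second reduce over partial results only works because the reduce function here is associative and can safely be applied to its own output, which is also what mapReduce expects of a reduce function.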

Sharded Collection as Output

If the out field for mapReduce sets the sharded option, MongoDB shards the output collection using the _id field as the shard key.

Note

Starting in version 4.2, MongoDB deprecates the use of the sharded option for mapReduce/db.collection.mapReduce().

To output to a sharded collection:

  • If the output collection does not exist, create the sharded collection first.

    Starting in version 4.2, MongoDB deprecates the map-reduce option to create a new sharded collection and the use of the sharded option for map-reduce. As such, to output to a sharded collection, create the sharded collection first.

    If you do not create the sharded collection first, MongoDB creates the collection and shards it on the _id field. However, it is recommended that you create the sharded collection first.

  • Starting in version 4.2, MongoDB deprecates the replacement of an existing sharded collection.
  • Starting in version 4.0, if the output collection already exists but is not sharded, map-reduce fails.
  • For a new or an empty sharded collection, MongoDB uses the results of the first stage of the map-reduce operation to create the initial chunks distributed among the shards.
  • mongos dispatches, in parallel, a map-reduce post-processing job to every shard that owns a chunk. During post-processing, each shard pulls the results for its own chunks from the other shards, runs the final reduce/finalize, and writes locally to the output collection.
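
The recommended order (create the sharded collection first, then run map-reduce without the sharded option) might look like the sketch below. The database, collection, and field names (sales, orders, order_totals, cust_id, amount) are hypothetical. The steps are returned as plain command documents so they can be inspected without a cluster, and the mongosh calls that would send them are shown in comments.

```javascript
// Builds the command documents for: shard the output collection on _id,
// then run mapReduce with a merge output and no "sharded" option.
function buildShardedOutputPlan(dbName, inputColl, outputColl) {
  return {
    // Not required on newer server versions, but still accepted.
    enableSharding: { enableSharding: dbName },
    // Create and shard the (new, empty) output collection on _id.
    shardCollection: { shardCollection: `${dbName}.${outputColl}`, key: { _id: 1 } },
    mapReduce: {
      mapReduce: inputColl,
      // emit and Array.sum are provided by the server when these functions run there.
      map: function () { emit(this.cust_id, this.amount); },
      reduce: function (key, values) { return Array.sum(values); },
      // merge keeps the existing sharded collection instead of replacing it.
      out: { merge: outputColl }
    }
  };
}

const outputPlan = buildShardedOutputPlan("sales", "orders", "order_totals");

// In mongosh, connected to a mongos (assumed deployment):
//   db.adminCommand(outputPlan.enableSharding)
//   db.adminCommand(outputPlan.shardCollection)
//   db.getSiblingDB("sales").runCommand(outputPlan.mapReduce)
```

Using merge rather than replace in the out field avoids the replacement of an existing sharded collection, which the list above notes is deprecated as of version 4.2.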

Note

  • During later map-reduce jobs, MongoDB splits chunks as needed.
  • Balancing of chunks for the output collection is automatically prevented during post-processing to avoid concurrency issues.
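
The post-processing step, in which each shard that owns a chunk pulls the results for its own chunks and runs the final reduce, can be modeled in memory as follows. This is a simplified sketch under stated assumptions rather than the server's actual algorithm: the partial results, chunk boundaries, and all names (partialsByShard, chunks, reduceValues, postProcess) are hypothetical.

```javascript
// Partial results left on each shard by the first stage (hypothetical data).
const partialsByShard = {
  shardA: { apple: 5, kiwi: 2, pear: 7 },
  shardB: { apple: 10, mango: 1, pear: 3 }
};

// Hypothetical initial chunks of the output collection:
// half-open [min, max) ranges over _id, each owned by one shard.
const chunks = [
  { min: "", max: "m", owner: "shardA" },
  { min: "m", max: "\uffff", owner: "shardB" }
];

const reduceValues = (key, values) => values.reduce((sum, v) => sum + v, 0);

// Post-processing job for one shard: pull the partial results whose keys
// fall in chunks this shard owns, run the final reduce, keep them locally.
function postProcess(owner) {
  const owned = chunks.filter((c) => c.owner === owner);
  const inOwnedChunk = (key) => owned.some((c) => key >= c.min && key < c.max);
  const pulled = new Map();
  for (const partial of Object.values(partialsByShard)) {
    for (const [key, value] of Object.entries(partial)) {
      if (!inOwnedChunk(key)) continue;
      if (!pulled.has(key)) pulled.set(key, []);
      pulled.get(key).push(value);
    }
  }
  const local = {};
  for (const [key, values] of pulled) local[key] = reduceValues(key, values);
  return local;
}

// One job per shard that owns at least one chunk.
const outputByShard = {};
for (const owner of new Set(chunks.map((c) => c.owner))) {
  outputByShard[owner] = postProcess(owner);
}
console.log(outputByShard);
// { shardA: { apple: 15, kiwi: 2 }, shardB: { mango: 1, pear: 10 } }
```

In this model the chunk ownership table stays fixed while the jobs run. If ownership changed midway, two shards could write the same key range, which gives one intuition for why balancing of the output collection is prevented during post-processing.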