Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.分片是一种在多台机器上分发数据的方法。MongoDB使用分片来支持具有非常大的数据集和高吞吐量操作的部署。
Database systems with large data sets or high throughput applications can challenge the capacity of a single server. For example, high query rates can exhaust the CPU capacity of the server. Working set sizes larger than the system's RAM stress the I/O capacity of disk drives.具有大数据集或高吞吐量应用程序的数据库系统可能会挑战单个服务器的容量。例如,高查询率可能会耗尽服务器的CPU容量。大于系统RAM的工作集大小会对磁盘驱动器的I/O容量造成压力。
There are two methods for addressing system growth: vertical and horizontal scaling.有两种方法可以解决系统增长问题:垂直和水平扩展。
Vertical Scaling involves increasing the capacity of a single server, such as using a more powerful CPU, adding more RAM, or increasing the amount of storage space. Limitations in available technology may restrict a single machine from being sufficiently powerful for a given workload. Additionally, Cloud-based providers have hard ceilings based on available hardware configurations. As a result, there is a practical maximum for vertical scaling.垂直扩展涉及增加单个服务器的容量,例如使用更强大的CPU、添加更多RAM或增加存储空间。现有技术的局限性可能会限制一台机器在给定的工作负载下足够强大。此外,基于云的提供商根据可用的硬件配置有硬上限。因此,垂直缩放有一个实际的最大值。
Horizontal Scaling involves dividing the system dataset and load over multiple servers, adding additional servers to increase capacity as required. While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, potentially providing better efficiency than a single high-speed high-capacity server. Expanding the capacity of the deployment only requires adding additional servers as needed, which can be a lower overall cost than high-end hardware for a single machine. The trade off is increased complexity in infrastructure and maintenance for the deployment.横向扩展涉及将系统数据集和负载划分到多个服务器上,根据需要添加额外的服务器以增加容量。虽然单台机器的整体速度或容量可能不高,但每台机器都处理整体工作负载的一个子集,可能比单台高速高容量服务器提供更好的效率。扩展部署容量只需要根据需要添加额外的服务器,这可能比单台机器的高端硬件的总体成本更低。代价是部署的基础设施和维护的复杂性增加。
MongoDB supports horizontal scaling through sharding.MongoDB支持通过分片进行水平扩展。
You can shard collections in the UI for deployments hosted in MongoDB Atlas.您可以在MongoDB Atlas中托管的部署的UI中对集合进行分片。
Sharded Cluster分片集群
Note
A MongoDB sharded cluster consists of the following components:MongoDB分片集群由以下组件组成:
- shard
: Each shard contains a subset of the sharded data. Each shard must be deployed as a replica set.:每个分片都包含分片数据的一个子集。每个分片都必须部署为副本集。 Routing with mongos: The使用mongosacts as a query router, providing an interface between client applications and the sharded cluster.mongos进行路由:mongos充当查询路由器,在客户端应用程序和分片集群之间提供接口。config servers: Config servers store metadata and configuration settings for the cluster. Config servers must be deployed as a replica set (CSRS).配置服务器:配置服务器存储集群的元数据和配置设置。配置服务器必须作为副本集(CSRS)部署。
The following graphic describes the interaction of components within a sharded cluster:下图描述了分片集群中组件的交互:
MongoDB shards data at the collection level, distributing the collection data across the shards in the cluster.MongoDB在集合级别对数据进行分片,将集合数据分布在集群中的分片上。
Shard Keys分片钥匙
MongoDB uses the shard key to distribute the collection's documents across shards. The shard key consists of a field or multiple fields in the documents.MongoDB使用分片键在分片之间分发集合的文档。分片键由文档中的一个或多个字段组成。
Documents in sharded collections can be missing the shard key fields. Missing shard key fields are treated as having null values when distributing the documents across shards but not when routing queries. For more information, see Missing Shard Key Fields.分片集合中的文档可能缺少分片键字段。在跨分片分发文档时,缺少的分片键字段被视为具有null值,但在路由查询时则不会。有关详细信息,请参阅缺少分片键段。
You select the shard key when sharding a collection.在对集合进行分片时,您可以选择分片键。
Starting in MongoDB 5.0, you can reshard a collection by changing a collection's shard key.从MongoDB 5.0开始,您可以通过更改集合的分片键来重新分片集合。You can refine a shard key by adding a suffix field or fields to the existing shard key.您可以通过向现有分片键添加一个或多个后缀字段来细化分片键。
A document's shard key value determines its distribution across the shards. You can update a document's shard key value unless your shard key field is the immutable 文档的分片键值决定了它在分片中的分布。您可以更新文档的分片键值,除非分片键字段是不可变的_id field. For more information, see Change a Document's Shard Key Value._id字段。有关更多信息,请参阅更改文档的分片键值。
Shard Key Index分片关键指数
To shard a populated collection, the collection must have an index that starts with the shard key. When sharding an empty collection, MongoDB creates the supporting index if the collection does not already have an appropriate index for the specified shard key. 要对已填充的集合进行分片,该集合必须有一个以分片键开头的索引。当对空集合进行分片时,如果该集合还没有指定分片键的适当索引,MongoDB会创建支持索引。See Shard Key Indexes.请参见分片关键索引。
Shard Key Strategy分片化关键战略
The choice of shard key affects the performance, efficiency, and scalability of a sharded cluster. A cluster with the best possible hardware and infrastructure can be bottlenecked by the choice of shard key. The choice of shard key and its backing index can also affect the sharding strategy that your cluster can use.分片键的选择会影响分片集群的性能、效率和可扩展性。具有最佳硬件和基础设施的集群可能会因分片键的选择而受到瓶颈。分片键及其支持索引的选择也会影响集群可以使用的分片策略。
Chunks块
MongoDB partitions sharded data into chunks. Each chunk has an inclusive lower and exclusive upper range based on the shard key.MongoDB将分片数据分割成块。每个块都有一个基于分片键的包容性下限和排他性上限。
Balancer and Even Data Distribution平衡器和均匀数据分布
In an attempt to achieve an even distribution of data across all shards in the cluster, a balancer runs in the background to migrate ranges across the shards.为了在集群中的所有分片上实现数据的均匀分布,平衡器在后台运行以跨分片迁移范围。
Advantages of Sharding分片的优点
Reads / Writes读/写
MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations. MongoDB将读写工作负载分布在分片集群中的分片上,允许每个分片处理集群操作的一个子集。Both read and write workloads can be scaled horizontally across the cluster by adding more shards.通过添加更多分片,可以在集群中水平扩展读写工作负载。
For queries that include the shard key or the prefix of a compound shard key, 对于包含分片键或复合分片键前缀的查询,mongos can target the query at a specific shard or set of shards. mongos可以将查询定位在特定的分片或一组分片上。These targeted operations are generally more efficient than broadcasting to every shard in the cluster.这些有针对性的操作通常比向集群中的每个分片广播更有效。
Storage Capacity存储容量
Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster.分片将数据分布在集群中的各个分片上,允许每个分片包含集群总数据的一个子集。随着数据集的增长,额外的分片增加了集群的存储容量。
High Availability高可用性
The deployment of config servers and shards as replica sets provide increased availability.配置服务器和分片作为副本集的部署提供了更高的可用性。
Even if one or more shard replica sets become completely unavailable, the sharded cluster can continue to perform partial reads and writes. That is, while data on the unavailable shard(s) cannot be accessed, reads or writes directed at the available shards can still succeed.即使一个或多个分片副本集完全不可用,分片集群也可以继续执行部分读写操作。也就是说,虽然无法访问不可用分片上的数据,但针对可用分片的读取或写入仍然可以成功。
Considerations Before Sharding分片前的注意事项
Sharded cluster infrastructure requirements and complexity require careful planning, execution, and maintenance.分片化的集群基础设施要求和复杂性需要仔细的规划、执行和维护。
While you can reshard your collection later, it is important to carefully consider your shard key choice to avoid scalability and performance issues.虽然您可以稍后重新标记集合,但重要的是要仔细考虑分片键选择,以避免可扩展性和性能问题。
To understand the operational requirements and restrictions for sharding your collection, see Operational Restrictions in Sharded Clusters.要了解对集合进行分片的操作要求和限制,请参阅分片集群中的操作限制。
If queries do not include the shard key or the prefix of a compound shard key, 如果查询不包括分片键或复合分片键的前缀,mongos performs a broadcast operation, querying all shards in the sharded cluster. mongos将执行广播操作,查询分片集群中的所有分片。These scatter/gather queries can be long running operations.这些分散/聚集查询可能是长时间运行的操作。
Starting in MongoDB 5.1, when starting, restarting or adding a shard server with 从MongoDB 5.1开始,在启动、重新启动或使用sh.addShard() the Cluster Wide Write Concern (CWWC) must be set.sh.addShard()添加分片服务器时,必须设置集群范围写入关注(CWWC)。
If the 如果未设置CWWC is not set and the shard is configured such that the default write concern is { w : 1 } the shard server will fail to start or be added and returns an error.CWWC,并且分片配置为默认写入关注为{ w : 1 },则分片服务器将无法启动或被添加并返回错误。
See default write concern calculations for details on how the default write concern is calculated.有关如何计算默认写入关注的详细信息,请参阅默认写入关注计算。
Note
If you have an active support contract with MongoDB, consider contacting your account representative for assistance with sharded cluster planning and deployment.如果您与MongoDB签订了有效的支持合同,请考虑联系客户代表,以获得分片集群规划和部署方面的帮助。
Sharded and Non-Sharded Collections分片和非分片集合
A database can have a mixture of sharded and unsharded collections. 数据库可以包含分片和非分片集合。Sharded collections are partitioned and distributed across the shards in the cluster. Unsharded collections can be located on any shard but cannot span across shards.分片集合被分区并分布在集群中的分片上。未分片的集合可以位于任何分片上,但不能跨越分片。
Connecting to a Sharded Cluster连接到分片集群
You must connect to a mongos router to interact with any collection in the sharded cluster. 您必须连接到mongos路由器才能与分片集群中的任何集合进行交互。This includes sharded and unsharded collections. Clients should never connect to a single shard in order to perform read or write operations.这包括分片和非分片集合。客户端永远不应该连接到单个分片来执行读取或写入操作。
You can connect to a 您可以连接到mongos the same way you connect to a mongod using the mongosh or a MongoDB driver.mongos,就像使用mongosh或MongoDB驱动程序连接到mongod一样。
Note
Starting in MongoDB 8.0, you can only run certain commands on shards. If you attempt to connect directly to a shard and run an unsupported command, MongoDB returns an error:从MongoDB 8.0开始,您只能在分片上运行某些命令。如果您尝试直接连接到分片并运行不受支持的命令,MongoDB将返回错误:
"You are connecting to a sharded cluster improperly by connecting directly
to a shard. Please connect to the cluster via a router (mongos)."
To run a non-supported database command directly against a shard, you must either connect to 要直接对分片运行不受支持的数据库命令,您必须连接到mongos or have the maintenance-only directShardOperations role.mongos或拥有仅负责维护的directShardOperations角色。
Sharding Strategy分片化战略
MongoDB supports two sharding strategies for distributing data across sharded clusters.MongoDB支持两种分片策略,用于在分片集群之间分发数据。
Hashed Sharding哈希分片
Hashed Sharding involves computing a hash of the shard key field's value. Each chunk is then assigned a range based on the hashed shard key values.哈希分片涉及计算分片键字段值的哈希值。然后,根据哈希的分片键值为每个块分配一个范围。
Tip
MongoDB automatically computes the hashes when resolving queries using hashed indexes. Applications do not need to compute hashes.MongoDB在使用哈希索引解析查询时会自动计算哈希值。应用程序不需要计算哈希值。
While a range of shard keys may be "close", their hashed values are unlikely to be on the same chunk. Data distribution based on hashed values facilitates more even data distribution, especially in data sets where the shard key changes monotonically.虽然一系列分片键可能“接近”,但它们的哈希值不太可能在同一块上。基于哈希值的数据分布有助于更均匀的数据分布,特别是在分片键单调变化的数据集中。
However, hashed distribution means that range-based queries on the shard key are less likely to target a single shard, resulting in more cluster wide broadcast operations然而,哈希分布意味着基于分片键的范围查询不太可能针对单个分片,从而导致更多的集群范围广播操作
See Hashed Sharding for more information.有关更多信息,请参阅哈希分片。
Ranged Sharding测距分片
Ranged sharding involves dividing data into ranges based on the shard key values. Each chunk is then assigned a range based on the shard key values.范围分片涉及根据分片键值将数据划分为范围。然后根据分片键值为每个块分配一个范围。
A range of shard keys whose values are "close" are more likely to reside on the same chunk. 值为“close”的一系列分片键更有可能位于同一块上。This allows for targeted operations as a 这允许有针对性的操作,因为mongos can route the operations to only the shards that contain the required data.mongos可以将操作路由到仅包含所需数据的分片。
The efficiency of ranged sharding depends on the shard key chosen. Poorly considered shard keys can result in uneven distribution of data, which can negate some benefits of sharding or can cause performance bottlenecks. 范围分片的效率取决于所选择的分片键。考虑不周的分片键可能会导致数据分布不均,这可能会抵消分片的一些好处,或者导致性能瓶颈。See shard key selection for range-based sharding.有关基于范围的分片,请参阅分片键选择。
See Ranged Sharding for more information.有关更多信息,请参阅范围分片。
Zones in Sharded Clusters分片集群中的区域
Zones can help improve the locality of data for sharded clusters that span multiple data centers.区域可以帮助改善跨越多个数据中心的分片集群的数据局部性。
In sharded clusters, you can create zones of sharded data based on the shard key. 在分片集群中,您可以根据分片键创建分片数据区域。You can associate each zone with one or more shards in the cluster. A shard can associate with any number of zones. In a balanced cluster, MongoDB migrates chunks covered by a zone only to those shards associated with the zone.您可以将每个区域与集群中的一个或多个分片相关联。一个分片可以与任意数量的区域相关联。在平衡集群中,MongoDB只将区域覆盖的块迁移到与该区域关联的那些分片。
Each zone covers one or more ranges of shard key values. Each range a zone covers is always inclusive of its lower boundary and exclusive of its upper boundary.每个区域覆盖一个或多个分片键值范围。一个区域覆盖的每个范围总是包括其下限,不包括其上限。
You must use fields contained in the shard key when defining a new range for a zone to cover. If using a compound shard key, the range must include the prefix of the shard key. 在定义要覆盖的区域的新范围时,必须使用分片键中包含的字段。如果使用复合分片键,则范围必须包含分片键的前缀。See shard keys in zones for more information.有关更多信息,请参阅区域中的分片键。
The possible use of zones in the future should be taken into consideration when choosing a shard key.在选择分片键时,应考虑未来可能使用的区域。
Tip
Setting up zones and zone ranges before you shard an empty or a non-existing collection allows for a faster setup of zoned sharding.在对空集合或不存在的集合进行分片之前设置区域和区域范围,可以更快地设置分区分片。
Collations in Sharding分片整理
Use the 使用shardCollection command with the collation : { locale : "simple" } option to shard a collection which has a default collation. Successful sharding requires that:shardCollection命令和collation : { locale : "simple" }选项对具有默认排序规则的集合进行分片。成功的分片需要:
The collection must have an index whose prefix is the shard key集合必须有一个前缀为分片键的索引The index must have the collation索引必须具有排序规则{ locale: "simple" }{ locale: "simple" }
When creating new collections with a collation, ensure these conditions are met prior to sharding the collection.在使用排序规则创建新集合时,请确保在对集合进行分片之前满足这些条件。
Note
Queries on the sharded collection continue to use the default collation configured for the collection. To use the shard key index's 对分片集合的查询继续使用为该集合配置的默认排序规则。要使用分片键索引的simple collation, specify {locale : "simple"} in the query's collation document.simple排序规则,请在查询的排序规则文档中指定{locale : "simple"}。
See 有关分片和排序规则的更多信息,请参阅shardCollection for more information about sharding and collation.shardCollection。
Change Streams更改流
Change streams are available for replica sets and sharded clusters. Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a collection or collections.更改流可用于副本集和分片集群。变更流允许应用程序访问实时数据变更,而不会带来跟踪oplog的复杂性和风险。应用程序可以使用更改流订阅一个或多个集合上的所有数据更改。
Transactions事务
With the introduction of distributed transactions, multi-document transactions are available on sharded clusters.随着分布式事务的引入,多文档事务在分片集群上可用。
Until a transaction commits, the data changes made in the transaction are not visible outside the transaction.在事务提交之前,事务中所做的数据更改在事务外部不可见。
However, when a transaction writes to multiple shards, not all outside read operations need to wait for the result of the committed transaction to be visible across the shards. For example, if a transaction is committed and write 1 is visible on shard A but write 2 is not yet visible on shard B, an outside read at read concern 然而,当一个事务写入多个分片时,并非所有外部读取操作都需要等待提交事务的结果在分片之间可见。例如,如果一个事务已提交,并且写1在分片a上可见,但写2在分片B上尚不可见,则外部读取关注"local" can read the results of write 1 without seeing write 2."local"可以读取写1的结果,而看不到写2。
Learn More了解更多
Practical MongoDB Aggregations E-Book实用MongoDB聚合电子书
For more information on how sharding works with aggregations, read the sharding chapter in the Practical MongoDB Aggregations e-book.有关分片如何与聚合一起工作的更多信息,请阅读《实用MongoDB聚合》电子书中的分片章节。