Troubleshoot Sharded Clusters

This page describes common strategies for troubleshooting sharded cluster deployments.

Application Servers or mongos Instances Become Unavailable

If each application server has its own mongos instance, other application servers can continue to access the database when one of them becomes unavailable. Furthermore, mongos instances do not maintain persistent state, so they can restart or become unavailable without losing any state or data. When a mongos instance starts, it retrieves a copy of the config database and can begin routing queries.

A Single Member Becomes Unavailable in a Shard Replica Set

Replica sets provide high availability for shards. If the unavailable mongod is a primary, the replica set elects a new primary. If the unavailable mongod is a secondary and it disconnects, the primary and the remaining secondaries continue to hold all data. In a three-member replica set, even if a single member of the set experiences catastrophic failure, the two other members have full copies of the data. [1]

Always investigate availability interruptions and failures. If a system is unrecoverable, replace it and create a new member of the replica set as soon as possible to restore the lost redundancy.

[1] If an unavailable secondary becomes available while it still has current oplog entries, it can catch up to the latest state of the set using the normal replication process; otherwise, it must perform an initial sync.
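
The decision described in footnote [1] can be sketched as a simple comparison. This is a simplified model, not MongoDB's implementation: the `OplogWindow` type, the `recovery_path` function, and the use of plain integers for operation times are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class OplogWindow:
    """Oldest and newest operation times still held in a sync source's oplog."""
    oldest: int
    newest: int


def recovery_path(last_applied: int, source: OplogWindow) -> str:
    """Return how a returning secondary gets back in sync with its source."""
    if last_applied >= source.newest:
        return "up to date"
    if last_applied >= source.oldest:
        # The entries the secondary is missing are still in the source's oplog.
        return "catch up via replication"
    # The entries following last_applied have already rolled out of the oplog.
    return "initial sync"


window = OplogWindow(oldest=1_000, newest=5_000)
print(recovery_path(4_200, window))  # catch up via replication
print(recovery_path(400, window))    # initial sync
```

The longer a member stays offline, the more likely its last applied operation falls outside the window and an initial sync is needed.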

All Members of a Shard Become Unavailable

In a sharded cluster, mongod and mongos instances monitor the replica sets in the sharded cluster (e.g. shard replica sets, config server replica set).

If all members of a replica set shard are unavailable, all data held in that shard is unavailable. However, the data on all other shards remains available, and you can still read and write data on the other shards. Your application must be able to deal with partial results, and you should investigate the cause of the interruption and attempt to recover the shard as soon as possible.

A Config Server Replica Set Member Becomes Unavailable

Replica sets provide high availability for the config servers. [2] If an unavailable config server is a primary, then the replica set will elect a new primary.

If the config server replica set loses its primary and cannot elect a new one, the cluster's metadata becomes read-only. You can still read and write data from the shards, but no chunk migrations or chunk splits will occur until a primary is available.

Note

Distributing replica set members across two data centers provides a benefit over a single data center. In a two data center distribution:

  • If one of the data centers goes down, the data is still available for reads, unlike in a single data center distribution.
  • If the data center with a minority of the members goes down, the replica set can still serve write operations as well as read operations.
  • However, if the data center with the majority of the members goes down, the replica set becomes read-only.

If possible, distribute members across at least three data centers. For config server replica sets (CSRS), the best practice is to distribute across three (or more, depending on the number of members) data centers. If the cost of the third data center is prohibitive, one distribution possibility is to evenly distribute the data-bearing members across the two data centers and store the remaining member in the cloud if your company policy allows.
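
The majority rule behind these recommendations can be checked with a few lines of arithmetic. The following is a minimal sketch under the assumption that every member is a voting member; the function name and the site names `dc1`, `dc2`, and `cloud` are hypothetical.

```python
from typing import Mapping


def can_elect_primary(members_per_site: Mapping[str, int], failed_site: str) -> bool:
    """True if the surviving members still form a strict majority of the set."""
    total = sum(members_per_site.values())
    surviving = total - members_per_site[failed_site]
    return surviving * 2 > total


two_sites = {"dc1": 2, "dc2": 1}
three_sites = {"dc1": 1, "dc2": 1, "cloud": 1}

for layout in (two_sites, three_sites):
    for site in layout:
        state = "writable" if can_elect_primary(layout, site) else "read-only"
        print(f"{site} down -> {state}")
```

With these layouts, only the loss of `dc1` in the two-site layout leaves the set read-only; the three-site layout tolerates the loss of any single site.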

Note

All config servers must be running and available when you first initiate a sharded cluster.

[2] Starting in MongoDB 3.4, the use of three mirrored mongod instances (SCCC) as config servers is no longer supported.

Cursor Fails Because of Stale Config Data

A query returns the following warning when one or more of the mongos instances have not yet updated their cache of the cluster's metadata from the config database:

could not initialize cursor across all shards because : stale config detected

This warning should not propagate back to your application. The warning will repeat until all the mongos instances refresh their caches. To force an instance to refresh its cache, run the flushRouterConfig command.
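
As an illustration, the command can be issued from a driver. The sketch below assumes the PyMongo driver and a client connected directly to one mongos; the helper name and the host name in the usage comment are hypothetical.

```python
def flush_router_config(client):
    """Ask the mongos behind `client` to refresh its cached cluster metadata.

    `client` is assumed to be a pymongo.MongoClient connected directly to a
    single mongos instance (an assumption of this sketch).
    """
    # Sends { flushRouterConfig: 1 } to the admin database.
    return client.admin.command("flushRouterConfig")


# Hypothetical usage; the host name is a placeholder:
# from pymongo import MongoClient
# flush_router_config(MongoClient("mongodb://mongos0.example.net:27017"))
```

Because each mongos keeps its own cache, the call would have to be repeated against every instance that still reports the warning.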

Shard Keys

To troubleshoot a shard key, see Troubleshoot Shard Keys.

Cluster Availability

To ensure cluster availability:

  • Each shard should be a replica set. If a specific mongod instance fails, the replica set members will elect another to be primary and continue operation. However, if an entire shard is unreachable or fails for some reason, that data will be unavailable.
  • The shard key should allow the mongos to isolate most operations to a single shard. If operations can be processed by a single shard, the failure of a single shard will only render some data unavailable. If operations need to access all shards for queries, the failure of a single shard will render the entire cluster unavailable.
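
The effect of the shard key on availability can be modeled in a few lines. This is a toy model, not mongos routing logic: the chunk ranges, shard names, and function names are hypothetical, and a query is treated as failed if any shard it must contact is down.

```python
from typing import Optional, Set

# Hypothetical chunk ranges on an integer shard key: (lower, upper, shard).
CHUNKS = [(0, 100, "shardA"), (100, 200, "shardB"), (200, 300, "shardC")]


def target_shards(shard_key: Optional[int]) -> Set[str]:
    """Shards a query must contact: one if the shard key is given, else all."""
    if shard_key is None:
        return {shard for _, _, shard in CHUNKS}
    return {shard for low, high, shard in CHUNKS if low <= shard_key < high}


def can_complete(shard_key: Optional[int], down: Set[str]) -> bool:
    """A query completes only if none of the shards it targets is down."""
    return not (target_shards(shard_key) & down)


down = {"shardB"}
print(can_complete(42, down))    # True: targeted at shardA only
print(can_complete(150, down))   # False: targeted at the failed shard
print(can_complete(None, down))  # False: must contact every shard
```

In this model, queries that include the shard key keep working for two of the three key ranges, while queries without it fail as soon as any one shard is down.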

Config Database String Error

Changed in version 3.2.

Starting in MongoDB 3.2, config servers can be deployed as replica sets. The mongos instances for the sharded cluster must specify the same config server replica set name but can specify the hostnames and ports of different members of the replica set.

Starting in 3.4, the use of the deprecated mirrored mongod instances as config servers (SCCC) is no longer supported. Before you can upgrade your sharded clusters to 3.4, you must convert your config servers from SCCC to CSRS.

To convert your config servers from SCCC to CSRS, see "Upgrade Config Servers to Replica Set" in the MongoDB 3.4 manual.

With earlier versions of MongoDB sharded clusters that use the topology of three mirrored mongod instances for config servers, the mongos instances in a sharded cluster must specify an identical configDB string.
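
For replica set config servers, a quick consistency check over the configDB values used by each mongos might look like the following. This is a minimal sketch assuming the `<replSetName>/<host:port>,<host:port>` form of the setting; the function names, the replica set name `configRS`, and the host names are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_config_db(value: str) -> Tuple[str, List[str]]:
    """Split '<replSetName>/<host:port>,<host:port>' into its name and hosts."""
    name, sep, hosts = value.partition("/")
    if not sep or not name or not hosts:
        raise ValueError(f"not a CSRS configDB string: {value!r}")
    return name, [h.strip() for h in hosts.split(",") if h.strip()]


def same_config_replica_set(values: Iterable[str]) -> bool:
    """True if every mongos names the same config server replica set."""
    names = {parse_config_db(v)[0] for v in values}
    return len(names) == 1


settings = [
    "configRS/cfg1.example.net:27019,cfg2.example.net:27019",
    "configRS/cfg3.example.net:27019",
]
print(same_config_replica_set(settings))  # True
```

The two strings list different members but the same replica set name, which is the case the text above permits.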

Avoid Downtime when Moving Config Servers

Use CNAMEs to identify your config servers to the cluster so that you can rename and renumber your config servers without downtime.

moveChunk commit failed Error

At the end of a chunk migration, the shard must connect to the config database to update the chunk's record in the cluster metadata. If the shard fails to connect to the config database, MongoDB reports the following error:

ERROR: moveChunk commit failed: version is at <n>|<nn> instead of
<N>|<NN>" and "ERROR: TERMINATING"

When this happens, the primary member of the shard's replica set terminates to protect data consistency. If a secondary member can access the config database, data on the shard becomes accessible again after an election.

You will need to resolve the chunk migration failure independently. If you encounter this issue, ask the MongoDB Community or MongoDB Support for help.