Troubleshoot Sharded Clusters

On this page

Application Servers or mongos Instances Become Unavailable
A Single Member Becomes Unavailable in a Shard Replica Set
All Members of a Shard Become Unavailable
A Config Server Replica Set Member Becomes Unavailable
Cursor Fails Because of Stale Config Data
Shard Keys
Cluster Availability
Config Database String Error
Avoid Downtime when Moving Config Servers
moveChunk commit failed Error

This page describes common strategies for troubleshooting sharded cluster deployments.

Application Servers or `mongos` Instances Become Unavailable

If each application server has its own mongos instance, other application servers can continue to access the database when one application server or its mongos becomes unavailable. Furthermore, mongos instances do not maintain persistent state, so they can restart or become unavailable without losing any state or data. When a mongos instance starts, it retrieves a copy of the config database and can begin routing queries.

A Single Member Becomes Unavailable in a Shard Replica Set

Replica sets provide high availability for shards. If the unavailable mongod is a primary, the replica set elects a new primary. If the unavailable mongod is a secondary and it disconnects, the primary and the remaining secondary continue to hold all data. In a three-member replica set, even if a single member of the set experiences catastrophic failure, the two other members have full copies of the data. [1]

Always investigate availability interruptions and failures. If a system is unrecoverable, replace it and create a new member of the replica set as soon as possible to restore the lost redundancy.

[1]	If an unavailable secondary becomes available while it still has current oplog entries, it can catch up to the latest state of the set using the normal replication process; otherwise, it must perform an initial sync.
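
The choice described in footnote [1] can be pictured with a small sketch. This is a simplified model, not MongoDB's actual sync-source logic: it assumes operation timestamps are plain integers, and the names `OplogWindow` and `recovery_path` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OplogWindow:
    """Oldest and newest operation timestamps still held in a sync source's oplog."""
    oldest_ts: int
    newest_ts: int


def recovery_path(last_applied_ts: int, source: OplogWindow) -> str:
    """Return how a returning secondary would rejoin, in this simplified model."""
    if last_applied_ts >= source.oldest_ts:
        # The source still holds every entry after the secondary's last applied one.
        return "replicate"
    # Entries the secondary needs have already rolled off the oplog.
    return "initial sync"


window = OplogWindow(oldest_ts=1_000, newest_ts=5_000)
print(recovery_path(4_200, window))  # replicate
print(recovery_path(300, window))    # initial sync
```

The longer a member stays offline, the more likely its last applied entry has aged out of the other members' oplogs, which is one reason to investigate failures promptly.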

All Members of a Shard Become Unavailable

In a sharded cluster, mongod and mongos instances monitor the replica sets in the sharded cluster (e.g. shard replica sets, config server replica set).

If all members of a replica set shard are unavailable, all data held in that shard is unavailable. However, the data on all other shards remains available, and you can still read from and write to those shards. Your application must be able to deal with partial results, and you should investigate the cause of the interruption and attempt to recover the shard as soon as possible.

A Config Server Replica Set Member Becomes Unavailable

Replica sets provide high availability for the config servers. If an unavailable config server is a primary, then the replica set will elect a new primary.

If the config server replica set loses its primary and cannot elect a new one, the cluster's metadata becomes read-only. You can still read data from and write data to the shards, but no chunk migrations or chunk splits occur until a primary is available.

Note

Distributing replica set members across two data centers provides a benefit over a single data center. In a two data center distribution:

If one of the data centers goes down, the data is still available for reads, unlike a single data center distribution.
If the data center with a minority of the members goes down, the replica set can still serve write operations as well as read operations.
However, if the data center with the majority of the members goes down, the replica set becomes read-only.

If possible, distribute members across at least three data centers. For config server replica sets (CSRS), the best practice is to distribute across three (or more, depending on the number of members) data centers. If the cost of the third data center is prohibitive, one option is to distribute the data-bearing members evenly across the two data centers and store the remaining member in the cloud, if your company policy allows.
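
The reasoning in this note comes down to whether the surviving members still form a majority. A minimal sketch, assuming every member votes and using the hypothetical helper `can_elect_primary` with made-up data center names:

```python
def can_elect_primary(members_per_dc: dict[str, int], failed_dc: str) -> bool:
    """True if the surviving voting members still form a strict majority."""
    total = sum(members_per_dc.values())
    surviving = total - members_per_dc.get(failed_dc, 0)
    return surviving > total // 2


two_dcs = {"dc-a": 2, "dc-b": 1}
three_dcs = {"dc-a": 1, "dc-b": 1, "cloud": 1}

print(can_elect_primary(two_dcs, "dc-b"))  # True: 2 of 3 members remain
print(can_elect_primary(two_dcs, "dc-a"))  # False: 1 of 3 remains, the set is read-only
print(all(can_elect_primary(three_dcs, dc) for dc in three_dcs))  # True
```

With three locations holding one member each, losing any single location still leaves two of three members, which is why the three data center layout is preferred.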

Note

All config servers must be running and available when you first initiate a sharded cluster.

Cursor Fails Because of Stale Config Data

A query returns the following warning when one or more of the mongos instances has not yet updated its cache of the cluster's metadata from the config database:

could not initialize cursor across all shards because : stale config detected

This warning should not propagate back to your application. The warning will repeat until all the mongos instances refresh their caches. To force an instance to refresh its cache, run the flushRouterConfig command.
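
As a sketch of how the refresh might be scripted from Python, the helper below sends flushRouterConfig to several routers in turn. It assumes each client behaves like a pymongo `MongoClient` connected directly to one mongos, so that `client.admin.command(...)` runs a command against the admin database; the function name `flush_router_configs` is hypothetical.

```python
from typing import Any, Iterable


def flush_router_configs(mongos_clients: Iterable[Any]) -> list[Any]:
    """Send flushRouterConfig to each mongos and collect the replies.

    Each client is assumed to expose client.admin.command(name), as
    pymongo.MongoClient does; nothing here is specific to one driver version.
    """
    replies = []
    for client in mongos_clients:
        # The command is issued per router because each mongos keeps its own cache.
        replies.append(client.admin.command("flushRouterConfig"))
    return replies
```

Because every mongos caches metadata independently, the command has to reach each router that is still reporting the warning, not just one of them.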

Shard Keys

To troubleshoot a shard key, see Troubleshoot Shard Keys.

Cluster Availability

To ensure cluster availability:

Each shard should be a replica set. If a specific mongod instance fails, the replica set members elect another to be primary and continue operation. However, if an entire shard is unreachable or fails for some reason, the data on that shard will be unavailable.
The shard key should allow the mongos to isolate most operations to a single shard. If operations can be processed by a single shard, the failure of a single shard will only render some data unavailable. If operations need to access all shards for queries, the failure of a single shard will render the entire cluster unavailable.
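
The effect of shard key targeting on availability can be illustrated with a toy model. Everything here is hypothetical: the shard names, the integer key ranges, and the helpers `shards_for_query` and `query_succeeds` stand in for the routing a mongos performs and are not its real algorithm.

```python
from typing import Optional

# Hypothetical cluster: each shard owns a half-open range [low, high) of shard key values.
CHUNK_RANGES = {
    "shard-a": (0, 100),
    "shard-b": (100, 200),
    "shard-c": (200, 300),
}


def shards_for_query(shard_key_value: Optional[int]) -> set[str]:
    """Shards a router would contact: one if the key is known, otherwise all."""
    if shard_key_value is None:
        return set(CHUNK_RANGES)  # scatter-gather across every shard
    return {
        shard
        for shard, (low, high) in CHUNK_RANGES.items()
        if low <= shard_key_value < high
    }


def query_succeeds(shard_key_value: Optional[int], down: set[str]) -> bool:
    """A query succeeds only if none of the shards it needs is down."""
    return shards_for_query(shard_key_value).isdisjoint(down)


down = {"shard-b"}
print(query_succeeds(42, down))    # True: targets shard-a only
print(query_succeeds(150, down))   # False: targets the failed shard
print(query_succeeds(None, down))  # False: needs every shard
```

In this model, a query that includes the shard key fails only when its own shard is down, while a query without it fails whenever any shard is down.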

Config Database String Error

Config servers can be deployed as replica sets. The mongos instances for the sharded cluster must specify the same config server replica set name but can specify hostname and port of different members of the replica set.

With earlier versions of MongoDB sharded clusters that use the topology of three mirrored mongod instances for config servers, mongos instances in a sharded cluster must specify an identical configDB string.
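
For the replica set form, one way to catch a mismatch before starting the routers is to compare the configDB values they were given. The sketch below assumes the `<replSetName>/<host:port>,<host:port>` form of the string; the helper names and the hostnames are hypothetical.

```python
def parse_config_db(config_db: str) -> tuple[str, frozenset[str]]:
    """Split '<replSetName>/<host:port>,<host:port>' into (name, hosts)."""
    name, sep, hosts = config_db.partition("/")
    if not sep or not name or not hosts:
        raise ValueError(f"expected '<replSetName>/<host:port>,...': {config_db!r}")
    return name, frozenset(h.strip() for h in hosts.split(",") if h.strip())


def same_config_replica_set(config_db_strings: list[str]) -> bool:
    """True if every mongos names the same config server replica set."""
    names = {parse_config_db(s)[0] for s in config_db_strings}
    return len(names) == 1


# Hypothetical settings taken from two mongos instances.
settings = [
    "cfgRS/cfg1.example.net:27019,cfg2.example.net:27019",
    "cfgRS/cfg3.example.net:27019",  # different member, same set: acceptable
]
print(same_config_replica_set(settings))  # True
print(same_config_replica_set(settings + ["other/cfg1.example.net:27019"]))  # False
```

The check compares only the replica set name, since the routers may list different members of the same set.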

Avoid Downtime when Moving Config Servers

Use CNAMEs to identify your config servers to the cluster so that you can rename and renumber your config servers without downtime.

`moveChunk commit failed` Error

At the end of a chunk migration, the shard must connect to the config database to update the chunk's record in the cluster metadata. If the shard fails to connect to the config database, MongoDB reports the following error:

ERROR: moveChunk commit failed: version is at <n>|<nn> instead of <N>|<NN>
ERROR: TERMINATING

When this happens, the primary member of the shard's replica set then terminates to protect data consistency. If a secondary member can access the config database, data on the shard becomes accessible again after an election.
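
When scanning logs for this failure, the two version pairs can be pulled out with a regular expression. This is a sketch built only from the message template above; real log lines may carry extra prefixes, and the function name `parse_commit_failure` and the sample line are hypothetical.

```python
import re
from typing import Optional

# Pattern derived from the message template; search() is used rather than
# match() so that timestamps or other prefixes on the line do not matter.
COMMIT_FAILED = re.compile(
    r"moveChunk commit failed: version is at (\d+)\|(\d+) instead of\s+(\d+)\|(\d+)"
)


def parse_commit_failure(line: str) -> Optional[dict[str, tuple[int, int]]]:
    """Return the reported and expected version pairs, or None if no match."""
    found = COMMIT_FAILED.search(line)
    if found is None:
        return None
    a1, a2, e1, e2 = map(int, found.groups())
    return {"actual": (a1, a2), "expected": (e1, e2)}


sample = "ERROR: moveChunk commit failed: version is at 5|0 instead of 6|1"
print(parse_commit_failure(sample))
# {'actual': (5, 0), 'expected': (6, 1)}
```

Recording both pairs alongside the time of the failure gives you concrete details to include when you report the problem.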

You will need to resolve the chunk migration failure independently. If you encounter this issue, ask the MongoDB Community or MongoDB Support for help.
