Group Data with the Subset Pattern

MongoDB keeps frequently accessed data, referred to as the working set, in RAM. When the working set of data and indexes grows beyond the physical RAM allotted, performance degrades because disk accesses start to occur and data is no longer served from RAM.

To solve this problem, you can shard your collection. However, sharding can create additional costs and complexities that your application may not be ready for. Rather than sharding your collection, you can reduce the size of your working set by using the subset pattern.

The subset pattern is a data modeling technique for documents that contain a large array of items when you frequently need to access only a small subset of those items. In this case, the document size can often cause the working set to exceed the machine's RAM capacity. The subset pattern helps optimize performance by reducing the amount of data that needs to be read from the database for common queries.

About this Task

Consider an e-commerce site that has a list of reviews for a product, stored in a collection called products. The e-commerce site inserts documents with the following schema into the products collection:

db.products.insertOne(
   {
      _id: ObjectId("507f1f77bcf86cd99338452"),
      name: "Super Widget",
      description: "This is the most useful item in your toolbox.",
      price: { value: Decimal128("119.99"), currency: "USD" },
      reviews: [
         {
            review_id: 786,
            review_author: "Kristina",
            review_text: "This is indeed an amazing widget.",
            published_date: ISODate("2019-02-18")
         },
         {
            review_id: 785,
            review_author: "Trina",
            review_text: "Very nice product, slow shipping.",
            published_date: ISODate("2019-02-17")
         },
         [...],
         {
            review_id: 1,
            review_author: "Hans",
            review_text: "Meh, it's ok.",
            published_date: ISODate("2017-12-06")
         }
      ]
   }
)

When accessing a product's data, you likely only need the most recent reviews. The following procedure demonstrates how to apply the subset pattern to the above schema.

Steps

1

Identify the subset of frequently accessed data.

In an array field containing information about a document, determine the subset of information you need to access the most. For example, in the products collection, you might only need to access the ten most recent reviews.
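
As a rough illustration of this step, the following plain JavaScript sketch picks the most recent reviews out of an in-memory array. The helper name selectRecentReviews and the sample data are hypothetical, and the sketch assumes that each review carries a published_date that is a JavaScript Date.

```javascript
// Hypothetical helper: returns the `limit` most recent reviews, newest first.
// Assumes each review has a `published_date` that is a Date.
function selectRecentReviews(reviews, limit = 10) {
  return [...reviews] // copy so the input array is not mutated
    .sort((a, b) => b.published_date - a.published_date)
    .slice(0, limit);
}

// Illustrative sample data only.
const sample = [
  { review_id: 1, published_date: new Date("2017-12-06") },
  { review_id: 786, published_date: new Date("2019-02-18") },
  { review_id: 785, published_date: new Date("2019-02-17") },
];

console.log(selectRecentReviews(sample, 2).map((r) => r.review_id));
// prints [ 786, 785 ]
```

Which field defines "most accessed" depends on the application; recency is used here only because it matches the reviews example.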

2

Separate the subset into different collections.

Instead of storing all the reviews with the product, split your collection into two collections: one for your most frequently accessed data, and one for your less frequently accessed data. This allows for quick access to the most relevant data without having to load the entire array.

The first collection, the products collection, contains the most frequently used data, such as current reviews:

db.products.insertOne(
   {
      _id: ObjectId("507f1f77bcf86cd99338452"),
      name: "Super Widget",
      description: "This is the most useful item in your toolbox.",
      price: { value: Decimal128("119.99"), currency: "USD" },
      reviews: [
         {
            review_id: 786,
            review_author: "Kristina",
            review_text: "This is indeed an amazing widget.",
            published_date: ISODate("2019-02-18")
         },
         [...],
         {
            review_id: 776,
            review_author: "Pablo",
            review_text: "Amazing!",
            published_date: ISODate("2019-02-15")
         }
      ]
   }
)

The products collection contains only the ten most recent reviews. This reduces the working set because only a portion, or subset, of the overall data is loaded.
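
One way to keep the embedded array at ten entries as new reviews arrive is the $push update operator with its $each, $sort, and $slice modifiers. The sketch below only builds the update document; the helper name buildAddReviewUpdate, the sample review, and the productId placeholder are hypothetical. The full review would typically also be inserted into the second collection.

```javascript
// Hypothetical helper: builds an update document that adds one review to the
// embedded `reviews` array and trims the array to the newest `limit` entries.
function buildAddReviewUpdate(review, limit = 10) {
  return {
    $push: {
      reviews: {
        $each: [review],               // the new review
        $sort: { published_date: -1 }, // order newest first
        $slice: limit,                 // keep only the first `limit` elements
      },
    },
  };
}

// Illustrative review; not taken from the example data set.
const newReview = {
  review_id: 787,
  review_author: "Example Author",
  review_text: "Example text.",
  published_date: new Date("2019-02-19"),
};

const update = buildAddReviewUpdate(newReview);
console.log(JSON.stringify(update, null, 2));

// In a connected shell the update could be applied as follows
// (not executed here; `productId` is a placeholder):
// db.products.updateOne({ _id: productId }, update);
```

Building the update as a plain object keeps the sketch runnable without a database connection.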

The second collection, the reviews collection, contains less frequently used data, such as old reviews:

db.reviews.insertMany( [
   {
      review_id: 786,
      review_author: "Kristina",
      review_text: "This is indeed an amazing widget.",
      product_id: ObjectId("507f1f77bcf86cd99338452"),
      published_date: ISODate("2019-02-18")
   },
   {
      review_id: 785,
      review_author: "Trina",
      review_text: "Very nice product, slow shipping.",
      product_id: ObjectId("507f1f77bcf86cd99338452"),
      published_date: ISODate("2019-02-17")
   },
   [...],
   {
      review_id: 1,
      review_author: "Hans",
      review_text: "Meh, it's ok.",
      product_id: ObjectId("507f1f77bcf86cd99338452"),
      published_date: ISODate("2017-12-06")
   }
] )

You can access the reviews collection whenever you need to see additional reviews. When considering where to split your data, store the most used fields of your documents in your main collection and the less frequently used data in a new collection.
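
The split described in this step can be sketched in plain JavaScript as a transformation of one original product document into the two shapes shown above. The helper name splitProduct, the string _id, and the trimmed sample data are hypothetical; a real migration would read from and write to the collections instead of working in memory.

```javascript
// Hypothetical helper: splits one product document with a full `reviews` array
// into (a) a product document holding only the newest `limit` reviews and
// (b) standalone review documents that reference the product by `product_id`.
function splitProduct(product, limit = 10) {
  const sorted = [...product.reviews].sort(
    (a, b) => b.published_date - a.published_date
  );
  const productDoc = { ...product, reviews: sorted.slice(0, limit) };
  const reviewDocs = sorted.map((r) => ({ ...r, product_id: product._id }));
  return { productDoc, reviewDocs };
}

const product = {
  _id: "product-1", // placeholder; a real document would use an ObjectId
  name: "Super Widget",
  reviews: [
    { review_id: 1, review_author: "Hans", published_date: new Date("2017-12-06") },
    { review_id: 785, review_author: "Trina", published_date: new Date("2019-02-17") },
    { review_id: 786, review_author: "Kristina", published_date: new Date("2019-02-18") },
  ],
};

const { productDoc, reviewDocs } = splitProduct(product, 2);
console.log(productDoc.reviews.map((r) => r.review_id)); // prints [ 786, 785 ]
console.log(reviewDocs.length);                          // prints 3
```

As in the example documents, every review is written to the reviews collection, so the recent reviews exist in both places.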

Results

By using smaller documents with more frequently accessed data, you reduce the overall size of the working set. This allows for shorter disk access times for the most frequently used information that your application needs.

Note

The subset pattern requires you to manage two collections rather than one, and to query both collections when you need comprehensive information about a document rather than only the subset.
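
When the comprehensive view is needed, one option is an aggregation pipeline with a $lookup stage that joins a product to its documents in the reviews collection. The sketch below only builds the pipeline; the helper name buildFullProductPipeline, the all_reviews output field, and the string identifier are hypothetical.

```javascript
// Hypothetical helper: builds an aggregation pipeline that returns one product
// together with every review stored for it in the `reviews` collection.
function buildFullProductPipeline(productId) {
  return [
    { $match: { _id: productId } },
    {
      $lookup: {
        from: "reviews",            // collection holding the full review history
        localField: "_id",          // the product's _id ...
        foreignField: "product_id", // ... matched against each review's product_id
        as: "all_reviews",          // assumed name for the joined array
      },
    },
  ];
}

const pipeline = buildFullProductPipeline("product-1"); // placeholder identifier
console.log(JSON.stringify(pipeline, null, 2));

// In a connected shell the pipeline could be run as follows (not executed here):
// db.products.aggregate(pipeline);
```

Two separate queries, one per collection, are an equally valid choice when the application only occasionally needs older reviews.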