Handle Duplicate Data

When you embed related data in a single document, you may duplicate data between two collections. Duplicating data lets your application query related information about multiple entities in a single query while logically separating entities in your model.

About this Task

One concern with duplicating data is increased storage costs. However, the benefits of optimizing access patterns generally outweigh potential cost increases from storage.

Before you duplicate data, consider the following factors:

  • How often the duplicated data needs to be updated. Frequently updating duplicated data can cause heavy workloads and performance issues. However, the extra logic needed to handle infrequent updates is less costly than performing joins (lookups) on read operations.
  • The performance benefit for reads when data is duplicated. Duplicating data can remove the need to perform joins across multiple collections, which can improve application performance.

Example: Duplicate Data in an E-Commerce Schema

The following example shows how to duplicate data in an e-commerce application schema to improve data access and performance.

Steps

1. Switch to the eCommerce database

use eCommerce

2. Populate the database

Create the following collections in the eCommerce database:

Collection Name: customers
Description: Stores customer information such as name, email, and phone number.
Sample Document:

db.customers.insertOne( {
   customerId: 123,
   name: "Alexa Edwards",
   email: "a.edwards@randomEmail.com",
   phone: "202-555-0183"
} )

Collection Name: products
Description: Stores product information such as price, size, and material.
Sample Document:

db.products.insertOne( {
   productId: 456,
   product: "sweater",
   price: 30,
   size: "L",
   material: "silk",
   manufacturer: "Cool Clothes Co"
} )

Collection Name: orders
Description: Stores order information such as date and total price. Documents in the orders collection embed the corresponding products for that order in the lineItems field.
Sample Document:

db.orders.insertOne( {
   orderId: 789,
   customerId: 123,
   totalPrice: 45,
   date: ISODate("2023-05-22"),
   lineItems: [
      {
         productId: 456,
         product: "sweater",
         price: 30,
         size: "L"
      },
      {
         productId: 809,
         product: "t-shirt",
         price: 10,
         size: "M"
      },
      {
         productId: 910,
         product: "socks",
         price: 5,
         size: "S"
      }
   ]
} )

The following properties from the products collection are duplicated in the orders collection:

  • productId
  • product
  • price
  • size
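
The duplication above can be sketched as a small helper that copies only those four fields from a product document into a line item. The names LINE_ITEM_FIELDS and toLineItem are hypothetical and do not appear in the original example; the code is plain JavaScript, so it should also run in mongosh.

```javascript
// Fields copied from a products document into each order line item.
const LINE_ITEM_FIELDS = ["productId", "product", "price", "size"];

// Hypothetical helper: return a new object holding only the duplicated fields.
function toLineItem(productDoc) {
  const lineItem = {};
  for (const field of LINE_ITEM_FIELDS) {
    if (field in productDoc) {
      lineItem[field] = productDoc[field];
    }
  }
  return lineItem;
}

// Same values as the sample products document.
const sweater = {
  productId: 456,
  product: "sweater",
  price: 30,
  size: "L",
  material: "silk",
  manufacturer: "Cool Clothes Co"
};

const lineItem = toLineItem(sweater);
console.log(lineItem);
```

The material and manufacturer fields stay only in the products collection, so the embedded line item remains small.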

Benefits of Duplicating Data

When the application displays order information, it also displays that order's line items. If order and product information were stored only in separate collections, the application would need to perform a $lookup to join data from the two collections. $lookup operations can be expensive and slow down reads.
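
For comparison, here is a sketch of the read that a fully normalized model would need. It assumes a hypothetical variant of the orders collection in which each line item stores only a productId; the output field name productDetails is also an assumption.

```javascript
// Hypothetical normalized read: line items hold only product IDs,
// so product details must be joined in from the products collection.
const normalizedOrderPipeline = [
  { $match: { orderId: 789 } },
  {
    $lookup: {
      from: "products",
      localField: "lineItems.productId",
      foreignField: "productId",
      as: "productDetails"
    }
  }
];

// With the duplicated fields embedded, a filter on one collection is enough.
const embeddedOrderFilter = { orderId: 789 };

// In mongosh, against the eCommerce database, these would be used as:
//   db.orders.aggregate(normalizedOrderPipeline)
//   db.orders.findOne(embeddedOrderFilter)
```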

The orders collection duplicates only a subset of each product's fields, rather than embedding entire product documents, because the application needs only that subset when it displays orders. By embedding only the required fields, the application can store additional product details in the products collection without adding unnecessary bloat to the orders collection.

Example: Duplicate Data for Product Reviews

The following example uses the subset pattern to optimize access patterns for an online store.

Consider an application where, when a user views a product, the application displays the product's information and its five most recent reviews. The reviews are stored in both a products collection and a reviews collection.

When a new review is written, the following writes occur:

  • The review is inserted into the reviews collection.
  • The array of recent reviews in the products collection is updated with $pop and $push: $pop removes the oldest embedded review and $push adds the new one.
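
The two writes above can be sketched as follows. This sketch swaps the separate $pop and $push for a single $push with the $each, $position, and $slice modifiers, which inserts the new review at the front of the array and trims the array to five elements in one update. The helper name buildRecentReviewUpdate and the sample review values are hypothetical.

```javascript
const MAX_RECENT_REVIEWS = 5;

// Hypothetical helper: build the filter and update for the products collection.
function buildRecentReviewUpdate(review) {
  // The embedded copy omits productId; the parent product document already has it.
  const { productId, ...embeddedReview } = review;
  return {
    filter: { productId },
    update: {
      $push: {
        recentReviews: {
          $each: [embeddedReview],    // the new review
          $position: 0,               // newest first, as in the sample document
          $slice: MAX_RECENT_REVIEWS  // keep only the first five elements
        }
      }
    }
  };
}

// Placeholder review; mongosh users could write ISODate("2023-07-01") instead.
const newReview = {
  reviewId: 1011,
  productId: 123,
  author: "Sample Author",
  stars: 5,
  comment: "Sample comment",
  date: new Date("2023-07-01")
};

const { filter, update } = buildRecentReviewUpdate(newReview);

// In mongosh, against the productsAndReviews database:
//   db.reviews.insertOne(newReview)
//   db.products.updateOne(filter, update)
```

The two writes target different collections, so they are not atomic unless the application wraps them in a transaction.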

Steps

1. Switch to the productsAndReviews database

use productsAndReviews

2. Populate the database

Create the following collections in the productsAndReviews database:

Collection Name: products
Description: Stores product information. Documents in the products collection embed the five most recent product reviews in the recentReviews field.
Sample Document:

db.products.insertOne( {
   productId: 123,
   name: "laptop",
   price: 200,
   recentReviews: [
      {
         reviewId: 456,
         author: "Pat Simon",
         stars: 4,
         comment: "Great for schoolwork",
         date: ISODate("2023-06-29")
      },
      {
         reviewId: 789,
         author: "Edie Short",
         stars: 2,
         comment: "Not enough RAM",
         date: ISODate("2023-06-22")
      }
   ]
} )

Collection Name: reviews
Description: Stores all reviews for products (not only recent reviews). Documents in the reviews collection contain a productId field that indicates the product that the review pertains to.
Sample Document:

db.reviews.insertOne( {
   reviewId: 456,
   productId: 123,
   author: "Pat Simon",
   stars: 4,
   comment: "Great for schoolwork",
   date: ISODate("2023-06-29")
} )

Benefits of Duplicating Data

The application needs only one call to the database to return all the information it needs to display. If the data were stored entirely in separate collections, the application would need to join data from the products and reviews collections, which could cause performance issues.

Reviews are rarely updated, so storing duplicate data is not expensive, and keeping the data consistent between collections is not a challenge.
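
For the rare case where a review is edited, both copies need the same change. A minimal sketch, assuming a hypothetical helper named buildReviewEditWrites: the second write uses the positional $ operator, and it matches no document if the review is no longer among the embedded recent reviews.

```javascript
// Hypothetical helper: build one write per collection for an edited review.
function buildReviewEditWrites(productId, reviewId, changes) {
  // Prefix each changed field so it targets the matched array element.
  const embeddedChanges = {};
  for (const [field, value] of Object.entries(changes)) {
    embeddedChanges[`recentReviews.$.${field}`] = value;
  }
  return {
    reviews: {
      filter: { reviewId },
      update: { $set: { ...changes } }
    },
    products: {
      // Matching on the array element lets the positional $ operator find it.
      filter: { productId, "recentReviews.reviewId": reviewId },
      update: { $set: embeddedChanges }
    }
  };
}

// Example edit: change the star rating of review 456 on product 123.
const writes = buildReviewEditWrites(123, 456, { stars: 5 });

// In mongosh, against the productsAndReviews database:
//   db.reviews.updateOne(writes.reviews.filter, writes.reviews.update)
//   db.products.updateOne(writes.products.filter, writes.products.update)
```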

Learn More

To learn how to keep duplicate data consistent, see Data Consistency.