Database Manual / Data Modeling / Schema Design Patterns / Versioning

Slowly Changing Dimensions渐变维度

Slowly changing dimensions (SCDs) is a framework for managing and tracking changes to dimension data in a data warehouse over time. This framework refers to the dimensions as "slowly changing" because it assumes that the data SCDs cover changes with a low frequency, but without any apparent pattern in time. Use SCDs when the requirements for the data warehouse cover functionality to track and reproduce outputs based on historical states of data.缓慢变化维度(SCD)是一个框架,用于管理和跟踪数据仓库中维度数据随时间的变化。该框架将维度称为“缓慢变化”,因为它假设数据SCD覆盖的变化频率较低,但没有任何明显的时间模式。当数据仓库的要求涵盖基于历史数据状态跟踪和再现输出的功能时,使用SCD。

A common use case for SCDs is reporting. For example, in financial reporting systems, you need to explain the differences between the aggregated values in a report produced last month and those in the current version of the report from the data warehouse.SCD的一个常见用例是报告。例如,在财务报告系统中,您需要解释上个月生成的报告中的汇总值与数据仓库中当前版本的报告中汇总值之间的差异。

The different implementations of SCDs in SQL are referred to as "types." Types 0 and 1, the most basic types, only keep track of the original state of data or the current state of data, respectively. Type 2, the most commonly applied implementation, creates three new fields: validFrom, validTo, and an optional flag on the latest set of data, often called isValid or isEffective.SQL中SCD的不同实现称为“类型”。最基本的类型0和1分别只跟踪数据的原始状态或当前状态。类型2是最常用的实现,它创建了三个新字段:validFromvalidTo和最新数据集上的可选标志,通常称为isValidisEffective

SCD TypesSCD类型

SCD TypeSCD类型Description描述
Type 0Only keep original state and data cannot be changed.只保留原始状态,数据不能更改。
Type 1Only keep updated state and history cannot be stored.只能保持更新状态,不能存储历史记录。
Type 2Keep history in a new document.将历史记录保存在新文档中。
Type 3Keep history in new fields in the same document.将历史记录保存在同一文档的新字段中。
Type 4Keep history in a separate collection.将历史记录保存在单独的集合中。
Type 6Combination of Type 2 and Type 3.2型和3型的组合。

SCDs in MongoDBMongoDB中的SCD

You can apply the SCD framework to MongoDB in the same way you apply it to a relational database. The concept of slowly changing dimensions applies on a per-document basis in the chosen and optimized data model for the specific use case.您可以将SCD框架应用于MongoDB,就像将其应用于关系数据库一样。在为特定用例选择和优化的数据模型中,缓慢变化维度的概念适用于每个文档。

Example示例

Consider a collection called prices that stores the prices of a set of items. You need to keep track of the changes of the price of an item over time in order to be able to process returns of an item, as the money refunded must match the price of the item at the time of purchase. Each document in the collection has an item and price field:

db.prices.insertMany( [
{ 'item': 'shorts', 'price': 10 },
{ 'item': 't-shirt', 'price': 2 },
{ 'item': 'pants', 'price': 5 },
] )

Suppose the price of pants changes from 5 to 7. To track this price change, assume the default values for the necessary data fields for SCD Type 2. The default value for validFrom is 01.01.1900, validTo is 01.01.9999, and isValid is true. To change the price field in the object with 'item': 'pants', insert a new document to represent the current state of the pants, and update the previously valid document to no longer be valid:

let now = new Date();

db.prices.updateOne(
{
'item': 'pants',
"$or": [
{ "isValid": false },
{ "isValid": null }
]
},
{ "$set":
{
"validFrom": new Date("1900-01-01"),
"validTo": now,
"isValid": false
}
}
);

db.prices.insertOne(
{
'item': 'pants',
'price': 7,
"validFrom": now,
"validTo": new Date("9999-01-01"),
"isValid": true
}
);

To avoid breaking the chain of validity, ensure that both of the above database operation occur at the same timestamp. Depending on the requirements of the application, you can wrap the two above commands into a transaction to ensure MongoDB always applies both changes together. For more information, see Transactions.

The following operation demonstrates how to query the latest price of the document containing the pants item:

db.prices.find( { 'item': 'pants', 'isValid': true } );

To query for the price of the document containing the pants item at a specific point in time, use the following operation:

let time = new Date("2022-11-16T13:00:00");
db.prices.find( {
'item': 'pants',
'validFrom': { '$lte': time },
'validTo': { '$gt': time }
} );

Tracking Changes in Few Fields

If you only need to track changes over time to few fields in a document, you can use SCD type 3 by embedding the history of a field as an array in the first document.

For example, the following aggregation pipeline updates the price in the document representing pants to 7 and stores the previous value of the price with a timestamp of when the previous price became invalid in an array called priceHistory:

db.prices.aggregate( [
{ $match: { 'item': 'pants' } },
{ $addFields:
{ price: 7, priceHistory:
{ $concatArrays:
[
{ $ifNull: [ '$priceHistory', [] ] },
[ { price: "$price", time: now } ]
]
}
}
},
{ $merge:
{
into: "prices",
on: "_id",
whenMatched: "merge",
whenNotMatched: "fail"
}
}
] )

This solution can become slow or inefficient if your array size gets too large. To avoid large arrays, you can use the outlier or the bucket patterns to design your schema.

Outlook Data FederationOutlook数据联合

The above examples focus on a strict and accurate representation of document field changes. Sometimes, you might have less strict requirements on showing historical data. For example, you might have an application that only requires access to the current state of the data most of the time, but you must run some analytical queries on the full history of data.上述示例侧重于严格准确地表示文档字段更改。有时,您可能对显示历史数据的要求不那么严格。例如,您可能有一个应用程序,它在大多数情况下只需要访问数据的当前状态,但您必须对完整的数据历史运行一些分析查询。

In this case, you can store the current version of the data in one collection and the historical changes in another collection. You can then remove the historical collection from the active MongoDB cluster using the MongoDB Atlas Federated Database functionalities, and in the fully managed version using the Online Archive.在这种情况下,您可以将数据的当前版本存储在一个集合中,将历史更改存储在另一个集合。然后,您可以使用MongoDB Atlas Federated Database功能从活动的MongoDB集群中删除历史集合,并在完全托管的版本中使用Online Archive删除历史集合。

Other Use Cases其他用例

While slowly changing dimensions is helpful for data warehousing, you can also use the SCD framework in event-driven applications. If you have infrequent events in different types of categories, it is expensive to find the latest event per category, as the process could require grouping or sorting your data in order to find the current state.虽然缓慢改变维度有助于数据仓库,但您也可以在事件驱动的应用程序中使用SCD框架。如果您在不同类型的类别中有不频繁的事件,那么为每个类别查找最新事件的成本很高,因为该过程可能需要对数据进行分组或排序才能找到当前状态。

In the case of infrequent events, you can amend the data model by adding a field to store the time of the next event, in addition to the event time per document. The new date field ensures that if you execute a search for a specific point in time, you can easily and efficiently retrieve the respective event you are searching for.在事件不频繁的情况下,除了每个文档的事件时间外,您还可以通过添加一个字段来存储下一个事件的时间,从而修改数据模型。新的日期字段可确保,如果您对特定时间点执行搜索,则可以轻松有效地检索您正在搜索的相应事件。