Database Manual / Time Series

About Time Series Data

Internally, MongoDB optimizes time series data by grouping documents in a time series collection based on common metaField values. Choosing a useful metaField significantly improves storage density and query performance. For more information, see metaFields.

Properties of Time Series Data

Time series data has several properties that differentiate it from other data formats:

  • Documents arrive in order, requiring frequent insert operations to append them.
  • Update operations are rare, since each document represents a single point in time.
  • Delete operations are rare if your application benefits from having an extensive historical record.
  • Data is indexed by time and an identifier, such as a stock ticker, that identifies the unique time series it belongs to.
  • Data is high volume, since each individual time series requires a large, constantly growing number of documents.
  • In time series collections, documents do not require a unique _id field because MongoDB does not create an index on the _id field.

To account for these factors, MongoDB uses a specialized columnar format that groups documents from each time series together. This has the following benefits:

  • Reduced storage and index sizes
  • Improved query efficiency
  • Reduced I/O for read operations
  • Increased usage of the WiredTiger in-memory cache, further improving query speed
  • Reduced complexity for working with time series data

Comparing Time Series Collections to Regular Collections

In a regular collection, data is stored sequentially as blocks on disk, optimizing write speed. However, it requires one index for each data point, which quickly grows very large. It also requires a second index that contains the time series identifier and the timestamps themselves, so that users can query for a single series. To read this data, MongoDB has to process all of the database and disk blocks that contain it, even if a block only contains a single relevant document.

This model is optimized for CRUD operations and frequent updates. A bank account balance only needs to reflect the current state, so each account holder's document is updated as that information changes.

Compare this to a time series collection. Time series collections write data in order, meaning recent transactions can be kept in memory for much faster retrieval. Since they are written in sequence, documents are stored together, so there's no need to read every disk block after a document is no longer in memory. Data is indexed by the metaField, leading to much smaller indexes.

How Bucketing Works

When you create a time series collection, MongoDB automatically creates groups, or buckets, of documents. MongoDB groups documents that have both:

  • An identical metaField value, which should uniquely identify a time series. If a metaField is an object or array, MongoDB groups documents only if all object fields or array elements match.
  • timeField values that are close together. The time series collection's granularity, bucketMaxSpanSeconds, and bucketRoundingSeconds parameters control the time span covered by each bucket. For more information, see Set Granularity for Time Series Data.
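The exact-match rule for metaField values can be sketched as follows. This is an illustrative Python model, not MongoDB's internal code; Python's == compares dicts and lists recursively, which mirrors the requirement that every object field and array element must match.

```python
def same_series(meta_a, meta_b):
    # Illustrative stand-in for MongoDB's exact-match grouping rule:
    # documents share a bucket only when their metaField values are equal.
    # Python's == compares dicts and lists recursively, so nested object
    # fields and array elements must all match.
    return meta_a == meta_b

a = {"sensorId": 5578, "type": "temperature"}
b = {"sensorId": 5578, "type": "temperature"}
c = {"sensorId": 5578, "type": "humidity"}

print(same_series(a, b))            # True: every field matches
print(same_series(a, c))            # False: one field differs
print(same_series([1, 2], [2, 1]))  # False: array elements are order-sensitive
```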

For example, with a granularity of seconds, MongoDB buckets documents within the same hour. If a bucket contains a document with a metaField value of sensorA and a timeField value of 2024-08-01T18:23:21Z, an incoming document with a metaField of sensorB goes into a separate bucket regardless of time. An incoming document from sensorA goes into the same bucket only if its timeField is between 2024-08-01T18:00:00Z and 2024-08-01T18:59:59Z.

If a document with a time of 2023-03-27T16:24:35Z does not fit an existing bucket, MongoDB creates a new bucket with a minimum time of 2023-03-27T16:00:00Z and a maximum time of 2023-03-27T16:59:59Z.
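The rounding that produces the bucket's minimum time can be sketched like this. The 3600-second rounding value is an assumption chosen to reproduce the hour-aligned example above; the actual value is controlled by the collection's granularity or bucketRoundingSeconds setting.

```python
from datetime import datetime, timezone

def bucket_min_time(ts, rounding_seconds):
    # Round the document's timestamp down to the nearest multiple of
    # rounding_seconds since the Unix epoch; this becomes the new
    # bucket's minimum time.
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % rounding_seconds, tz=timezone.utc)

doc_time = datetime(2023, 3, 27, 16, 24, 35, tzinfo=timezone.utc)
print(bucket_min_time(doc_time, 3600).isoformat())
# 2023-03-27T16:00:00+00:00
```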

Note

You can modify a time series collection's granularity, but only to change from finer to coarser measurements, such as extending bucket coverage from minutes to hours. This updates the collection's view definition, but doesn't change how data is stored across existing buckets.

How the metaField Affects Bucketing

Because metaField values must match exactly for documents to be grouped together, the number of buckets in a time series collection depends on the number of unique metaField values. Collections with fine-grained or changing metaField values generate many sparsely packed, short-lived buckets. This leads to decreased storage and query efficiency.

For example, in the following document, metadata is a good choice of metaField since it makes it easy to query data from a given weather sensor. Using these fields, MongoDB buckets readings from a single sensor together.

{
  timestamp: ISODate("2021-05-18T00:00:00.000Z"),
  metadata: { sensorId: 5578, type: 'temperature' },
  temp: 12,
  _id: ObjectId("62f11bbf1e52f124b84479ad")
}
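A collection that buckets documents like the one above could be declared with options along these lines. This is a sketch using pymongo naming conventions; the collection name "weather" is hypothetical, and the commented-out call requires a live MongoDB connection.

```python
# Options mirroring the sample document: "timestamp" is the timeField and
# "metadata" is the metaField; "seconds" granularity suits frequent readings.
timeseries_options = {
    "timeField": "timestamp",
    "metaField": "metadata",
    "granularity": "seconds",
}

# With pymongo and a running server this would be (not executed here):
# db.create_collection("weather", timeseries=timeseries_options)
print(sorted(timeseries_options))
```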

The Bucket Catalog

The bucket catalog is a specialized in-memory cache in WiredTiger. It tracks buckets to minimize latency and coordinate concurrent writes.

For each open bucket, the catalog maintains information such as the metaField, active writers, covered time span, number of documents, size, and recent operations. Because MongoDB creates separate buckets for documents with different metaField values, multiple buckets are typically open at the same time.

To avoid inconsistencies caused by race conditions, buckets may be closed and removed from the bucket catalog when a conflicting operation is executed. Restarting mongod closes all buckets and resets the bucket catalog.

Creation

  • MongoDB creates a new bucket if there isn't a suitable one for an incoming document. This occurs when any of the following are true:

    • The document metaField doesn't match any active buckets.
    • The document timestamp is outside of the range of all active buckets.
    • The document exceeds the remaining size or document limit of all active buckets.

    The starting timestamp of a new bucket is rounded down based on the collection's granularity. This handles cases where documents with out-of-order timestamps arrive in close succession.
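The three creation triggers above can be modeled as a single check. This is an illustrative sketch, not MongoDB's internals; the 1000-document and 125KiB defaults mirror the documented per-bucket limits, and the simplified bucket dict is a hypothetical representation.

```python
def needs_new_bucket(doc, bucket, max_docs=1000, max_bytes=125 * 1024):
    # Illustrative model of the three creation triggers described above.
    if doc["meta"] != bucket["meta"]:
        return True  # metaField doesn't match the active bucket
    if not (bucket["min_ts"] <= doc["ts"] <= bucket["max_ts"]):
        return True  # timestamp outside the bucket's covered span
    if bucket["count"] >= max_docs or bucket["bytes"] + doc["bytes"] > max_bytes:
        return True  # document or size limit would be exceeded
    return False

# Hypothetical active bucket covering one hour of sensorA readings:
bucket = {"meta": "sensorA", "min_ts": 0, "max_ts": 3600, "count": 10, "bytes": 2048}

print(needs_new_bucket({"meta": "sensorB", "ts": 100, "bytes": 64}, bucket))  # True
print(needs_new_bucket({"meta": "sensorA", "ts": 100, "bytes": 64}, bucket))  # False
```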

Closure

MongoDB closes a bucket under any of the following circumstances:

  • Time has moved forward or backward past the covered time span, as indicated by an incoming document timestamp that falls outside of the bucket's bounds. These bounds are determined by the collection's granularity setting.
  • The bucket has hit the document limit (default 1000).
  • The bucket has exceeded its storage size limit. This happens when:

    • The size exceeds the allowed maximum (default 125KiB).
    • The number of documents is below a minimum number (default 10) and the size is below 12MiB.

      This is a fixed internal limit that optimizes performance when data consists of fewer, larger documents.

    • The set of active buckets doesn't fit within the allowed storage engine cache size. You can review this information using the collStats database command.
  • The bucket catalog exceeds its allowed total memory allocation (by default, 2.5% of available system memory).
  • A conflicting operation, such as a chunk migration or update, changes a bucket's on-disk state.
  • mongod restarts. This closes all buckets.

Deletion

MongoDB deletes a bucket when:

  • Its maximum allowed timestamp is less than the current time minus the collection's expireAfterSeconds parameter. This is equivalent to a TTL collection's time to live.
  • A delete or db.collection.deleteMany() command deletes the last document in the bucket.
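The TTL-style expiry rule is simple to state in code. This is a sketch; the expireAfterSeconds value of 86400 (one day) and the bucket timestamps are arbitrary example values.

```python
from datetime import datetime, timedelta, timezone

def bucket_expired(bucket_max_ts, now, expire_after_seconds):
    # A bucket qualifies for deletion once its maximum allowed timestamp
    # falls behind the current time by more than expireAfterSeconds.
    return bucket_max_ts < now - timedelta(seconds=expire_after_seconds)

now = datetime(2024, 8, 1, 12, 0, tzinfo=timezone.utc)
old_bucket_max = datetime(2024, 7, 1, 12, 0, tzinfo=timezone.utc)
fresh_bucket_max = datetime(2024, 8, 1, 11, 0, tzinfo=timezone.utc)

print(bucket_expired(old_bucket_max, now, 86400))    # True: over a day old
print(bucket_expired(fresh_bucket_max, now, 86400))  # False: within the window
```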