$strLenBytes (expression operator)

Definition

$strLenBytes

Returns the number of UTF-8 encoded bytes in the specified string.

$strLenBytes has the following operator expression syntax:

{ $strLenBytes: <string expression> }

The argument can be any valid expression as long as it resolves to a string. For more information on expressions, see Expressions.

If the argument resolves to a value of null or refers to a missing field, $strLenBytes returns an error.
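Because a null or missing argument raises an error, a pipeline that may meet such documents can guard the argument before measuring it. The following is a minimal sketch, assuming an empty string is an acceptable stand-in for absent values. The food collection and name field are taken from the example later on this page, and safeLengthPipeline is a hypothetical name.

```javascript
// Hypothetical pipeline: substitute "" when "name" is null or missing,
// so that $strLenBytes always receives a string.
const safeLengthPipeline = [
  {
    $project: {
      name: 1,
      length: { $strLenBytes: { $ifNull: ["$name", ""] } }
    }
  }
];

// In mongosh, this could be run as: db.food.aggregate(safeLengthPipeline)
```

With this guard, documents without a name report a length of 0 instead of failing the whole aggregation. A name that holds a non-string value (a number, for instance) would still need separate handling.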

Behavior

The $strLenBytes operator counts the number of UTF-8 encoded bytes in a string, where each character may use between one and four bytes.

For example, US-ASCII characters are encoded using one byte. Characters with diacritic markings and additional Latin alphabetical characters (Latin characters outside of the English alphabet) are encoded using two bytes. Chinese, Japanese and Korean characters typically require three bytes, and characters in other Unicode planes (such as emoji and some mathematical symbols) require four bytes.
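The byte counts described above can be checked outside the database. The following is a rough client-side sketch in plain JavaScript, using the standard TextEncoder (which always produces UTF-8). It approximates what $strLenBytes reports and is not the server's implementation. The helper names utf8ByteLength and bytesPerCharacter are hypothetical.

```javascript
// Client-side approximation of $strLenBytes; not the server's implementation.
const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Total number of UTF-8 bytes in a string.
function utf8ByteLength(str) {
  return encoder.encode(str).length;
}

// Byte count for each character. Array.from iterates by code point,
// so characters outside the Basic Multilingual Plane are kept whole.
function bytesPerCharacter(str) {
  return Array.from(str, (ch) => ({ char: ch, bytes: utf8ByteLength(ch) }));
}

console.log(utf8ByteLength("abcde")); // 5
console.log(utf8ByteLength("寿司"));   // 6
console.log(bytesPerCharacter("$€λG").map((c) => c.bytes)); // [ 1, 3, 2, 1 ]
```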

The $strLenBytes operator differs from the $strLenCP operator, which counts the code points in the specified string regardless of how many bytes each character uses.
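To make the distinction concrete, the sketch below computes both measures for the same string in plain JavaScript. It is an illustration under stated assumptions rather than server code: the byte count stands in for $strLenBytes, the code point count stands in for $strLenCP, and the function name lengths is hypothetical.

```javascript
// Plain JavaScript stand-ins for the two operators; not server code.
function lengths(str) {
  return {
    bytes: new TextEncoder().encode(str).length, // analogous to $strLenBytes
    codePoints: Array.from(str).length           // analogous to $strLenCP
  };
}

console.log(lengths("caf\u00e9t\u00e9ria")); // { bytes: 11, codePoints: 9 }
console.log(lengths("\u5bff\u53f8"));        // { bytes: 6, codePoints: 2 }
```

The first string is written with escapes to make clear that é is the single precomposed code point U+00E9. A decomposed form (e followed by a combining accent) would yield different counts for both measures.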

Example                              Results   Notes
{ $strLenBytes: "abcde" }            5         Each character is encoded using one byte.
{ $strLenBytes: "Hello World!" }     12        Each character is encoded using one byte.
{ $strLenBytes: "cafeteria" }        9         Each character is encoded using one byte.
{ $strLenBytes: "cafétéria" }        11        é is encoded using two bytes.
{ $strLenBytes: "" }                 0         Empty strings return 0.
{ $strLenBytes: "$€λG" }             7         € is encoded using three bytes. λ is encoded using two bytes.
{ $strLenBytes: "寿司" }              6         Each character is encoded using three bytes.

Example

Single-Byte and Multibyte Character Set

Create a food collection with the following documents:

db.food.insertMany(
   [
      { "_id" : 1, "name" : "apple" },
      { "_id" : 2, "name" : "banana" },
      { "_id" : 3, "name" : "éclair" },
      { "_id" : 4, "name" : "hamburger" },
      { "_id" : 5, "name" : "jalapeño" },
      { "_id" : 6, "name" : "pizza" },
      { "_id" : 7, "name" : "tacos" },
      { "_id" : 8, "name" : "寿司" }
   ]
)

The following operation uses the $strLenBytes operator to calculate the length of each name value:

db.food.aggregate(
   [
      {
         $project: {
            "name": 1,
            "length": { $strLenBytes: "$name" }
         }
      }
   ]
)

The operation returns the following results:

{ "_id" : 1, "name" : "apple", "length" : 5 }
{ "_id" : 2, "name" : "banana", "length" : 6 }
{ "_id" : 3, "name" : "éclair", "length" : 7 }
{ "_id" : 4, "name" : "hamburger", "length" : 9 }
{ "_id" : 5, "name" : "jalapeño", "length" : 9 }
{ "_id" : 6, "name" : "pizza", "length" : 5 }
{ "_id" : 7, "name" : "tacos", "length" : 5 }
{ "_id" : 8, "name" : "寿司", "length" : 6 }

The documents with _id: 3 and _id: 5 each contain a diacritic character (é and ñ respectively) that requires two bytes to encode. The document with _id: 8 contains two Japanese characters that are encoded using three bytes each. This makes the length greater than the number of characters in name for the documents with _id: 3, _id: 5 and _id: 8.
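The same observation can be turned into a query that selects only the names containing multibyte characters. The following is a sketch, assuming the food collection above; multibyteNamesPipeline is a hypothetical name. It compares $strLenBytes with $strLenCP inside $expr and keeps documents where the byte count is larger.

```javascript
// Hypothetical pipeline: keep only documents whose name needs more bytes
// than it has code points, i.e. names with at least one multibyte character.
const multibyteNamesPipeline = [
  {
    $match: {
      $expr: { $gt: [{ $strLenBytes: "$name" }, { $strLenCP: "$name" }] }
    }
  },
  { $project: { name: 1 } }
];

// In mongosh, this could be run as: db.food.aggregate(multibyteNamesPipeline)
```

For the sample documents, this should match only _id: 3, _id: 5 and _id: 8, since every other name consists entirely of single-byte characters.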