Mask Sensitive Fields
Minimum MongoDB Version: 4.4 (due to use of $rand
operator)
Scenario情形
You want to perform irreversible masking on the sensitive fields of a collection of credit card payments, ready to provide the output data set to a 3rd party for analysis, without exposing sensitive information to the 3rd party. 您希望对信用卡付款集合的敏感字段执行不可逆的屏蔽,准备将输出数据集提供给第三方进行分析,而不会向第三方暴露敏感信息。The specific changes that you need to make to the payments' fields are:您需要对付款字段进行的具体更改包括:
Partially obfuscate the card holder's name部分混淆持卡人的姓名Obfuscate the first 12 digits of the card's number, retaining only the final 4 digits混淆卡号的前12位,只保留最后4位Adjust the card's expiry date-time by adding or subtracting a random amount up to a maximum of 30 days (~1 month)通过随机添加或减去最多30天(1个月)的金额来调整卡的到期日时间Replace the card's 3 digit security code with a random set of 3 digits用一组随机的3位数字替换卡的3位安全代码Adjust the transaction's amount by adding or subtracting a random amount up to a maximum of 10% of the original amount通过添加或减去随机金额来调整事务金额,最多不超过原始金额的10%Change the对于大约20%的记录,将reported
field's boolean value to the opposite value for roughly 20% of the recordsreported
字段的布尔值更改为相反的值If the embedded如果嵌入的customer_info
sub-document'scategory
field is set to RESTRICTED, exclude the wholecustomer_info
sub-documentcustomer_info
子文档的category
字段设置为RESTRICTED,则排除整个customer_info
个子文档
Sample Data Population样本数据总体
Drop any old version of the database (if it exists) and then populate a new 删除数据库的任何旧版本(如果存在),然后用包含敏感数据的2个信用卡付款文档填充新的payments
collection with 2 credit card payment documents, containing sensitive data:payments
集合:
db = db.getSiblingDB("book-mask-sensitive-fields");
db.dropDatabase();
// Insert records into the payments collection将记录插入付款集合
db.payments.insertMany([
{
"card_name": "Mrs. Jane A. Doe",
"card_num": "1234567890123456",
"card_expiry": ISODate("2023-08-31T23:59:59Z"),
"card_sec_code": "123",
"card_type": "CREDIT",
"transaction_id": "eb1bd77836e8713656d9bf2debba8900",
"transaction_date": ISODate("2021-01-13T09:32:07Z"),
"transaction_amount": NumberDecimal("501.98"),
"reported": false,
"customer_info": {
"category": "RESTRICTED",
"rating": 89,
"risk": 3,
},
},
{
"card_name": "Jim Smith",
"card_num": "9876543210987654",
"card_expiry": ISODate("2022-12-31T23:59:59Z"),
"card_sec_code": "987",
"card_type": "DEBIT",
"transaction_id": "634c416a6fbcf060bb0ba90c4ad94f60",
"transaction_date": ISODate("2020-11-24T19:25:57Z"),
"transaction_amount": NumberDecimal("64.01"),
"reported": true,
"customer_info": {
"category": "NORMAL",
"rating": 78,
"risk": 55,
},
},
]);
Aggregation Pipeline聚合管道
Define a pipeline ready to perform the aggregation:定义准备执行聚合的管道:
var pipeline = [
// Replace a subset of fields with new values用新值替换字段子集
{"$set": {
// Extract the last word from the name ,从名称中提取最后一个单词, eg: 'Doe' from 'Mrs. Jane A. Doe'
"card_name": {"$regexFind": {"input": "$card_name", "regex": /(\S+)$/}},
// Mask card num 1st part retaining last 4 chars,掩码卡编号保留最后4个字符的第一部分, eg: '1234567890123456' -> 'XXXXXXXXXXXX3456'
"card_num": {"$concat": [
"XXXXXXXXXXXX",
{"$substrCP": ["$card_num", 12, 4]},
]},
// Add/subtract a random time amount of a maximum of 30 days (~1 month) each-way每次最多加/减30天(1个月)的随机时间量
"card_expiry": {"$add": [
"$card_expiry",
{"$floor": {"$multiply": [{"$subtract": [{"$rand": {}}, 0.5]}, 2*30*24*60*60*1000]}},
]},
// Replace each digit with random digit,用随机数字替换每个数字, eg: '133' -> '472'
"card_sec_code": {"$concat": [
{"$toString": {"$floor": {"$multiply": [{"$rand": {}}, 10]}}},
{"$toString": {"$floor": {"$multiply": [{"$rand": {}}, 10]}}},
{"$toString": {"$floor": {"$multiply": [{"$rand": {}}, 10]}}},
]},
// Add/subtract a random percent of the amount's value up to 10% maximum each-way每次加/减金额的随机百分比,最高可达10%
"transaction_amount": {"$add": [
"$transaction_amount",
{"$multiply": [{"$subtract": [{"$rand": {}}, 0.5]}, 0.2, "$transaction_amount"]},
]},
// Retain field's bool value 80% of time on average, setting to the opposite value 20% of time平均80%的时间保留字段的布尔值,20%的时间设置为相反的值
"reported": {"$cond": {
"if": {"$lte": [{"$rand": {}}, 0.8]},
"then": "$reported",
"else": {"$not": ["$reported"]},
}},
// Exclude sub-doc if the sub-doc's category field's value is 'RESTRICTED'如果子单据的类别字段的值为“RESTRICTED”,则排除子单据
"customer_info": {"$cond": {
"if": {"$eq": ["$customer_info.category", "RESTRICTED"]},
"then": "$$REMOVE",
"else": "$customer_info",
}},
// Mark _id field to excluded from results将_id
字段标记为从结果中排除
"_id": "$$REMOVE",
}},
// Take regex matched last word from the card name and prefix it with hardcoded value从卡片名称中提取regex匹配的最后一个单词,并在其前面加上硬编码值
{"$set": {
"card_name": {"$concat": ["Mx. Xxx ", {"$ifNull": ["$card_name.match", "Anonymous"]}]},
}},
];
Execution执行
Execute the aggregation using the defined pipeline and also view its explain plan:使用定义的管道执行聚合,并查看其解释计划:
db.payments.aggregate(pipeline);
db.payments.explain("executionStats").aggregate(pipeline);
Expected Results预期结果
Two documents should be returned, corresponding to the original two source documents, but this time with many of their fields redacted and obfuscated, plus the 应返回两个文档,对应于原始的两个源文档,但这一次,它们的许多字段被编辑和模糊,加上一条记录中省略了customer_info
embedded document omitted for one record due to it having been marked as RESTRICTED
, as shown below:customer_info
嵌入文档,因为它被标记为RESTRICTED
,如下所示:
[
{
card_name: 'Mx. Xxx Doe',
card_num: 'XXXXXXXXXXXX3456',
card_expiry: ISODate('2023-08-31T23:29:46.460Z'),
card_sec_code: '295',
card_type: 'CREDIT',
transaction_id: 'eb1bd77836e8713656d9bf2debba8900',
transaction_date: ISODate('2021-01-13T09:32:07.000Z'),
transaction_amount: NumberDecimal('492.4016988351474881660000000000000'),
reported: false
},
{
card_name: 'Mx. Xxx Smith',
card_num: 'XXXXXXXXXXXX7654',
card_expiry: ISODate('2023-01-01T00:34:49.330Z'),
card_sec_code: '437',
card_type: 'DEBIT',
transaction_id: '634c416a6fbcf060bb0ba90c4ad94f60',
transaction_date: ISODate('2020-11-24T19:25:57.000Z'),
transaction_amount: NumberDecimal('58.36081337486762223600000000000000'),
reported: false,
customer_info: { category: 'NORMAL', rating: 78, risk: 55 }
}
]
Observations观察
-
Targeted Redaction.有针对性的补救措施。The pipeline uses a如果类别字段等于$cond
operator to return the$$REMOVE
marker variable if thecategory
field is equal toRESTRICTED
.RESTRICTED
,则管道使用$cond
运算符返回$$REMOVE
标记变量。This informs the aggregation engine to exclude the whole这会通知聚合引擎将整个customer_info
sub-document from the stage's output for the record.customer_info
子文档从记录的阶段输出中排除。Alternatively, the pipeline could have used a或者,管道可以使用$redact
stage to achieve the same.$redact
阶段来实现同样的目的。However,然而,$redact
typically has to perform more processing work due to needing to check every field in the document.$redact
通常需要执行更多的处理工作,因为需要检查文档中的每个字段。Hence, if a pipeline is only to redact out one specific sub-document, use the approach outlined in this example.因此,如果管道只编辑出一个特定的子文档,请使用本例中概述的方法。 -
Regular Expression.正则表达式。For masking the为了屏蔽card_name
field, a regular expression operator is used to extract the last word of the field's original value.card_name
字段,使用正则表达式运算符提取字段原始值的最后一个字。$regexFind
returns metadata into the stage's output records, indicating if the match succeeded and what the matched value is.将元数据返回到阶段的输出记录中,指示匹配是否成功以及匹配的值是什么。Therefore, an additional因此,稍后还需要一个额外的$set
stage is required later in the pipeline to extract the actual matched word from this metadata and prefix it with some hard-coded text.$set
阶段来从元数据中提取实际匹配的单词,并在其前面加上一些硬编码的文本。MongoDB version 5.0 introduced a newMongoDB版本5.0引入了一个新的$getField
operator, which you can instead use to directly extract the "regex" result field (match
).$getField
运算符,您可以使用它直接提取“regex”结果字段(match)。Consequently, if you are using MongoDB 5.0 or greater, you can eliminate the second因此,如果您使用的是MongoDB 5.0或更高版本,您可以从管道的末尾消除第二个$set
stage from the end of your pipeline and then replace the line of code which sets the masked value of thecard_name
field to the following:$set
阶段,然后替换将card_name
字段的掩码值设置为以下值的代码行:// Prefix with a hard-coded value followed by the regex extracted last word of the card name "card_name": {"$concat": ["Mx. Xxx ", {"$ifNull": [{"$getField": {"field": "match", "input": {"$regexFind": {"input": "$card_name", "regex": /(\S+)$/}}}}, "Anonymous"]}]},
-
Meaningful Insight.有意义的见解。Even though the pipeline is irreversibly obfuscating fields, it doesn't mean that the masked data is useless for performing analytics to gain insight.尽管管道正在不可逆转地混淆字段,但这并不意味着屏蔽的数据对执行分析以获得洞察力毫无用处。The pipeline masks some fields by fluctuating the original values by a small but limited random percentage (e.g.管道通过将原始值波动一个小但有限的随机百分比(例如card_expiry
,transaction_amount
), rather than replacing them with completely random values (e.g.card_sec_code
).card_expiry
、transaction_aunt
)来屏蔽一些字段,而不是用完全随机的值替换它们(例如card_sec_code
)。In such cases, if the input data set is sufficiently large, then minor variances will be equalled out.在这种情况下,如果输入数据集足够大,则较小的方差将相等。For the fields that are only varied slightly, users can derive similar trends and patterns from analysing the masked data as they would the original data.对于变化不大的字段,用户可以通过分析屏蔽数据得出与原始数据相似的趋势和模式。