Natural Language to MongoDB Queries

This page provides guidance on how to generate MongoDB queries for your data from natural language using a large language model (LLM).

For example, consider how a natural language query maps to a generated mongosh query for the Atlas sample_mflix database.

Given the following natural language query:

Show me the genres and runtime of
10 movies from 2015 that have
the most comments

The LLM generates mongosh code such as the following:

db.movies.aggregate([
  {
    $match: {
      year: 2015,
    },
  },
  {
    $sort: {
      num_mflix_comments: -1,
    },
  },
  {
    $limit: 10,
  },
  {
    $project: {
      _id: 0,
      genres: 1,
      runtime: 1,
    },
  },
]);

Available Methods

In addition to using LLMs out of the box, you can use the following tools built by MongoDB to generate MongoDB queries from natural language:

Selecting a Model

Models that perform well on general tasks typically also perform well at MongoDB query generation. When selecting an LLM to generate MongoDB queries, refer to popular benchmarks like MMLU-Pro and Chatbot Arena ELO to compare model performance.

Effective Prompting

This section outlines effective strategies for prompting an LLM to generate MongoDB queries.

Note

The following prompting strategies are based on benchmarks created by MongoDB. To learn more, see our public benchmark of natural language to mongosh code on Hugging Face.

Base Prompt

Your base prompt, also called the system prompt, should provide a clear overview of your task, including:

  • The type of query to generate.
  • Information about the expected output structure, such as the driver language or tool that executes the query.

The following base prompt example demonstrates how to generate a MongoDB read operation or aggregation for mongosh:

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.

Format the mongosh query in the following structure:

`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`

General Guidance

To improve query quality, add the following guidance to your base prompt to provide the model with common tips for generating effective MongoDB queries:

Some general query-authoring tips:

1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate)
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.)
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible
4. Include sorting (.sort()) and limiting (.limit()), when appropriate, for result set management
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`.
8. For Decimal128 operations, prefer range queries over exact equality
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks

Chain of Thought

You can prompt the model to "think out loud" before generating the response to improve response quality. This technique, called chain of thought prompting, improves performance but increases generation time and costs.

To encourage the model to think step-by-step before generating the query, add the following text to your base prompt:

Think step by step about the code in the answer before providing it. In your thoughts, consider:
1. Which collections are relevant to the query.
2. Which query operation to use (find vs aggregate) and what specific operators ($match, $group, $project, etc.) are needed.
3. What fields are relevant to the query.
4. Which indexes you can use to improve performance.
5. What specific transformations or projections are required.
6. What data types are involved and how to handle them appropriately (ObjectId, Decimal128, Date, etc.).
7. What edge cases to consider (empty results, null values, missing fields).
8. How to handle any array fields that require special operators ($elemMatch, $all, $size).
9. Any other relevant considerations.
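
Because chain of thought prompting trades cost for quality, it can help to make it a switch when assembling the system prompt. The following is a minimal sketch of that idea, not a MongoDB API: the names `BASE_PROMPT`, `CHAIN_OF_THOUGHT`, and `buildSystemPrompt` are hypothetical, and the chain-of-thought text is shortened to its first sentence for brevity.

```javascript
// Hypothetical helper: assembles a system prompt from the pieces described above.
// None of these names come from MongoDB; they are illustrative only.
const BASE_PROMPT = [
  "You are an expert data analyst experienced at using MongoDB.",
  "Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.",
].join("\n");

// Shortened here; in practice, use the full chain-of-thought text.
const CHAIN_OF_THOUGHT =
  "Think step by step about the code in the answer before providing it.";

function buildSystemPrompt({ guidance = [], chainOfThought = false } = {}) {
  const sections = [BASE_PROMPT];

  // Number the tips the same way the guidance list above does.
  if (guidance.length > 0) {
    const tips = guidance.map((tip, i) => `${i + 1}. ${tip}`).join("\n");
    sections.push(`Some general query-authoring tips:\n\n${tips}`);
  }

  // Only pay the extra generation cost when the caller opts in.
  if (chainOfThought) {
    sections.push(CHAIN_OF_THOUGHT);
  }

  return sections.join("\n\n");
}

const systemPrompt = buildSystemPrompt({
  guidance: ["For Decimal128 operations, prefer range queries over exact equality"],
  chainOfThought: true,
});
console.log(systemPrompt);
```

Keeping each section as a separate string makes it straightforward to benchmark the prompt with and without individual components.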

Include Sample Documents

To significantly improve query quality, include a few representative sample documents from your collection. Two to three documents typically give the model sufficient context about the data structure.

When providing sample documents, follow these guidelines:

  • Use the BSON.EJSON.serialize() function to convert BSON documents to Extended JSON (EJSON) for the prompt.
  • Truncate long fields or deeply nested objects.
  • Exclude long string values.
  • For large arrays, like vector embeddings, include only a few elements.
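
The truncation guidelines above can be sketched as a small recursive function. This is an illustrative sketch only: `truncateForPrompt` and its limits (`maxString`, `maxArray`, `maxDepth`) are hypothetical names and defaults, and the input is assumed to be a plain object that has already been converted to EJSON.

```javascript
// Hypothetical helper for trimming a sample document before adding it to a prompt.
// Assumes the input is already plain EJSON (no live BSON objects).
// The default limits are illustrative, not values recommended by MongoDB.
function truncateForPrompt(value, opts = {}, depth = 0) {
  const { maxString = 100, maxArray = 3, maxDepth = 3 } = opts;

  // Cut long strings and mark the cut with an ellipsis.
  if (typeof value === "string") {
    return value.length > maxString ? value.slice(0, maxString) + "..." : value;
  }

  // Keep only the first few array elements and note how many were dropped.
  if (Array.isArray(value)) {
    const kept = value
      .slice(0, maxArray)
      .map((item) => truncateForPrompt(item, opts, depth + 1));
    const dropped = value.length - kept.length;
    if (dropped > 0) kept.push(`...and ${dropped} more items`);
    return kept;
  }

  // Recurse into nested objects, replacing anything too deep with a marker.
  if (value !== null && typeof value === "object") {
    if (depth >= maxDepth) return "...";
    return Object.fromEntries(
      Object.entries(value).map(([key, v]) => [key, truncateForPrompt(v, opts, depth + 1)])
    );
  }

  return value; // numbers, booleans, and null pass through unchanged
}

// Made-up document used only to exercise the function.
const sampleDocument = {
  title: "Example Movie",
  fullplot: "x".repeat(150),
  cast: ["Actor A", "Actor B", "Actor C", "Actor D"],
  released: { $date: "2011-01-07T00:00:00.000Z" },
};
console.log(JSON.stringify(truncateForPrompt(sampleDocument), null, 2));
```

Truncating after EJSON conversion keeps type wrappers such as `$oid` and `$date` visible to the model while still bounding the prompt size.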

Sample Documents Example

Sample documents for the movies collection to include in a prompt

For example, for the sample_mflix database and movies collection, you might include the following documents in your prompt:

[
  {
    _id: {
      $oid: "573a13bbf29313caabd526d0",
    },
    plot: "Van Erp shows us what the Dutch do in their spare time and takes a look at the industry behind all t...",
    genres: ["Documentary"],
    runtime: 90,
    title: "Pretpark Nederland",
    num_mflix_comments: 0,
    poster:
      "https://m.media-amazon.com/images/M/MV5BMTUwNjU0ODg3N15BMl5BanBnXkFtZTcwMzg3NjYxNA@@._V1_SY1000_SX67...",
    countries: ["Netherlands"],
    fullplot:
      "Van Erp displays the mechanics behind the Dutch tourism industry. Key figures behind events and dest...",
    languages: ["Dutch", "Mandarin"],
    released: {
      $date: "2006-10-18T00:00:00.000Z",
    },
    directors: ["Michiel van Erp"],
    writers: ["Renè van 't Erve (scenario)", "Michiel van Erp (scenario)"],
    awards: {
      wins: 0,
      nominations: 1,
      text: "1 nomination.",
    },
    lastupdated: "2015-02-26T00:48:24.883Z",
    year: 2006,
    imdb: {
      rating: 7.3,
      votes: 237,
      id: 882800,
    },
    type: "movie",
    tomatoes: {
      viewer: {
        rating: 2.2,
        numReviews: 19,
      },
      dvd: {
        $date: "2010-06-22T00:00:00.000Z",
      },
      lastUpdated: {
        $date: "2014-11-24T14:15:50.000Z",
      },
    },
    hash: {
      low: -1866172407,
      high: -2147460187,
      unsigned: false,
    },
  },
  {
    _id: {
      $oid: "573a13caf29313caabd7c4e0",
    },
    fullplot:
      "A drama centered on a rising country-music songwriter (Hedlund) who sparks with a fallen star (Paltr...",
    imdb: {
      rating: 6.3,
      votes: 14066,
      id: 1555064,
    },
    year: 2010,
    plot: "A rising country-music songwriter works with a fallen star to work their way fame, causing romantic ...",
    genres: ["Drama", "Music"],
    rated: "PG-13",
    metacritic: 45,
    title: "Country Strong",
    lastupdated: "2015-09-03T00:39:54.710Z",
    languages: ["English"],
    writers: ["Shana Feste"],
    type: "movie",
    tomatoes: {
      website: "http://www.countrystrong-movie.com/?hs308=CST6186",
      viewer: {
        rating: 3.3,
        numReviews: 32825,
        meter: 53,
      },
      dvd: {
        $date: "2011-04-12T00:00:00.000Z",
      },
      critic: {
        rating: 4.5,
        numReviews: 130,
        meter: 22,
      },
      boxOffice: "$20.2M",
      consensus:
        "The cast gives it their all, and Paltrow handles her songs with aplomb, but Country Strong's cliched...",
      rotten: 101,
      production: "Screen Gems",
      lastUpdated: {
        $date: "2015-08-17T18:04:40.000Z",
      },
      fresh: 29,
    },
    poster:
      "https://m.media-amazon.com/images/M/MV5BMTUxMjQ0NjE3OV5BMl5BanBnXkFtZTcwODIxNDEwNA@@._V1_SY1000_SX67...",
    num_mflix_comments: 0,
    released: {
      $date: "2011-01-07T00:00:00.000Z",
    },
    awards: {
      wins: 2,
      nominations: 6,
      text: "Nominated for 1 Oscar. Another 1 win & 6 nominations.",
    },
    countries: ["USA"],
    cast: [
      "Gwyneth Paltrow",
      "Tim McGraw",
      "Garrett Hedlund",
      "...and 1 more items",
    ],
  },
];

Best Practices

Apply the following prompting best practices for specific use cases when generating MongoDB queries from natural language.

Include Index Information

Include collection indexes in your prompt to encourage the LLM to generate more performant queries. MongoDB drivers and mongosh provide methods to get index information. For example, the Node.js driver provides the listIndexes() method to get indexes for your prompt.
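
One possible way to turn index information into prompt text is sketched below. `describeIndexes` is a hypothetical helper, and the compound index in the example is made up for illustration. The input is assumed to be an array of index documents with `name` and `key` fields, and optional `unique` and `sparse` flags, which is the shape that listIndexes() results take once collected into an array.

```javascript
// Hypothetical formatter: turns index documents into compact prompt lines.
// Input is assumed to look like { v: 2, key: { year: 1 }, name: "year_1" }.
function describeIndexes(indexes) {
  return indexes
    .map((index) => {
      // Surface only the options most likely to affect query shape.
      const options = [];
      if (index.unique) options.push("unique");
      if (index.sparse) options.push("sparse");
      const suffix = options.length > 0 ? ` (${options.join(", ")})` : "";
      return `${index.name}: ${JSON.stringify(index.key)}${suffix}`;
    })
    .join("\n");
}

// Illustrative input. With the Node.js driver, an array like this would come
// from something like `await collection.listIndexes().toArray()`.
const exampleIndexes = [
  { v: 2, key: { _id: 1 }, name: "_id_" },
  {
    v: 2,
    key: { year: 1, num_mflix_comments: -1 },
    name: "year_1_num_mflix_comments_-1",
  },
];
console.log(describeIndexes(exampleIndexes));
```

A one-line-per-index format keeps the key order of compound indexes visible, which matters when the model decides how to order filter and sort fields.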

Time-Based Queries

Most LLM tools include the date in their system prompt. However, if you're using an LLM out of the box, the model does not know the current date or time. Therefore, when working with base models or building your own natural language to MongoDB tools, include the latest date in your prompt. Get the current date as a string using the method for your programming language, such as JavaScript's new Date().toString() or Python's str(datetime.now()).
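
As a small sketch of this step, the helper below formats a date line for the prompt. `latestDateLine` is a hypothetical name, and the "Latest Date" label is an assumption borrowed from the user message template on this page. The sketch uses `toISOString()` rather than `toString()` so that the value is in UTC and does not depend on the server's locale; either approach gives the model the current date.

```javascript
// Hypothetical helper: formats the current date for inclusion in a prompt.
// Accepting the date as a parameter keeps the function deterministic in tests.
function latestDateLine(now = new Date()) {
  // toISOString() always returns UTC in the form YYYY-MM-DDTHH:mm:ss.sssZ.
  return `Latest Date: ${now.toISOString()}`;
}

console.log(latestDateLine());
```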

Annotated Database Schemas

Include annotated schemas of relevant database collections in your prompt. While no single representation method works best for all LLMs, some approaches are more effective than others.

We recommend representing collections using programming language-native types that describe data shape, such as TypeScript types, Python Pydantic models, or Go structs. If you use MongoDB from these languages, you likely have the data shape defined already. To guide the LLM and reduce ambiguity, add comments to your prompt to describe each field.

The following example shows a TypeScript type for the sample_mflix.movies collection:

TypeScript Schema Example

Annotated TypeScript schema example for the sample_mflix.movies collection

interface Movie {
  /**
   * Unique identifier for the movie document.
   */
  _id: ObjectId;
  /**
   * Brief description of the movie's plot.
   */
  plot: string;
  /**
   * List of genres associated with the movie.
   */
  genres: string[];
  /**
   * Duration of the movie in minutes.
   */
  runtime: number;
  /**
   * Title of the movie.
   */
  title: string;
  /**
   * Number of comments on the movie in the mflix system.
   */
  num_mflix_comments: number;
  /**
   * URL to the movie's poster image.
   */
  poster: string;
  /**
   * List of countries where the movie was produced.
   */
  countries: string[];
  /**
   * Detailed description of the movie's plot.
   */
  fullplot: string;
  /**
   * Languages spoken in the movie.
   */
  languages: string[];
  /**
   * Release date of the movie.
   */
  released: Date;
  /**
   * List of directors of the movie.
   */
  directors: string[];
  /**
   * List of writers of the movie.
   */
  writers: string[];
  /**
   * Awards received by the movie.
   */
  awards: {
    /**
     * Number of awards won by the movie.
     */
    wins: number;
    /**
     * Number of award nominations received by the movie.
     */
    nominations: number;
    /**
     * Textual description of the awards.
     */
    text: string;
  };
  /**
   * Last updated timestamp for the movie document.
   */
  lastupdated: string;
  /**
   * Year the movie was released.
   */
  year: number;
  /**
   * IMDb information for the movie.
   */
  imdb: {
    /**
     * IMDb rating of the movie.
     */
    rating: number;
    /**
     * Number of votes the movie received on IMDb.
     */
    votes: number;
    /**
     * IMDb identifier for the movie.
     */
    id: number;
  };
  /**
   * Type of the movie (e.g., movie, series).
   */
  type: string;
  /**
   * Rotten Tomatoes information for the movie.
   */
  tomatoes: {
    /**
     * Viewer ratings on Rotten Tomatoes.
     */
    viewer?: {
      /**
       * Viewer rating score.
       */
      rating: number;
      /**
       * Number of reviews by viewers.
       */
      numReviews: number;
      /**
       * Viewer meter score.
       */
      meter: number;
    };
    /**
     * DVD release date.
     */
    dvd?: Date;
    /**
     * Last updated timestamp for Rotten Tomatoes data.
     */
    lastUpdated?: Date;
    /**
     * Official website for the movie.
     */
    website?: string;
    /**
     * Critic ratings on Rotten Tomatoes.
     */
    critic?: {
      /**
       * Critic rating score.
       */
      rating: number;
      /**
       * Number of reviews by critics.
       */
      numReviews: number;
      /**
       * Critic meter score.
       */
      meter: number;
    };
    /**
     * Box office earnings.
     */
    boxOffice?: string;
    /**
     * Consensus statement from Rotten Tomatoes.
     */
    consensus?: string;
    /**
     * Number of rotten reviews.
     */
    rotten?: number;
    /**
     * Production company.
     */
    production?: string;
    /**
     * Number of fresh reviews.
     */
    fresh?: number;
  };
  /**
   * Hash value for the movie document.
   */
  hash: Long;
  /**
   * MPAA rating of the movie.
   */
  rated?: string;
  /**
   * Metacritic score of the movie.
   */
  metacritic?: number;
  /**
   * List of main cast members in the movie.
   */
  cast: string[];
}

Prompt Template

The following example demonstrates a complete prompt using the strategies described on this page for generating mongosh code from natural language.

Base Prompt Example

Use the following system prompt example as a template for your MongoDB query generation tasks. The sample prompt includes the following components:

  • Task overview and expected output format
  • General MongoDB query authoring guidance

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.

Format the mongosh query in the following structure:

`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`

Some general query-authoring tips:

1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate).
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.).
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible.
4. Include sorting (.sort()) and limiting (.limit()) when appropriate for result set management.
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays.
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null.
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. Use the provided 'Latest Date' field to inform dates in queries.
8. For Decimal128 operations, prefer range queries over exact equality.
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks.

Note

You might also add the chain-of-thought prompt to encourage step-by-step thinking before code generation.

User Message Template

Then, use the following user message template to provide the model with the necessary context about your database and your desired query:

Generate MongoDB Shell (mongosh) queries for the following database and natural language query:

## Database Information

Name: {{Database name}}
Description: {{database description}}
Latest Date: {{latest date}} (use this to inform dates in queries)

### Collections

#### Collection `{{collection name. Do for each collection you want to query over}}`
Description: {{collection description}}

Schema:
```
{{interpreted or annotated schema here}}
```

Example documents:
```
{{truncated example documents here}}
```

Indexes:
```
{{collection index descriptions here}}
```

Natural language query: {{Natural language query here}}
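
Filling in the template above can be sketched as a plain string-building function. This is an assumption-laden illustration rather than a MongoDB utility: `buildUserMessage` and the property names on its input (`name`, `description`, `latestDate`, `collections`, `query`, and each collection's `schema`, `exampleDocuments`, and `indexes`) are hypothetical, and the example values are placeholders.

```javascript
// Hypothetical helper that fills in the user message template above.
// All property names on the input object are illustrative.
const FENCE = "`".repeat(3); // builds the template's code fence markers

function buildUserMessage({ name, description, latestDate, collections, query }) {
  // Render one section per collection, mirroring the template's layout.
  const collectionSections = collections.map((c) =>
    [
      `#### Collection \`${c.name}\``,
      `Description: ${c.description}`,
      "",
      "Schema:",
      FENCE,
      c.schema,
      FENCE,
      "",
      "Example documents:",
      FENCE,
      c.exampleDocuments,
      FENCE,
      "",
      "Indexes:",
      FENCE,
      c.indexes,
      FENCE,
    ].join("\n")
  );

  return [
    "Generate MongoDB Shell (mongosh) queries for the following database and natural language query:",
    "",
    "## Database Information",
    "",
    `Name: ${name}`,
    `Description: ${description}`,
    `Latest Date: ${latestDate} (use this to inform dates in queries)`,
    "",
    "### Collections",
    "",
    collectionSections.join("\n\n"),
    "",
    `Natural language query: ${query}`,
  ].join("\n");
}

// Placeholder values; substitute your own schema, documents, and indexes.
const userMessage = buildUserMessage({
  name: "sample_mflix",
  description: "Example movie database",
  latestDate: "2024-10-24T00:00:00.000Z",
  collections: [
    {
      name: "movies",
      description: "One document per movie",
      schema: "interface Movie { title: string; year: number }",
      exampleDocuments: '[{ "title": "Example Movie", "year": 2015 }]',
      indexes: '_id_: {"_id":1}',
    },
  ],
  query: "Show me the genres and runtime of 10 movies from 2015 that have the most comments",
});
console.log(userMessage);
```

Passing the schema, example documents, and indexes in as preformatted strings keeps this function independent of how each piece is produced.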