
Commit 80f0e46

mongodben and Ben Perlmutter authored
(EAI-1230): Create MDB collection and indexes for Atlas Search benchmark (#865)
* dataset python script
* fix
* Fix dataset upload
* new line EOFs
* simpler imports

---------

Co-authored-by: Ben Perlmutter <[email protected]>
1 parent 671a645 commit 80f0e46

File tree

9 files changed: +1779 -2 lines changed

packages/datasets/.env.example
Lines changed: 2 additions & 1 deletion

@@ -9,4 +9,5 @@ HUGGINGFACE_DOCS_CONTENT_REPO="someuser/some-repo"
 HUGGINGFACE_DOCS_CODE_EXAMPLE_REPO="someuser/some-repo"
 MONGODB_CONNECTION_URI="..."
 MONGODB_DATABASE_NAME="docs-chatbot-dev"
-MONGODB_TEXT_TO_CODE_CONNECTION_URI="..."
+MONGODB_TEXT_TO_CODE_CONNECTION_URI="..."
+MONGODB_ATLAS_SEARCH_CONNECTION_URI="..."
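The new `MONGODB_ATLAS_SEARCH_CONNECTION_URI` variable points the Atlas Search benchmark tooling at its own cluster. A minimal sketch of loading it, assuming `python-dotenv` and `pymongo` are installed; only the environment variable names come from `.env.example`, the rest is illustrative:

```python
# Sketch: load the connection settings added above from a local .env file.
# Assumes python-dotenv and pymongo are installed; only the environment
# variable names come from .env.example, the rest is illustrative.
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads .env from the current working directory

client = MongoClient(os.environ["MONGODB_ATLAS_SEARCH_CONNECTION_URI"])
db = client[os.environ.get("MONGODB_DATABASE_NAME", "docs-chatbot-dev")]
print(db.list_collection_names())
```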

packages/datasets/.gitignore
Lines changed: 1 addition & 1 deletion

@@ -1,2 +1,2 @@
 dataOut/
-__pycache__/
+__pycache__/

packages/datasets/.python-version
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+3.11

packages/datasets/README.md
Lines changed: 90 additions & 0 deletions (new file)

# Datasets Package

This package provides utilities for importing, processing, and managing datasets used in the MongoDB Knowledge Service/Chatbot project. It contains both Node.js/TypeScript and Python implementations for various dataset operations.

## Overview

The datasets package is a hybrid TypeScript/Python package that handles:

- Dataset ingestion from various sources (HuggingFace, Atlas, etc.)
- Data processing and transformation pipelines
- MongoDB import/export operations
- Code example extraction and classification
- Natural language query generation
- Database metadata extraction

## Structure

### Node.js/TypeScript Components

Located in the `/src/` directory:

- **Code Example Processing**: Extract and classify code examples from documentation
- **Page Dataset**: Load and process page-based datasets
- **Tree Generation**: Generate hierarchical data structures for NL queries
- **Database Operations**: MongoDB schema generation and database analysis
- **HuggingFace Integration**: Upload datasets to HuggingFace Hub
- **Evaluation**: Braintrust integration for dataset evaluation

### Python/UV Components

Located in the `/mongodb_datasets/` directory:

- **Wikipedia Import**: Import Wikipedia datasets from HuggingFace to MongoDB
- **Atlas Search**: Configure and create Atlas Search indexes
- **Configuration Management**: Environment variable and project configuration

## Installation & Setup

### Node.js Dependencies

```bash
npm install
npm run build
```

### Python Dependencies (using uv)

```bash
# Install Python dependencies
uv sync

# Activate virtual environment
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
```

## Usage

### Node.js Scripts

The package provides numerous npm scripts for different dataset operations:

```bash
# Build the project
npm run ...
```

### Python Scripts

The Python components provide CLI tools for dataset import operations:

```bash
# Import Wikipedia dataset (all articles)
uv run ...
```

## Configuration

### Environment Variables

For the required environment variables, see `.env.example` in the project root.
Create a `.env` file next to it with those variables set.

## Development

### Testing

```bash
# Node.js tests
npm run test

# Linting
npm run lint
npm run lint:fix
```
Lines changed: 3 additions & 0 deletions (new file)

"""MongoDB Datasets - Utilities for importing datasets into MongoDB."""

__version__ = "0.1.0"
Lines changed: 143 additions & 0 deletions (new file)

/**
Atlas Search index definitions for the `articles` collection.
*/
{
  "name": "article_search",
  "mappings": {
    "dynamic": false,
    /**
    Fields to index:
    - title: Full-text and autocomplete
    - text: Full-text
    - url: Exact and normalized
    */
    "fields": {
      /**
      Title with both full-text and autocomplete capabilities
      */
      "title": [
        {
          "type": "string",
          "analyzer": "lucene.standard"
        },
        /**
        Index optimized for autocomplete/type-ahead search.
        */
        {
          "type": "autocomplete",
          "analyzer": "lucene.standard",
          /**
          Min length of n-grams indexed is 2 characters.
          */
          "minGrams": 2,
          /**
          Max length of n-grams indexed is 15 characters.
          This is a reasonable compromise between search relevance, performance, and storage cost.
          */
          "maxGrams": 15,
          /**
          Fold diacritics to their base characters, e.g., "á" -> "a".
          */
          "foldDiacritics": true
        }
      ],
      /**
      Full-text search over the `text` field, which contains the article content.
      */
      "text": {
        "type": "string",
        "analyzer": "text_analyzer"
      },
      /**
      URL for filtering
      */
      "url": [
        {
          /**
          For normalized, fuzzy, flexible matching
          */
          "type": "string",
          "analyzer": "url_normalizer_analyzer"
        }
      ]
    }
  },
  /**
  Analyzers configuration for better text processing
  */
  "analyzers": [
    /**
    Optimized for text search over full documents in the `text` field
    */
    {
      "name": "text_analyzer",
      "tokenizer": {
        /**
        Standard tokenizer.
        From the docs: it divides text into terms based on word boundaries,
        which makes it language-neutral for most use cases.
        It converts all terms to lower case and removes punctuation.
        */
        "type": "standard"
      },
      "tokenFilters": [
        /**
        Remove accents
        */
        {
          "type": "icuFolding"
        },
        /**
        Remove possessive suffixes, e.g., "John's" -> "John"
        */
        {
          "type": "englishPossessive"
        },
        /**
        Stem words to their root form, e.g., "running" -> "run"
        */
        {
          "type": "kStemming"
        }
      ]
    },
    {
      "name": "url_normalizer_analyzer",
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          /**
          Remove http:// or https:// from the beginning
          */
          "type": "regex",
          "pattern": "^(https|http)://",
          "replacement": "",
          "matches": "first"
        },
        {
          /**
          Remove www. from the beginning
          */
          "type": "regex",
          "pattern": "^www\\.",
          "replacement": "",
          "matches": "first"
        },
        {
          /**
          Remove all trailing slashes
          */
          "type": "regex",
          "pattern": "/+$",
          "replacement": "",
          "matches": "first"
        }
      ]
    }
  ]
}
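A definition like the one above can be created programmatically and then queried through the `$search` aggregation stage. A minimal sketch using PyMongo (the `articles` collection, the `article_search` index name, and the field names come from the definition above; the file name and everything else is illustrative, not the package's actual code):

```python
# Sketch: create the search index defined above with PyMongo and run a
# type-ahead query against it. Assumes pymongo >= 4.5 and a comment-free
# JSON copy of the definition; file and variable names are illustrative.
import json
import os

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.environ["MONGODB_ATLAS_SEARCH_CONNECTION_URI"])
articles = client[os.environ["MONGODB_DATABASE_NAME"]]["articles"]

# The file's top level bundles the index name with its definition, so split
# them apart when building the SearchIndexModel.
with open("article_search_index.json") as f:  # hypothetical file name
    spec = json.load(f)

articles.create_search_index(
    SearchIndexModel(
        definition={"mappings": spec["mappings"], "analyzers": spec["analyzers"]},
        name=spec["name"],  # "article_search"
    )
)

# Autocomplete (type-ahead) query over the `title` mapping.
pipeline = [
    {
        "$search": {
            "index": "article_search",
            "autocomplete": {"query": "mongod", "path": "title"},
        }
    },
    {"$limit": 5},
    {"$project": {"_id": 0, "title": 1, "url": 1}},
]
for doc in articles.aggregate(pipeline):
    print(doc)
```

With the `url_normalizer_analyzer`, a URL such as "https://www.Example.com/docs/" and the string "example.com/docs" normalize to the same keyword token, so URL filters match regardless of protocol, "www." prefix, case, or trailing slashes. Note that Atlas builds search indexes asynchronously, so queries may return no results until the build completes.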
