
Commit 25c7ffa

addaleax and Copilot authored
Schema design skill MCP-420 (#7)
* Copy schema-design files from romiluz13/mongodb-agent-skills
* Update to Anthropic guidelines
* Updates based on content review
* Remove files called out as unnecessary
* Expand key principles based on eval testing
* Fixup: reformat
* Asya CR comments
* Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Copilot CR
* SKILL.md updates
* Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Reduce token count for reference files
* Asya CR
* Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Copilot CR
* Drop explicit schema validation entries
* Merge document size concern files
* Merge referencing reference files
* Fixup: reduce SKILL.md size a bit, guide more actively towards when to use specific files
* Bring under 25k limit
* Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Copilot CR
* Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Copilot CR
* Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* CR suggestion
* Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Polymorphic pattern does not need discriminator
* Routing improvements
* Dachary CR
* Further CR
* Update skills/mongodb-schema-design/references/pattern-schema-versioning.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Copilot CR

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
1 parent ff21280 commit 25c7ffa

19 files changed: +2898 -0 lines changed
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
1+
---
name: mongodb-schema-design
description: MongoDB schema design patterns and anti-patterns. Use when designing data models, reviewing schemas, migrating from SQL, or troubleshooting performance issues caused by schema problems. Triggers on "design schema", "embed vs reference", "MongoDB data model", "schema review", "unbounded arrays", "one-to-many", "tree structure", "16MB limit", "schema validation", "JSON Schema", "time series", "schema migration", "polymorphic", "TTL", "data lifecycle", "archive", "index explosion", "unnecessary indexes", "approximation pattern", "document versioning".
license: Apache-2.0
---
6+
7+
# MongoDB Schema Design
8+
9+
Data modeling patterns and anti-patterns for MongoDB, maintained by MongoDB. Bad schema is the root cause of most MongoDB performance and cost issues—queries and indexes cannot fix a fundamentally wrong model.
10+
11+
## When to Apply
12+
13+
Reference these guidelines when:
14+
- Designing a new MongoDB schema from scratch
15+
- Migrating from SQL/relational databases to MongoDB
16+
- Reviewing existing data models for performance issues
17+
- Troubleshooting slow queries or growing document sizes
18+
- Deciding between embedding and referencing
19+
- Modeling relationships (one-to-one, one-to-many, many-to-many)
20+
- Implementing tree/hierarchical structures
21+
- Seeing Atlas Schema Suggestions or Performance Advisor warnings
22+
- Hitting the 16MB document limit
23+
- Adding schema validation to existing collections
24+
25+
## Quick Reference
26+
27+
### 1. Schema Anti-Patterns - 3 rules
28+
29+
- [antipattern-unnecessary-collections](references/antipattern-unnecessary-collections.md) - Splitting homogeneous data into multiple collections is often an anti-pattern; consult this reference to validate whether this is the case.
30+
- [antipattern-excessive-lookups](references/antipattern-excessive-lookups.md) - When encountering overly normalized collections that reference each other or frequent and possibly slow $lookup operations, consult this reference to validate whether this is problematic and how to fix it.
31+
- [antipattern-unnecessary-indexes](references/antipattern-unnecessary-indexes.md) - Consult this reference when indexes overlap or are not used by queries, to identify and remove unnecessary indexes that add overhead without benefit.
32+
33+
### 2. Schema Fundamentals - 4 rules
34+
35+
- [fundamental-embed-vs-reference](references/fundamental-embed-vs-reference.md) - Consult this reference for approaches to modeling different types of relationships (1:1, 1:few, 1:many, many:many, tree/hierarchical data) and how to decide between embedding and referencing based on access patterns.
36+
- [fundamental-document-model](references/fundamental-document-model.md) - Fundamentals of the document model. Consult this reference when migrating from SQL or other normalized data to a document database like MongoDB.
37+
- [fundamental-schema-validation](references/fundamental-schema-validation.md) - Consult this reference when creating new collections, or adding validation to existing collections, for example in response to finding inconsistent document structures or data quality issues.
38+
- [fundamental-document-size](references/fundamental-document-size.md) - Consult this reference when documents hit the hard 16MB limit, or when accesses are slower than expected as a result of large documents.
39+
40+
### 3. Design Patterns - 11 rules
41+
42+
- [pattern-approximation](references/pattern-approximation.md) - Use approximate values for high-frequency counters
43+
- [pattern-archive](references/pattern-archive.md) - Move historical data to separate/cold storage for performance
44+
- [pattern-attribute](references/pattern-attribute.md) - Collapse many optional fields into key-value attributes
45+
- [pattern-bucket](references/pattern-bucket.md) - Group time-series or IoT data into buckets
46+
- [pattern-computed](references/pattern-computed.md) - Pre-calculate expensive aggregations
47+
- [pattern-document-versioning](references/pattern-document-versioning.md) - Track document changes to enable historical queries and audit trails
48+
- [pattern-extended-reference](references/pattern-extended-reference.md) - Cache frequently-accessed data from related entities
49+
- [pattern-outlier](references/pattern-outlier.md) - Handle collections in which a small subset of documents are much larger than the rest, to prevent outliers from dominating memory and index costs
50+
- [pattern-polymorphic](references/pattern-polymorphic.md) - Store different types of entities in the same collection, often when they are different types of the same base entity (e.g. different types of users or different types of products)
51+
- [pattern-schema-versioning](references/pattern-schema-versioning.md) - Schema evolution, preventing drift, and safe online migrations. Consult when encountering inconsistent document structures, or when planning a schema change that cannot be applied atomically.
52+
- [pattern-time-series-collections](references/pattern-time-series-collections.md) - Use native time series collections for high-frequency time series data
53+
54+
## Key Principle
55+
56+
> **"Data that is accessed together should be stored together."**
57+
58+
This is MongoDB's core philosophy. Embedding related data eliminates joins, reduces round trips, and enables atomic updates. Reference only when you must.
59+
60+
MongoDB's **flexible schemas** are a core way to put this philosophy into practice. Documents in the same collection can have different fields, and even different structures, so you can model data in the way that best fits your access patterns without being constrained by a rigid schema. If different documents have different sets of fields, that is perfectly fine as long as it serves your application's needs. You can also use schema validation to enforce certain rules while still allowing for flexibility.
61+
62+
Another implication of the key principle is that the expected read and write workload matters for schema design. If pieces of information from different entities are often queried or updated together, co-locating them in the same document can yield significant performance benefits. Conversely, if certain pieces of information are rarely accessed together, it may make sense to store them separately to avoid loading more data than necessary.
63+
64+
### Schema Fundamentals Summary
65+
66+
- **Embed vs Reference**: Choose embedding or referencing based on access patterns: embed when data is always accessed together (1:1, 1:few, bounded arrays, atomic updates needed); reference when data is accessed independently, relationships are many-to-many, or arrays can grow without bound.
67+
- **Data accessed together stored together**: MongoDB's core principle: design schemas around queries, not entities. Embed related data to eliminate cross-collection joins and reduce round trips. Identify your API endpoints/pages, list the data each returns, then shape documents to match those queries.
68+
- **Embrace the document model**: Don't recreate SQL tables 1:1 as MongoDB collections. Instead, denormalize joined tables into rich documents for single-query reads and atomic updates. When migrating from SQL, identify tables that are always joined together and merge them into single documents.
69+
- **Schema validation**: Use MongoDB's built-in `$jsonSchema` validator to catch invalid data at the database level (type checks, required fields, enum constraints, array size limits). Start with `validationLevel: "moderate"` and `validationAction: "warn"` on existing collections, then tighten to `strict`/`error`.
70+
- **16MB document limit**: MongoDB documents cannot exceed 16MB—this is a hard limit, not a guideline. Common causes: unbounded arrays, large embedded binaries, deeply nested objects. Mitigate by moving unbounded data to separate collections and monitoring document sizes with `$bsonSize`.
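
The staged validation rollout described above can be sketched as two `collMod` command documents. This is a minimal illustration, not the skill's reference implementation: the `orders` collection and its fields (`customerId`, `status`, `items`) are hypothetical, and the objects are built as plain JavaScript so they can be inspected before being passed to `db.runCommand(...)` in mongosh.

```javascript
// Hypothetical validator for an assumed "orders" collection.
const orderValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["customerId", "status", "items"],
    properties: {
      customerId: { bsonType: "objectId" },
      status: { enum: ["pending", "shipped", "delivered"] },
      // Array size limits help keep documents bounded.
      items: { bsonType: "array", minItems: 1, maxItems: 500 }
    }
  }
};

// Step 1: lenient rollout on an existing collection.
// "moderate" skips existing documents that are already invalid;
// "warn" logs violations instead of rejecting writes.
const lenientRollout = {
  collMod: "orders",
  validator: orderValidator,
  validationLevel: "moderate",
  validationAction: "warn"
};

// Step 2: tighten once the logs show no remaining violations.
const strictRollout = {
  ...lenientRollout,
  validationLevel: "strict",
  validationAction: "error"
};

// In mongosh: db.runCommand(lenientRollout), and later db.runCommand(strictRollout)
```

Keeping the validator in one object and varying only the level and action makes it easier to confirm that the strict phase enforces exactly the rules that were observed during the warning phase.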
71+
72+
## Embed/Reference Decision Framework
73+
74+
| Relationship | Cardinality | Access Pattern | Recommendation |
|-------------|-------------|----------------|----------------|
| One-to-One | 1:1 | Always together | Embed |
| One-to-Few | 1:N (N < 100) | Usually together | Embed array |
| One-to-Many | 1:N (N > 100) | Often separate | Reference |
| Many-to-Many | M:N | Varies | Two-way reference |
80+
81+
This is a **rough** guideline, and whether to embed or reference depends on your specific access patterns, data size, and read/write frequencies. Always verify with your actual workload.
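
As a rough illustration of the first rows of the table, the sketch below contrasts an embedded one-to-few array with a referenced one-to-many relationship. All names (`userWithAddresses`, `post`, `comments`, `postId`) are hypothetical, and plain string ids stand in for `ObjectId` values so the shapes can be examined outside a database.

```javascript
// One-to-few: a small, bounded set of addresses lives inside the user document,
// so one read returns everything the profile page needs.
const userWithAddresses = {
  _id: "user-1",
  name: "Ada",
  addresses: [
    { label: "home", city: "London" },
    { label: "work", city: "Cambridge" }
  ]
};

// One-to-many: comments can grow without bound, so each comment
// references its parent post instead of being embedded in it.
const post = { _id: "post-1", title: "Schema design" };
const comments = [
  { _id: "c-1", postId: "post-1", text: "Nice write-up" },
  { _id: "c-2", postId: "post-1", text: "Thanks" },
  { _id: "c-3", postId: "post-2", text: "Unrelated" }
];

// In mongosh this would be db.comments.find({ postId: post._id }),
// ideally supported by an index on { postId: 1 }.
const commentsForPost = comments.filter((c) => c.postId === post._id);
```

The embedded form trades independent access to addresses for a single read; the referenced form keeps the post document small at the cost of a second query.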
82+
83+
## How to Use
84+
85+
Each reference file listed above contains detailed explanations and code examples. Use the descriptions in the Quick Reference to identify which files are relevant to your current task.
86+
87+
Each reference file contains:
88+
- Brief explanation of why it matters
89+
- Incorrect code example with explanation
90+
- Correct code example with explanation
91+
- "When NOT to use" exceptions
92+
- Performance impact and metrics
93+
- Verification diagnostics
94+
95+
---
96+
97+
## How These Rules Work
98+
99+
### MongoDB MCP Integration
100+
101+
For automatic verification, connect the [MongoDB MCP Server](https://github.com/mongodb-js/mongodb-mcp-server).
102+
103+
If the MCP server is running and connected, I can automatically run verification commands to check your actual schema, document sizes, array lengths, index usage, and more. This allows me to provide tailored recommendations based on your real data, not just code patterns.
104+
105+
**⚠️ Security**: Use `--readOnly` for safety. Remove only if you need write operations.
106+
107+
When connected, I can automatically:
108+
- Infer schema via `mcp__mongodb__collection-schema`
109+
- Measure document/array sizes via `mcp__mongodb__aggregate`
110+
- Check collection statistics via `mcp__mongodb__db-stats`
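
For reference, a size check of the kind run through `aggregate` might look like the pipeline below. It is a sketch under assumptions: the array field name `items` is hypothetical, and `$bsonSize` requires MongoDB 4.4 or later.

```javascript
// Hard BSON document limit, in bytes.
const MAX_BSON_BYTES = 16 * 1024 * 1024;

// Read-only pipeline: report the ten largest documents and the length
// of an assumed array field named "items".
const largestDocsPipeline = [
  {
    $project: {
      sizeBytes: { $bsonSize: "$$ROOT" },
      // Guard with $isArray so documents without the array do not raise an error.
      itemCount: { $cond: [{ $isArray: "$items" }, { $size: "$items" }, 0] }
    }
  },
  { $sort: { sizeBytes: -1 } },
  { $limit: 10 }
];

// Helper for interpreting results: fraction of the 16MB limit used.
function fractionOfLimit(sizeBytes) {
  return sizeBytes / MAX_BSON_BYTES;
}

// In mongosh: db.<collection>.aggregate(largestDocsPipeline)
```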
111+
112+
### ⚠️ Action Policy
113+
114+
**I will NEVER execute write operations without your explicit approval.**
115+
116+
Before any write or destructive operation via MCP, I will: (1) summarize the exact operation (collection, index/validator, estimated number of docs affected), and (2) ask for explicit confirmation (yes/no). I will not proceed on partial or ambiguous approvals.
117+
118+
| Operation Type | MCP Tools | Action |
|---------------|-----------|--------|
| **Read (Safe)** | `find`, `aggregate`, `collection-schema`, `db-stats`, `count` | I may run automatically to verify |
| **Write (Requires Approval)** | `update-many`, `insert-many`, `create-collection` | I will show the command and wait for your "yes" |
| **Destructive (Requires Approval)** | `delete-many`, `drop-collection`, `drop-database` | I will warn you and require explicit confirmation |
123+
124+
When I recommend schema changes or data modifications:
125+
1. I'll explain **what** I want to do and **why**
126+
2. I'll show you the **exact command**
127+
3. I'll **wait for your approval** before executing
128+
4. If you say "go ahead" or "yes", only then will I run it
129+
130+
**Your database, your decision.** I'm here to advise, not to act unilaterally.
131+
132+
### Working Together
133+
134+
If you're not sure about a recommendation:
135+
1. Run the verification commands I provide
136+
2. Share the output with me
137+
3. I'll adjust my recommendation based on your actual data
138+
139+
We're a team—let's get this right together.
140+
141+
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
1+
---
title: Reduce Excessive $lookup Usage
impact: CRITICAL
impactDescription: "Can reduce query cost on hot paths by avoiding repeated cross-collection joins"
tags: schema, lookup, anti-pattern, joins, denormalization, atlas-suggestion
---
7+
8+
## Reduce Excessive $lookup Usage
9+
10+
**Frequent $lookup operations on hot paths can indicate over-normalization.** `$lookup` is useful, but repeated joins can be slower and more resource-intensive than querying a single collection, especially when supporting indexes or match selectivity are weak. If the same related fields are read together often, consider embedding or extended references.
11+
12+
**Incorrect (constant $lookup for common operations):**
13+
14+
```javascript
// Every product page requires repeated joins across collections
db.products.aggregate([
  { $match: { _id: productId } },
  { $lookup: {
    from: "categories",   // Extra fetch #1: separate lookup into categories (uses the _id index)
    localField: "categoryId",
    foreignField: "_id",
    as: "category"
  }},
  { $lookup: {
    from: "brands",       // Extra fetch #2: separate lookup into brands (uses the _id index)
    localField: "brandId",
    foreignField: "_id",
    as: "brand"
  }},
  { $unwind: "$category" },
  { $unwind: "$brand" }
])
// Multiple join stages add planning/execution overhead on hot paths
```
35+
36+
Join cost depends on cardinality, stage order, index support, and result size. Measure before deciding to embed.
37+
38+
**Correct (denormalize frequently-joined data):**
39+
40+
Embed data that is always displayed alongside the product directly in the product document: include category fields (`_id`, `name`, `path`) and brand fields (`_id`, `name`, `logo`) as subdocuments. A single indexed query returns complete product data without `$lookup`. Listing queries (e.g. by category) also run against a single collection.
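
A minimal sketch of the denormalized shape described above. The subdocument field names (`_id`, `name`, `path`, `logo`) come from the text; the concrete values, the `price` field, and the string ids are illustrative assumptions.

```javascript
// Product document with category and brand embedded as subdocuments.
const product = {
  _id: "prod-1",
  name: "Trail Shoe",
  price: 129.0,
  category: { _id: "cat-7", name: "Footwear", path: "/outdoor/footwear" },
  brand: { _id: "brand-3", name: "Acme", logo: "https://example.com/acme.png" }
};

// Product page: one lookup by _id returns everything, no $lookup needed.
// mongosh: db.products.findOne(productPageFilter)
const productPageFilter = { _id: product._id };

// Category listing: still a single-collection query.
// mongosh: db.products.createIndex(categoryListingIndex)
//          db.products.find(categoryListingFilter)
const categoryListingFilter = { "category._id": product.category._id };
const categoryListingIndex = { "category._id": 1 };
```

Only the fields that are displayed with the product are copied; the `categories` and `brands` collections can remain the source of truth for everything else.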
41+
42+
**Managing denormalized data updates:**
43+
44+
When category data changes (a rare event), use `updateMany` to update all products matching that category’s `_id` with the new field values. For frequently-changing data, keep both a reference ID (`brandId`) and a cache subdocument (`brandCache`) with a `cachedAt` timestamp; refresh the cache when it exceeds a staleness threshold.
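
Both update strategies can be sketched as small helpers. The names `brandCache` and `cachedAt` come from the text above; the 24-hour threshold, the helper names, and the embedded `category` subdocument layout are assumptions made for illustration.

```javascript
// Rare change: build the filter/update pair that fans new category
// fields out to every product embedding that category.
function categoryFanOut(category) {
  return {
    filter: { "category._id": category._id },
    update: {
      $set: { "category.name": category.name, "category.path": category.path }
    }
  };
}
// mongosh: const { filter, update } = categoryFanOut(changedCategory)
//          db.products.updateMany(filter, update)

// Frequently-changing data: keep brandId plus a cached copy, and refresh
// the copy once it is older than a staleness threshold (assumed: 24 hours).
const MAX_CACHE_AGE_MS = 24 * 60 * 60 * 1000;

function isBrandCacheStale(product, now = new Date()) {
  const cachedAt = product.brandCache?.cachedAt;
  // A missing or malformed timestamp is treated as stale.
  if (!(cachedAt instanceof Date)) return true;
  return now.getTime() - cachedAt.getTime() > MAX_CACHE_AGE_MS;
}
```

The staleness check runs on read, so a stale cache costs one extra fetch by `brandId` for that product rather than a fan-out write on every brand change.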
45+
46+
**When NOT to use this pattern:**
47+
48+
- **Data changes frequently and independently**: If brand logos change daily, denormalization creates update overhead.
49+
- **Rarely-accessed data**: Don't embed review details if only a small fraction of product views load reviews.
50+
- **Many-to-many with high cardinality**: Avoid embedding large or fast-growing relationship sets.
51+
- **Analytics queries**: Batch jobs can afford $lookup latency; real-time queries cannot.
52+
53+
## Verify with
54+
55+
```javascript
// Find slow aggregations that contain a $lookup stage
db.setProfilingLevel(1, { slowms: 50 }) // Disable afterwards with db.setProfilingLevel(0)
db.system.profile.find({
  "command.aggregate": { $exists: true },
  "command.pipeline.$lookup": {
    $exists: true
  }
}).sort({ millis: -1 })

// Check if $lookup foreign fields are indexed (run on the foreign collection)
db.reviews.aggregate([
  { $indexStats: {} }
])
// Look for index supporting the query in result

// Measure $lookup impact
db.products.aggregate([
  { $match: { category: "electronics" } },
  { $lookup: { from: "brands", localField: "brandId", foreignField: "_id", as: "brand" } }
]).explain("executionStats")
// Check totalDocsExamined in $lookup stage
```
78+
79+
Atlas Schema Suggestions flags: "Reduce $lookup operations"
80+
81+
Reference: [Reduce Lookup Operations](https://mongodb.com/docs/manual/data-modeling/design-antipatterns/reduce-lookup-operations/)
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
1+
---
title: Reduce Unnecessary Collections
impact: CRITICAL
impactDescription: "Reduces avoidable joins when related data is repeatedly queried together"
tags: schema, collections, anti-pattern, embedding, normalization, atlas-suggestion
---
7+
8+
## Reduce Unnecessary Collections
9+
10+
**Collection count alone is not the anti-pattern.** The anti-pattern is using collections as a substitute for indexes — creating one collection per category, time period, or partition key instead of indexing a single collection. Every collection carries a default `_id` index that consumes storage and strains the replica set, and cross-collection queries require `$lookup` or `$unionWith`, adding complexity and overhead.
11+
12+
**Incorrect (one collection per day as partitioning strategy):**
13+
14+
Creating one collection per time period (e.g. `temperatures_2024_05_10`, `temperatures_2024_05_11`, …) means each collection carries its own default `_id` index (365 collections/year = 365 extra indexes), cross-day queries require `$unionWith` across many collections, schema validation / indexes / TTL must be duplicated on every collection, and application code must dynamically resolve the collection name for each query.
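
To make the cost concrete, here is a sketch of the extra application code the per-day layout forces. The helper names are hypothetical; the collection naming scheme follows the `temperatures_2024_05_10` example above.

```javascript
// Anti-pattern: the application must derive a collection name for every query.
function collectionNameFor(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  const d = String(date.getUTCDate()).padStart(2, "0");
  return `temperatures_${y}_${m}_${d}`;
}

// Cross-day reads need one $unionWith stage per additional day.
// The first day is the collection the aggregation starts from.
function crossDayPipeline(dates) {
  const [, ...rest] = dates;
  return rest.map((date) => ({ $unionWith: { coll: collectionNameFor(date) } }));
}

// mongosh (illustrative):
//   db.getCollection(collectionNameFor(dates[0])).aggregate(crossDayPipeline(dates))
```

A one-week query already needs six `$unionWith` stages, whereas the single-collection design answers the same question with one range predicate on `timestamp`.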
15+
16+
**Correct (single collection with an index):**
17+
18+
```javascript
// All readings in one collection: the index does the partitioning work
{ _id: ObjectId(), timestamp: ISODate("2024-05-10T10:00:00Z"), temperature: 60 }
{ _id: ObjectId(), timestamp: ISODate("2024-05-10T11:00:00Z"), temperature: 61 }
{ _id: ObjectId(), timestamp: ISODate("2024-05-11T10:00:00Z"), temperature: 68 }

db.temperatures.createIndex({ timestamp: 1 })

// Efficient range query: one collection, one index
db.temperatures.find({
  timestamp: { $gte: ISODate("2024-05-10"), $lt: ISODate("2024-05-11") }
})

// Optional TTL for automatic expiry (e.g. 90 days).
// Create this INSTEAD of the plain index above: an index on the same key with
// different options conflicts with it. The TTL index also serves range queries.
db.temperatures.createIndex({ timestamp: 1 }, { expireAfterSeconds: 7776000 })
```
34+
35+
**Even better (bucket pattern or time series collection):**
36+
37+
For high-volume time-stamped data, group readings into buckets or use a native time series collection, which is optimized for this workload:
38+
39+
```javascript
// Bucket pattern: one document per day
{
  _id: ISODate("2024-05-10T00:00:00Z"),
  readings: [
    { timestamp: ISODate("2024-05-10T10:00:00Z"), temperature: 60 },
    { timestamp: ISODate("2024-05-10T11:00:00Z"), temperature: 61 },
    { timestamp: ISODate("2024-05-10T12:00:00Z"), temperature: 64 }
  ]
}

// In this particular case, a native time series collection
// is also a good option to consider
db.createCollection("temperatures", {
  timeseries: { timeField: "timestamp", granularity: "hours" }
})
```
56+
57+
**When to use separate collections:**
58+
59+
| Scenario | Separate Collection | Why |
|----------|--------------------|----|
| Data accessed independently | Yes | Different query patterns |
| Unbounded relationships | Yes | Prevents document growth |
| Many-to-many | Yes | Students ↔ Courses |
| 1:1 always together | No (embed) | User and profile |
65+
66+
**When NOT to use this pattern:**
67+
68+
- **Data is genuinely independent**: Products exist separately from orders; don't embed full product catalog in every order.
69+
- **Frequent independent updates**: A change to a customer's email should not require rewriting every historical order.
70+
- **Data is accessed in different contexts**: Same address entity used for shipping, billing, user profile—keep it separate.
71+
- **Regulatory requirements**: Some industries require normalized data for audit trails.
72+
73+
## Verify with
74+
75+
```javascript
// Count your collections
for (const d of db.adminCommand({ listDatabases: 1 }).databases) {
  const colls = db.getSiblingDB(d.name).getCollectionNames().length
  print(`${d.name}: ${colls} collections`)
}
// Count alone is not sufficient: combine with access and index/storage evidence

// Check if collections are always accessed together
// (reads system.profile, so the profiler must be enabled first)
// If orders always needs customer, items, addresses
// → they should be embedded
db.system.profile.aggregate([
  { $match: { op: "query" } },
  { $group: { _id: "$ns", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
// Collections with similar access patterns should be combined
```
93+
94+
Atlas Schema Suggestions flags: "Reduce number of collections"
95+
96+
Reference: [Reduce the Number of Collections](https://mongodb.com/docs/manual/data-modeling/design-antipatterns/reduce-collections/)
