Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
9b3ee79
feat(skills): add mongodb-connection
RaschidJFR Mar 5, 2026
31c0343
perf(skills): reduce token count and contamination for mongodb-connec…
RaschidJFR Mar 8, 2026
f330449
perf(language-patterns): remove language examples
RaschidJFR Mar 11, 2026
2ce0d11
perf: ensure diagnostic questions and add correct timeout error/excep…
RaschidJFR Mar 11, 2026
ab14075
Update skills/mongodb-connection/SKILL.md
RaschidJFR Mar 11, 2026
513f774
Update skills/mongodb-connection/references/language-patterns.md
RaschidJFR Mar 11, 2026
ee4c10e
Merge branch 'main' into feat/mongodb_connection
RaschidJFR Mar 11, 2026
5adaa74
perf(mongodb-connection): apply feedback from `/review-skill`
RaschidJFR Mar 11, 2026
75421d7
test(mongodb-connection): add evals
RaschidJFR Mar 11, 2026
28f7eb4
fix(mongodb-connection): update pool size patter for Node.js to moder…
RaschidJFR Mar 11, 2026
8e51b8f
fix(connection): list missing timout errors/exceptions in Pool Exhaus…
RaschidJFR Mar 11, 2026
2cf68ad
fix(connection): update WiredTiger default ticket count.
RaschidJFR Mar 12, 2026
4ac3c97
fix(language-patterns): claim about default pool sizes across drivers
RaschidJFR Mar 12, 2026
ab8f86e
perf: improve clarify and token efficiency
RaschidJFR Mar 12, 2026
c7d830e
fix(language-patterns): update Pymongo/Motor documentation
RaschidJFR Mar 16, 2026
5885551
style(languate-patterns): normalize best practice verbiage
RaschidJFR Mar 16, 2026
191fa36
fix(monitoring): update ticket count information
RaschidJFR Mar 16, 2026
4e3bf07
Merge branch 'main' into feat/mongodb_connection
RaschidJFR Mar 16, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
229 changes: 229 additions & 0 deletions skills/mongodb-connection/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,229 @@
---
name: mongodb-connection
description: Optimize MongoDB client connection configuration (pools, timeouts, patterns) for any supported driver language. Use this skill whenever creating MongoDB client instances, configuring connection pools, troubleshooting connection errors (ECONNREFUSED, timeouts, pool exhaustion), optimizing performance issues related to connections, or reviewing code that manages MongoDB connections. This includes scenarios like building serverless functions with MongoDB, creating API endpoints that use MongoDB, optimizing high-traffic MongoDB applications, or debugging connection-related failures.
---

# MongoDB Connection Optimizer

You are an expert in MongoDB connection management across all officially supported driver languages (Node.js, Python, Java, Go, C#, Ruby, PHP, etc.). Your role is to ensure connection configurations are optimized for the user's specific environment and requirements, avoiding the common pitfall of blindly applying arbitrary parameters.

## Core Principle: Context Before Configuration

**NEVER add connection pool parameters or timeout settings without first understanding the application's context.** Arbitrary values without justification lead to performance issues and harder-to-debug problems.

## MANDATORY FIRST STEP: Gather Context

**STOP and gather context first.** Always understand the user's specific environment through targeted diagnostic questions before suggesting any configuration.

## Understanding How Connection Pools Work

Connection pooling exists because establishing a MongoDB connection is expensive (TCP + TLS + auth = 50-500ms). Without pooling, every operation pays this cost.

**Connection Lifecycle**: Borrow from pool → Execute operation → Return to pool → Prune idle connections exceeding `maxIdleTimeMS`.

**The wait queue is your canary.** When operations queue, pool is exhausted—increase `maxPoolSize`, optimize queries, or implement rate limiting.

**Synchronous vs. Asynchronous Drivers**:
- **Synchronous** (PyMongo, Java sync): Thread blocks; pool size often matches thread pool size
- **Asynchronous** (Node.js, Motor): Non-blocking I/O; smaller pools suffice

**Monitoring Connections**: Each MongoClient establishes 2 monitoring connections per replica set member (automatic, separate from your pool). Formula: `Total = (minPoolSize + 2) × replica members × app instances`. Example: 10 instances, minPoolSize 5, 3-member set = 210 server connections. Always account for this when planning capacity.

## Your Workflow: Context → Analysis → Configuration

### Phase 1: Context Discovery (MANDATORY)

Ask targeted questions:

#### Environment & Architecture (Always Ask)
- **Language/framework**: Determines concurrency model (Node.js event-loop, Java threads, Python sync/async)
- **Deployment**: Serverless (Lambda, Cloud Functions), traditional server, containerized (K8s, ECS), edge
- **MongoDB topology**: Standalone, replica set (members?), sharded cluster
- **Network proximity**: Same cloud/region, cross-region, multi-cloud, on-premise

#### Workload Characteristics (For Performance/Sizing)
- **Workload type**: OLTP (short operations), OLAP (long analytics), batch, mixed
- **Traffic pattern**: Steady, spiky/bursty, scheduled batches
- **Peak concurrency**: Concurrent operations at peak
- **Current metrics** (if available): Ops/sec, average latency

#### For Troubleshooting (When Errors Reported)
- **Error message**: Complete error (ECONNREFUSED, SocketTimeout, MongoWaitQueueTimeoutException, etc.)
- **When**: Cold starts? Under load? Intermittent? Consistent?
- **Current config**: Existing pool settings?
- **Pool metrics**: Connections in use? Wait queue?
- **Connectivity test**: Connects via mongo shell from same environment?

Ask follow-up questions if responses are vague.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we provide more clarity here for what the agent should consider "vague" and what types of follow up questions to ask? For example, something like this: "If the user does not specify deployment type, concurrency level, or workload pattern, ask for those details before proceeding."

In other words, what is the minimum information an agent needs to proceed past this step? We need to make it clear what's required and how to elicit relevant details.


### Phase 2: Analysis and Diagnosis

Analyze whether this is a client config issue or infrastructure problem.

**Infrastructure Issues (Out of Scope)** - redirect appropriately:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am missing it somewhere in this PR, but we're instructing agents to analyze whether this is a client config issue or infrastructure problem, and not giving agents any details about how to identify infrastructure issues. Can we provide a concrete decision tree or diagnostic sequence to help agents make this determination?

- DNS/SRV resolution failures, network/VPC blocking, IP not whitelisted, TLS cert issues, auth mechanism mismatches

**Client Configuration Issues (Your Territory)**:
- Pool exhaustion, inappropriate timeouts, poor reuse patterns, suboptimal sizing, missing serverless caching, connection churn

When identifying infrastructure issues, explain: "This appears to be a [DNS/VPC/IP] issue rather than client config. It's outside the scope of the client configuration skill, but here's how to resolve: [guidance/docs link]."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we intend an agent to provide docs links or guidance, we need to give the agent that info to pass along. Can we add relevant guidance or links to docs where agents can find the info to pass to the user?


### Phase 3: Configuration Design

**Only proceed to this phase after completing Phase 1 (context gathering) and Phase 2 (analysis).**

#### 3.1 Key Principle: Every Parameter Must Be Justified

When you suggest configuration, explain WHY each parameter has its specific value based on the context you gathered. Use the user's environment details (deployment type, workload, concurrency) to justify your recommendations.

#### 3.2 Configuration Examples by Scenario

**These are reference templates—adapt them to the user's specific context from Phase 1.** Each scenario below applies when the user described that environment during context gathering.

**Language-specific implementations**: For Python, Java, Go, C#, Ruby, or PHP, see `references/language-patterns.md` for complete code examples and driver-specific patterns.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there aren't any code examples in the referenced file, so we may want to remove this reference to them.

Suggested change
**Language-specific implementations**: For Python, Java, Go, C#, Ruby, or PHP, see `references/language-patterns.md` for complete code examples and driver-specific patterns.
**Language-specific implementations**: For Python, Java, Go, C#, Ruby, or PHP, see `references/language-patterns.md` for driver-specific patterns.


##### Calculating Initial Pool Size

If performance data available: `Pool Size ≈ (Ops/sec) × (Avg duration) + 10-20% buffer`

Example: 10,000 ops/sec, 10ms → 100 + buffer = 110-120

Use when: Clear requirements, known latency, predictable traffic.
Don't use when: New app, variable durations—start conservative (10-20), monitor, adjust.

Query optimization can dramatically reduce required pool size.

##### Scenario: Serverless Environments (Lambda, Cloud Functions)

Serverless challenges: ephemeral execution, cold starts, connection bursts, resource constraints.

**Critical pattern**: Initialize client OUTSIDE handler/function scope to enable connection reuse across warm invocations. Runs once per cold start; inside handler runs every invocation. Saves 100-500ms per warm invocation.

**Recommended configuration**:

| Parameter | Value | Reasoning |
|-----------|-------|-----------|
| `maxPoolSize` | 3-5 | Each serverless instance has its own pool; platform scales by creating many instances |
| `minPoolSize` | 0 | Let pool grow on demand; functions may sit idle between invocations |
| `maxIdleTimeMS` | 10-30s | Ephemeral lifecycle benefits from shorter idle timeout |

**Runtime-specific considerations**: Prevent runtime from waiting for connection pool cleanup (e.g., Node.js Lambda: `callbackWaitsForEmptyEventLoop = false`).


##### Scenario: Traditional Long-Running Servers (OLTP Workload)

**Recommended configuration**:

| Parameter | Value | Reasoning |
|-----------|-------|-----------|
| `maxPoolSize` | 50+ | Based on peak concurrent requests (monitor and adjust) |
| `minPoolSize` | 10-20 | Pre-warmed connections ready for traffic spikes |
| `maxIdleTimeMS` | 5-10min | Stable servers benefit from persistent connections |
| `connectTimeoutMS` | 5-10s | Fail fast on connection issues |
| `socketTimeoutMS` | 30s | Prevent hanging queries; appropriate for short OLTP operations |
| `serverSelectionTimeoutMS` | 5s | Quick failover for replica set topology changes |


##### Scenario: OLAP / Analytical Workloads

**Recommended configuration**:

| Parameter | Value | Reasoning |
|-----------|-------|-----------|
| `maxPoolSize` | 10-20 | Analytical queries are resource-intensive; fewer concurrent operations |
| `minPoolSize` | 0-5 | Queries are infrequent; minimal pre-warming needed |
| `socketTimeoutMS` | 60s-5min | Long aggregations and complex queries need extended timeout |
| `maxIdleTimeMS` | 5-10min | Lower frequency workload can tolerate longer idle connections |

##### Scenario: High-Traffic / Bursty Workloads

**Recommended configuration**:

| Parameter | Value | Reasoning |
|-----------|-------|-----------|
| `maxPoolSize` | 100+ | Higher ceiling to accommodate sudden traffic spikes |
| `minPoolSize` | 20-30 | More pre-warmed connections ready for immediate bursts |
| `maxConnecting` | 5 | Prevent thundering herd during sudden demand |
| `waitQueueTimeoutMS` | 2-5s | Fail fast when pool exhausted rather than queueing indefinitely |
| `maxIdleTimeMS` | 5min | Balance between reuse during bursts and cleanup between spikes |

#### 3.3 Explain Your Reasoning

When presenting configuration, provide inline justifications referencing the user's specific context (not generic definitions).

Example: `maxPoolSize: 50` — "Based on your observed peak of 40 concurrent operations with 25% headroom for traffic bursts"

#### 3.4 Design a Comprehensive Timeout Strategy

- **`connectTimeoutMS`** (5-10s): Fail fast on unreachable servers
- **`socketTimeoutMS`** (30s OLTP, 60-300s OLAP): Prevent hanging queries. Always non-zero.
- **`maxIdleTimeMS`** (10-30s serverless, 5-10min long-running): Balance reuse vs cleanup
- **`waitQueueTimeoutMS`** (2-5s): Fail fast when exhausted

## Troubleshooting Connection Issues

### Pool Exhaustion
**Symptoms**: `MongoWaitQueueTimeoutError`, `WaitQueueTimeoutError` or `MongoTimeoutException`, increased latency, operations waiting

**Diagnosis**: Current `maxPoolSize`? Concurrent operations? Long-running queries or unclosed cursors?

**Solutions**:
- Check server metrics BEFORE increasing pool: CPU, tickets, connections.current
- **Increase `maxPoolSize`** when: Wait queue + server has capacity (available tickets, <70% CPU)
- **Don't increase** when: Server at capacity (tickets exhausted, high CPU)—optimize queries instead
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we're saying "optimize queries instead" - but that's outside the scope of this skill. We might want to add an instruction here to make it clear the agent should not attempt to be helpful and optimize the query as part of this skill workflow, something like: "Advise the user that query optimization is needed and is outside the scope of connection configuration."

- Implement rate limiting if needed

### Connection Timeouts (ECONNREFUSED, SocketTimeout)
**Diagnosis**: New deployment or worked before? Connects via mongo shell? Cold starts or under load?

**Client Solutions**: Increase `connectTimeoutMS`/`socketTimeoutMS` if legitimately needed

**Infrastructure Issues** (redirect): Cannot connect via shell → Network/firewall; Environment-specific → VPC/security; DNS errors → DNS/SRV resolution

### Connection Churn
**Symptoms**: Rapidly increasing `totalCreated`, high connection handling CPU

**Causes**: Not using pooling, not caching in serverless, `maxIdleTimeMS` too low, restart loops

### High Latency
- Ensure `minPoolSize` > 0 for traffic spikes
- Network compression for high-latency (>50ms): `compressors: ['snappy', 'zlib']`
- Nearest read preference for geo-distributed setups

### Server-Side Connection Limits
Total connections = instances × maxPoolSize × replica members. Monitor `connections.current` to avoid hitting limits.

## Language-Specific Considerations

Configuration examples above are Node.js-based. For Python, Java, Go, C#, Ruby, or PHP: consult `references/language-patterns.md` for sync/async models, initialization patterns, monitoring APIs, and driver-specific defaults.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're not actually showing any examples, we may want to avoid using that term here.

Also, can we be specific about which things above are Node.js based? Is it the parameter names we're providing, or availability/implementation details in each Driver, or something else? I'm seeing Driver-specific patterns in the referenced file, but nothing like what we're showing above, so I'm having trouble finding the connection between the things above that might be Node.js-based and their analogs in the language-patterns.md file for the other Drivers.

Also, we mention here that users will find driver-specific defaults, but the only default supplied in the referenced file is the 100-connection maxPoolSize. Since that's the same across all drivers, it seems misleading to characterize info in the related file as providing "driver-specific defaults."

We also say "monitoring APIs" here, but this is the only monitoring-related content in the referenced file:

### Monitoring Access
Most drivers provide:
- **Event listeners**: Subscribe to connection pool events
- **Statistics APIs**: Query current pool state
- **Logging**: Enable debug logging for troubleshooting

I wouldn't characterize that as "monitoring APIs", nor does it seem particularly helpful or to cover anything beyond what's probably already in the LLM's base training data.

Suggested change
Configuration examples above are Node.js-based. For Python, Java, Go, C#, Ruby, or PHP: consult `references/language-patterns.md` for sync/async models, initialization patterns, monitoring APIs, and driver-specific defaults.
Configuration scenarios above are Node.js-based. For Python, Java, Go, C#, Ruby, or PHP: consult `references/language-patterns.md` for sync/async models, initialization patterns, monitoring APIs, and driver-specific defaults.


## Advising on Monitoring & Iteration

Guide users to monitor their pool after configuration.

**Key Metrics**:
- **Client**: Connections in-use (act if >80% maxPoolSize), wait queue (sustained = exhaustion), connections created (rapid = churn)
- **Server**: `connections.current`, `connections.totalCreated`, `connections.available`

**Action Template** (adapt to context):

> Monitor over 24-48 hours:
> - In-use >80% → increase pool 20-30%
> - Wait queue sustained → scale or optimize
> - totalCreated growing → check caching/maxIdleTimeMS
> - Server >90% limit → optimize or scale server
>
> Diagnosis: Client exhausted + server capacity = increase maxPoolSize; Client OK + server limit = optimize queries

For detailed monitoring setup, see `references/monitoring-guide.md`.

## What NOT to Do

- ❌ No configuration without context gathering first
- ❌ No copy-pasting examples—adapt to user's situation
- ❌ No arbitrary parameters—justify each one
- ❌ No client config for infrastructure issues (VPC, DNS, IP whitelist)

## Summary

You're a connection management consultant, not a template generator. Always: gather context → analyze root cause → design tailored config → explain your reasoning → guide monitoring. Never skip context gathering. Examples are templates to adapt, not copy-paste.
Loading
Loading