feat(mcp): migrate to modern HTTP transport with clean architecture refactor #1525
…gration

BREAKING CHANGE: MCP tools don't work with JWT authentication enabled

Fixes MCP server completely broken since crawl4ai:0.6.0rc1-r2 (issue unclecode#1316, 2+ months old). Complete restructuring from the deprecated SSE transport to a modern HTTP-based MCP architecture.

## Problem Solved
- MCP server has been non-functional for 2+ months due to SSE transport failures
- Users experiencing "Unexpected message" errors and connection failures
- Only workaround was downgrading to v0.6.0rc1-r2
- Issue unclecode#1316: unclecode#1316

## Breaking Changes
- MCP tools don't forward JWT Authorization headers when security.jwt_enabled=true
- Internal proxy calls will fail authentication if JWT is enabled
- Workaround: disable JWT authentication or implement header forwarding for MCP usage
- All regular API endpoints maintain existing auth requirements

## Business Value & User Impact
- **Restores MCP functionality**: 7 working tools after 2+ months of failures
- Migrated from deprecated SSE to stable FastMCP HTTP transport
- Production-ready error handling with distributed request tracing
- Full backward compatibility maintained for existing API consumers
- Updated playground to support new parameter names (filter/query/cache vs f/q/c)

## Architectural Improvements
- Clean separation of concerns: app/, core/, middleware/ modular structure
- Service-oriented architecture replacing the monolithic mcp_bridge.py (252 lines)
- New MCPService class (436 lines) with structured tool registration
- Centralized error handling with correlation IDs and structured responses
- Factory-like pattern for tool registration, replacing runtime route introspection

## Technical Implementation
- **SSE → HTTP migration**: replaced unstable sse-starlette with FastMCP 2.12.3
- Unified error context system (ErrorContext, ErrorHandlingMiddleware)
- Comprehensive result handler consolidating multiple protocol implementations
- Enhanced schema definitions with Pydantic v2 compatibility
- Docker configuration updates with production deployment optimizations

## Quality & Testing
- Comprehensive test suite: 1,507 lines across 6 new test files
- End-to-end API testing with backward-compatibility validation
- MCP protocol integration testing with the fastmcp client
- Error-scenario coverage with proper test isolation
- TestClient integration for efficient in-process testing

Closes unclecode#1316

Major components: MCP service layer, error handling system, test infrastructure
Known limitation: JWT auth incompatible with MCP tools (documented in code)

🧙 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
- Added `pytest-xdist` for parallel test execution.
- Added `httpx` as a test client for improved HTTP testing capabilities.
- Updated `pytest` and `pytest-cov` to their latest versions for enhanced testing features.

These updates aim to improve the testing framework and ensure better performance during development.
…rawlerRunConfig

- Added a validation check to ensure that the `wait_for` parameter is a string (a CSS selector or a JavaScript expression), raising a `ValueError` if it is not.
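The check can be sketched like this (illustrative; the real `CrawlerRunConfig` defines this guard inside a much larger `__setattr__` alongside many other fields):

```python
# Illustrative sketch of the 'wait_for' type guard described above.
class CrawlerRunConfig:
    def __setattr__(self, name, value):
        # Reject non-string values for wait_for; None remains allowed.
        if name == "wait_for" and value is not None and not isinstance(value, str):
            raise ValueError(
                "'wait_for' must be a string (e.g., a CSS selector or JS expression)."
            )
        super().__setattr__(name, value)

cfg = CrawlerRunConfig()
cfg.wait_for = "css:.article-body"   # a CSS selector string is accepted
try:
    cfg.wait_for = 42                # a non-string is rejected
except ValueError as err:
    print(err)
```

Failing fast here turns a confusing downstream crawler error into an immediate, descriptive exception at configuration time.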
The `/mcp/schema` endpoint was producing a regressed schema that was missing parameter descriptions, default values, and validation constraints. For some tools, entire parameters were absent. The root cause was a custom schema generation mechanism that relied on Python's `inspect` module to introspect endpoint function signatures. This approach was unable to access the rich metadata defined in the Pydantic models used by FastAPI.

This commit replaces the manual introspection logic with a new implementation that uses the application's auto-generated OpenAPI schema as the single source of truth. The new approach:

1. Calls `app.openapi()` to get the complete and authoritative API schema.
2. Parses this schema to identify MCP tools and their parameters, correctly handling both Pydantic request bodies and query parameters.
3. Preserves the intentional "flattening" of request body models into keyword arguments for the tool signature.
4. Moves the schema generation logic from `server.py` into the `MCPService` class to improve separation of concerns and code organization.

This resolves the schema regressions and ensures that all validation rules and descriptions from the Pydantic models are accurately reflected in the MCP tool definitions.
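The flattening step can be sketched as follows (the helper and the sample OpenAPI document are illustrative, not the actual `MCPService` code):

```python
# Sketch: treat the OpenAPI document as the single source of truth and flatten
# request-body properties into keyword-style tool parameters.
def extract_tool_params(openapi: dict, path: str, method: str = "post") -> dict:
    op = openapi["paths"][path][method]
    params = {}
    # Query/path parameters keep their declared schemas.
    for p in op.get("parameters", []):
        params[p["name"]] = p.get("schema", {})
    # Pydantic request bodies are flattened into keyword arguments.
    body = op.get("requestBody", {})
    schema = body.get("content", {}).get("application/json", {}).get("schema", {})
    if "$ref" in schema:  # resolve "#/components/schemas/<Name>"
        name = schema["$ref"].rsplit("/", 1)[-1]
        schema = openapi["components"]["schemas"][name]
    for prop, spec in schema.get("properties", {}).items():
        params[prop] = spec  # descriptions/defaults/constraints survive intact
    return params

# Hand-written sample document standing in for app.openapi():
sample = {
    "paths": {"/md": {"post": {"requestBody": {"content": {"application/json": {
        "schema": {"$ref": "#/components/schemas/MarkdownRequest"}}}}}}},
    "components": {"schemas": {"MarkdownRequest": {"properties": {
        "url": {"type": "string", "description": "Target URL"},
        "filter": {"type": "string", "default": "fit"},
    }}}},
}
print(extract_tool_params(sample, "/md"))
```

Because the schemas come straight from the OpenAPI document, Pydantic metadata such as `description` and `default` is preserved without any signature introspection.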
…rements

- Removed `mcp>=1.6.0` from `deploy/docker/requirements.txt` as it is no longer needed.
- We now use FastMCP instead of the MCP SDK.

This change helps streamline the Docker environment by eliminating unnecessary dependencies.
…re datetime.now(UTC)

- Update imports to include UTC from the datetime module
- Replace datetime.utcnow() calls with datetime.now(UTC) in api.py and error_context.py
- Use a lambda wrapper for default_factory in the dataclass field

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
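The pattern being applied, as a runnable sketch (`timezone.utc` is aliased here so the snippet also runs on Python < 3.11, where `datetime.UTC` is unavailable; `TimedRecord` is an illustrative stand-in for the patched dataclasses):

```python
# datetime.utcnow() is deprecated: it returns a *naive* datetime.
# datetime.now(UTC) returns a timezone-aware one instead.
from dataclasses import dataclass, field
from datetime import datetime, timezone

UTC = timezone.utc  # Python 3.11+ also exposes this as datetime.UTC

aware = datetime.now(UTC)
assert aware.tzinfo is not None  # aware, unlike utcnow()'s result

# For a dataclass default_factory, wrap the call in a lambda so the timestamp
# is evaluated per instance, not once at class-definition time:
@dataclass
class TimedRecord:
    created_at: datetime = field(default_factory=lambda: datetime.now(UTC))
```

Aware timestamps compare and serialize unambiguously, which matters for the correlation-ID/tracing fields this PR adds.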
**Walkthrough**

The PR restructures Docker deployment code: introduces app/core/middleware modules, centralized error handling, standardized schemas, and MCP integration via FastMCP; replaces the SSE/WebSocket MCP bridge; updates endpoints (including /md compatibility) and output-file handling; adjusts configs and docs; adds extensive tests; modifies requirements and Dockerfile copy behavior.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant FastAPI as FastAPI App
    participant MW as ErrorHandlingMiddleware
    participant Handler as Endpoint Handler
    participant EC as ErrorContext
    Client->>FastAPI: HTTP Request
    FastAPI->>MW: dispatch(request)
    MW->>Handler: call_next(request)
    Handler-->>MW: Response (success)
    MW->>FastAPI: Attach X-Correlation-ID header
    FastAPI-->>Client: HTTP 2xx with body
    rect rgba(255,220,220,0.3)
    note right of Handler: On exception
    Handler--x MW: raise Exception
    MW->>EC: ErrorContext.from_exception(e, "METHOD PATH")
    EC-->>MW: context
    MW-->>Client: HTTP error with ErrorResponse + X-Correlation-ID
    end
```
```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant API as /crawl (app.api)
    participant Utils as _ensure_within_base_dir
    participant RH as ResultHandler
    participant FS as Filesystem
    Client->>API: POST /crawl {urls, output_path?}
    alt output_path provided
        API->>Utils: validate(output_path, base_dir)
        Utils-->>API: ok or raise HTTP 400
    end
    API->>RH: normalize_results(results)
    RH-->>API: normalized list
    alt output_path provided
        API->>FS: write JSON to file
        FS-->>API: path
    end
    API-->>Client: 200 {success, results, path?}
```
```mermaid
sequenceDiagram
    autonumber
    actor Tooling as MCP Client
    participant MCP as MCPService (HTTP)
    participant Routes as FastAPI Routes
    participant Target as Target Endpoint
    Tooling->>MCP: list_tools()
    MCP-->>Tooling: tools schema
    Tooling->>MCP: call_tool(name, args)
    MCP->>Routes: HTTP proxy (compose path/query/body)
    Routes->>Target: invoke endpoint
    Target-->>Routes: response / error
    Routes-->>MCP: HTTP response
    MCP-->>Tooling: tool result or error
```
**Estimated code review effort**: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 12
**Caution**: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
Dockerfile (2)
**182-190**: Healthcheck will fail: missing tools (`free`, `redis-cli`).

`free` (procps) isn't in python:3.12-slim. `redis-cli` is typically in `redis-tools` (not installed). Result: healthcheck errors even when the app is healthy. Fix by installing required tools and/or simplifying the healthcheck:

```diff
@@
 RUN apt-get update && apt-get install -y --no-install-recommends \
     build-essential \
     curl \
+    procps \
     wget \
     gnupg \
     git \
@@
-    redis-server \
+    redis-server redis-tools \
     supervisor \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
@@
-HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
-    CMD bash -c '\
-    MEM=$(free -m | awk "/^Mem:/{print \$2}"); \
-    if [ $MEM -lt 2048 ]; then \
-        echo "⚠️ Warning: Less than 2GB RAM available! Your container might need a memory boost! 🚀"; \
-        exit 1; \
-    fi && \
-    redis-cli ping > /dev/null && \
-    curl -f http://localhost:11235/health || exit 1'
+HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+    CMD bash -c '\
+    awk "/MemTotal/ { if (\$2 < 2097152) exit 1 }" /proc/meminfo && \
+    (command -v redis-cli >/dev/null && redis-cli ping >/dev/null || true) && \
+    curl -f http://localhost:11235/health || exit 1'
```
**110-121**: Remove redundant install.sh and consolidate installation

The Dockerfile installs the project once in the `INSTALL_TYPE` block and again via `/tmp/install.sh`. Drop the `echo … > /tmp/install.sh` block and incorporate the `USE_LOCAL`/GitHub logic into the existing `pip install "/tmp/project"` step to avoid double-installing and bloated layers.

pyproject.toml (1)
**16-18**: Fix aiohttp version constraint

The version `3.11.11` isn't published on PyPI (latest stable is `3.9.5`). Update it to a valid release, for example: `"aiohttp>=3.9.5,<4.0"`

File: pyproject.toml, lines 16–18.
deploy/docker/app/crawler_pool.py (1)
**22-45**: Prevent UnboundLocalError in error cleanup

If `_sig(cfg)` (or anything before it) raises, `sig` is never assigned, yet the `finally` block still references it. That produces an `UnboundLocalError`, masking the original exception and breaking error reporting. Please initialize and guard `sig` so cleanup only runs when it has a value.

```diff
 async def get_crawler(cfg: BrowserConfig) -> AsyncWebCrawler:
-    try:
-        sig = _sig(cfg)
+    sig: str | None = None
+    try:
+        sig = _sig(cfg)
         async with LOCK:
             if sig in POOL:
                 LAST_USED[sig] = time.time(); return POOL[sig]
@@
-    finally:
-        if sig in POOL:
+    finally:
+        if sig and sig in POOL:
             LAST_USED[sig] = time.time()
-        else:
+        elif sig:
             # If we failed to start the browser, we should remove it from the pool
             POOL.pop(sig, None)
             LAST_USED.pop(sig, None)
```

docs/md_v2/core/docker-deployment.md (1)
**286-301**: MCP transport/docs are stale (SSE/WebSocket); switch to HTTP

This doc still lists `/mcp/sse` and `/mcp/ws` and shows `claude mcp add --transport sse`. The implementation uses FastMCP HTTP at `/mcp`. Update both the endpoint list and the Claude command.

```diff
-The Crawl4AI server exposes two MCP endpoints:
-
-- **Server-Sent Events (SSE)**: `http://localhost:11235/mcp/sse`
-- **WebSocket**: `ws://localhost:11235/mcp/ws`
+The Crawl4AI server exposes a FastMCP HTTP endpoint:
+
+- **HTTP**: `http://localhost:11235/mcp`
@@
-claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
+claude mcp add --transport http c4ai-http http://localhost:11235/mcp
```

Also applies to: 296-301
deploy/docker/app/api.py (1)
**599-607**: Don't wrap HTTPException.detail in a JSON string (breaks structured error handling).

Error middleware expects detail as a dict (or use ErrorContext). Passing a JSON string degrades error typing.

```diff
-        raise HTTPException(
-            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=json_module.dumps({  # Send structured error
-                "error": str(e),
-                "server_memory_delta_mb": mem_delta_mb,
-                "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0)
-            })
-        )
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail={
+                "error": str(e),
+                "server_memory_delta_mb": mem_delta_mb,
+                "server_peak_memory_mb": max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb_error or 0),
+            },
+        )
```
🧹 Nitpick comments (58)
crawl4ai/async_configs.py (3)
**617-622**: Improve null safety with explicit type checking.

The current implementation handles `data is None` but doesn't explicitly validate the type of `config` after deserialization. Consider adding more defensive checks.

```diff
 @staticmethod
 def load(data: dict) -> "BrowserConfig":
     if data is None:
         return BrowserConfig()
     config = from_serializable_dict(data)
     if isinstance(config, BrowserConfig):
         return config
+    if not isinstance(config, (dict, type(None))):
+        # Log warning about unexpected type but continue gracefully
+        config = {}
     return BrowserConfig.from_kwargs(config if config is not None else {})
```
**1430-1432**: Move validation message to a class constant.

The static analysis tool correctly identifies that long exception messages should be defined as class constants rather than inline strings for better maintainability.

```diff
+class CrawlerRunConfig():
+    _WAIT_FOR_ERROR_MSG = "'wait_for' must be a string (e.g., a CSS selector or JS expression)."
+
     _UNWANTED_PROPS = {
         'disable_cache' : 'Instead, use cache_mode=CacheMode.DISABLED',
         # ... rest of the dictionary
     }

     def __setattr__(self, name, value):
         """Handle attribute setting."""
         # TODO: Planning to set properties dynamically based on the __init__ signature
         sig = inspect.signature(self.__init__)
         all_params = sig.parameters  # Dictionary of parameter names and their details
-        if name == 'wait_for' and value is not None and not isinstance(value, str):
-            raise ValueError("'wait_for' must be a string (e.g., a CSS selector or JS expression).")
+        if name == 'wait_for' and value is not None and not isinstance(value, str):
+            raise ValueError(self._WAIT_FOR_ERROR_MSG)
```
**1556-1561**: Improve null safety with explicit type checking.

Similar to `BrowserConfig.load()`, this method should add more defensive checks for the deserialized config type.

```diff
 @staticmethod
 def load(data: dict) -> "CrawlerRunConfig":
     if data is None:
         return CrawlerRunConfig()
     config = from_serializable_dict(data)
     if isinstance(config, CrawlerRunConfig):
         return config
+    if not isinstance(config, (dict, type(None))):
+        # Log warning about unexpected type but continue gracefully
+        config = {}
     return CrawlerRunConfig.from_kwargs(config if config is not None else {})
```

deploy/docker/core/__init__.py (1)
**1-2**: LGTM: minimal and clear package surface.

Docstring is fine. Consider exporting stable symbols via `__all__` when you add the first public APIs, to keep the core surface intentional.
Dockerfile (3)
**171-175**: Avoid duplicate COPY; static is already included.

`COPY deploy/docker/ ${APP_HOME}/` already brings `static/`. The second COPY is redundant and increases layer size. Apply:

```diff
-# copy the playground + any future static assets
-COPY deploy/docker/static ${APP_HOME}/static
```
**192-199**: Expose app port too.

Image metadata only exposes Redis (6379). Add 11235 for clarity and tooling.

```diff
 EXPOSE 6379
+EXPOSE 11235
```
**176-181**: Ensure export dir exists and is writable.

`binary.default_dir` defaults to `/tmp/crawl4ai-exports` (config.yml). Create it at build time and chown to appuser to avoid runtime write failures.

```diff
 RUN chown -R appuser:appuser ${APP_HOME}
+RUN mkdir -p /tmp/crawl4ai-exports && chown -R appuser:appuser /tmp/crawl4ai-exports
```
deploy/docker/config.yml (2)

**90-93**: Pre-create binary output dir or gate by mode.

With `default_mode: "inline"` you're fine, but when switched to `file`, ensure `/tmp/crawl4ai-exports` exists and is writable by `appuser` (see Dockerfile suggestion). Consider making the dir configurable via env for containerized deployments.

**63-75**: Browser insecure flags: confirm production intent.

`--disable-web-security` and `--ignore-certificate-errors` are risky outside test. Gate them behind an env/config switch (e.g., `browser.insecure: false`) and conditionally append the flags.

pyproject.toml (3)
**39-41**: Duplicate httpx requirement; unify with extras.

You list both `httpx>=0.27.2` and `httpx[http2]>=0.27.2`. Keep one with the extra.

```diff
- "httpx>=0.27.2",
- "httpx[http2]>=0.27.2",
+ "httpx[http2]>=0.27.2",
```

**27-27**: Outdated Requests pin; upgrade for security posture.

`requests~=2.26` is several years old. Recommend `>=2.31,<3` (or the latest stable you've validated).

```diff
- "requests~=2.26",
+ "requests>=2.31,<3",
```

**98-105**: Dev group repeats httpx already in main deps.

Remove the duplicate `httpx` from the `dev` group to avoid confusion.

```diff
- "httpx>=0.27.0",  # For test client
```
docker-compose.yml (1)
**16-21**: Compose healthcheck overrides image healthcheck.

Good. Once the Dockerfile healthcheck is fixed, compose remains simpler and fast. No change needed.
tests/mcp/conftest.py (1)
**308-321**: Capture `request` via the fixture instead of pushing it onto callers

Requiring every test to pass `request` into the factory breaks existing call sites that just do `make_export_path("foo.json")` and forces extra boilerplate. Have the fixture depend on `request` directly and close over it so the outward-facing API stays the same.

```diff
-@pytest.fixture
-def make_export_path():
+@pytest.fixture
+def make_export_path(request):
     """Factory to create unique, per-test export paths under the allowed base dir.

     Ensures xdist-safe paths by segregating by worker id and test nodeid.
     """
     base_dir = Path(os.environ.get("MCP_EXPORT_DIR", "/tmp/crawl4ai-exports"))
     worker = os.environ.get("PYTEST_XDIST_WORKER", "gw0")

-    def factory(request, filename: str) -> str:
+    def factory(filename: str) -> str:
         safe_nodeid = request.node.nodeid.replace(os.sep, "_").replace("::", "_")
         dir_path = base_dir / f"pytest-{worker}"
         dir_path.mkdir(parents=True, exist_ok=True)
         return str(dir_path / f"e2e-{safe_nodeid}-(unknown)")
```

deploy/docker/README.md (2)
**414-433**: Execute JS docs contradict implemented/tested behavior; update to surface return values

README states results "only report success or errors—returned values are not surfaced," but tests expect `js_execution_result.results` with returned values and error objects, and docs/md_v2 already describe `return_values`. Align this section with the new contract.

Apply this diff:

```diff
-Executes JavaScript snippets against a fresh instance of the target page and
-returns the resulting crawl data.
+Executes JavaScript snippets against a fresh instance of the target page and
+returns `js_execution_result` with per-script statuses and `return_values`.
@@
-- `scripts`: List of JavaScript expressions (typically self-invoking
-  functions) that run sequentially in the page context. There is no `page`
-  handle; use DOM APIs such as `document` or `window`.
-- Results only report success or errors—returned values are not surfaced. Run
-  all related snippets in a single call; each request creates and tears down a
-  fresh page.
+- `scripts`: JavaScript expressions (often self-invoking functions) run sequentially in the page context (no `page` handle; use DOM APIs like `document`/`window`).
+- Returned values are surfaced via `js_execution_result.results[i].return_value`. Errors appear as structured objects at the corresponding positions.
+- Each request creates a fresh page; group related snippets into a single call.
```

Also applies to: 427-433
**344-346**: "Testing MCP Connections" still references WebSocket test; prefer HTTP example

This section points to `tests/mcp/test_mcp_socket.py`, but the PR migrates to FastMCP HTTP. Provide an HTTP-based test/command (e.g., `tests/mcp/test_mcp_http_listing.py` or `test_e2e_mcp.py`) here to avoid confusion.

```diff
-# From the repository root
-python tests/mcp/test_mcp_socket.py
+# From the repository root (HTTP transport)
+python tests/mcp/test_mcp_http_listing.py
+# Or run the full MCP E2E (schemas & validation)
+pytest -q tests/mcp/test_e2e_mcp.py
```
**75-95**: Switch .llm.env → .env to match new single-env approach

Align this page with the new `.env` convention used elsewhere (and the shipped `.env.example`).

```diff
-If you plan to use LLMs, create a `.llm.env` file in your working directory:
+If you plan to use LLMs, create a `.env` file in your working directory:
@@
-# Create a .llm.env file with your API keys
-cat > .llm.env << EOL
+# Create a .env file with your API keys
+cat > .env << EOL
@@
-> 🔑 **Note**: Keep your API keys secure! Never commit `.llm.env` to version control.
+> 🔑 **Note**: Keep your API keys secure! Never commit `.env` to version control.
@@
-cp deploy/docker/.llm.env.example .llm.env
+cp .env.example .env
@@
-# Now edit .llm.env and add your API keys
+# Now edit .env and add your API keys
@@
-# Or in your .llm.env file:
-# LLM_PROVIDER=anthropic/claude-3-opus
+# Or in your .env file:
+# LLM_PROVIDER=anthropic/claude-3-opus
```

Also applies to: 146-155, 159-166
**691-691**: Default port should be 11235 to match server and README

Config snippet still shows `app.port: 8020`. Update to `11235` for consistency.

```diff
-  port: 8020  # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
+  port: 11235  # NOTE: Used only when running server.py directly. Gunicorn overrides this (see supervisord.conf).
```
**61-71**: Normalize Docker image version references

docs/md_v2/core/docker-deployment.md (lines 61–71, 127–134): replace the hard-coded `0.7.3` with generic "latest stable" guidance (e.g., `docker pull unclecode/crawl4ai:${TAG:-latest}`) to align with other docs.

tests/mcp/test_backward_compatibility.py (2)
**13-15**: Make test deterministic and CI-friendly (no hard localhost dependency, add timeouts, narrow exceptions)

- Parameterize `BASE_URL` via env and skip if not set, or use FastAPI `TestClient` for in-process tests.
- Add `timeout` to `httpx.AsyncClient`.
- Avoid catching bare `Exception`; catch `httpx.HTTPError`/`asyncio.TimeoutError`.

```diff
+import os
@@
-BASE_URL = "http://localhost:11235"
+BASE_URL = os.environ.get("MCP_E2E_BASE_URL")  # e.g., http://localhost:11235
@@
-    async with httpx.AsyncClient() as client:
+    async with httpx.AsyncClient(timeout=30.0) as client:
@@
-    async with httpx.AsyncClient() as client:
+    async with httpx.AsyncClient(timeout=30.0) as client:
@@
-    async with httpx.AsyncClient() as client:
+    async with httpx.AsyncClient(timeout=30.0) as client:
@@
-    async with httpx.AsyncClient() as client:
+    async with httpx.AsyncClient(timeout=30.0) as client:
@@
-    for test in tests:
+    if not BASE_URL:
+        print("Skipping: set MCP_E2E_BASE_URL to run these live HTTP tests.")
+        return True
+    for test in tests:
@@
-        except Exception as e:
+        except (httpx.HTTPError, asyncio.TimeoutError) as e:
             print(f"✗ Test {test.__name__} errored: {e}")
             results.append(False)
```

Optionally add a pytest wrapper and mark these as integration tests.

Also applies to: 121-156
**1-1**: Remove shebang or make file executable

Shebang is unnecessary for a pytest module; remove it to satisfy EXE001 (or `chmod +x` if intentionally runnable).

tests/mcp/test_e2e_mcp.py (2)
**54-54**: Drop redundant f-string prefix

Literal has no placeholders; remove `f` to satisfy Ruff F541.

```diff
-        assert tool.name, f"Tool missing name"
+        assert tool.name, "Tool missing name"
```
**1-1**: Remove shebang or make file executable

For pytest modules, the shebang isn't needed.

tests/mcp/test_e2e_api.py (4)
**32-41**: Don't shadow the shared make_export_path; reuse the conftest factory to avoid path collisions

This local fixture ignores the xdist-safe factory in `tests/mcp/conftest.py` and hardcodes `/tmp/crawl4ai-exports`. Remove this fixture and use the shared one to avoid collisions in parallel runs.

```diff
-@pytest.fixture
-def make_export_path():
-    """Factory for test export paths."""
-    base_dir = Path(os.environ.get("MCP_EXPORT_DIR", "/tmp/crawl4ai-exports"))
-    base_dir.mkdir(parents=True, exist_ok=True)
-
-    def factory(filename: str) -> str:
-        return str(base_dir / f"test-api-(unknown)")
-
-    return factory
+# Reuse the xdist-safe factory from conftest.py
```
**426-462**: Tighten expectations for malformed JS

Status 200 with structured `js_execution_result.results[i]` is good. Add a check for an explicit `success: False` or `error` field on the failing item to assert the contract more crisply.
**170-196**: External network reliance—prefer marks or mocks

Screenshot/PDF tests hit external sites. Consider marking as `@pytest.mark.network` (opt-in) or mocking the renderer to avoid external dependencies.

Also applies to: 212-240, 294-305, 481-492
**1-1**: Remove shebang

Not needed for pytest modules.

tests/mcp/test_mcp_js_exec.py (2)
tests/mcp/test_mcp_js_exec.py (2)
61-63
: Use f-string conversion flags instead of repr().Prefer {val!r} over explicit repr() inside f-strings.
- print(f"Script {i+1} returned: {repr(value)}") + print(f"Script {i+1} returned: {value!r}") @@ - print(f"✅ Simple string test: '{test_str}'") + print(f"✅ Simple string test: {test_str!r}") @@ - print(f"✅ Simple number test: {test_num}") + print(f"✅ Simple number test: {test_num!r}")Also applies to: 91-97, 99-104
**1-1**: Remove shebang or make file executable.

Shebang isn't needed for pytest; drop it or `chmod +x` to satisfy linters.

deploy/docker/middleware/error_middleware.py (2)
deploy/docker/middleware/error_middleware.py (2)
29-36
: Avoid creating a fake Exception just to mint a correlation ID.Generate an ID directly; cheaper and clearer.
-from core.error_context import ErrorContext, set_correlation_id +from core.error_context import ErrorContext, set_correlation_id +from uuid import uuid4 @@ - correlation_id = request.headers.get("X-Correlation-ID", "") - if not correlation_id: - error_ctx = ErrorContext.from_exception(Exception()) - correlation_id = error_ctx.correlation_id + correlation_id = request.headers.get("X-Correlation-ID") or uuid4().hex
**62-71**: Map more error types where applicable (future-proofing).

Consider mapping auth failures to 401 and rate-limit to 429 when ErrorContext supports them.
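One possible shape for such a mapping (a sketch; the categories and the lookup table are assumptions, not the actual `ErrorContext` API):

```python
# Sketch: a category -> HTTP status table that the middleware could consult,
# extended with auth and rate-limit categories as suggested above.
from enum import Enum

class ErrorCategory(Enum):
    VALIDATION = "validation"
    AUTH = "auth"
    RATE_LIMIT = "rate_limit"
    INTERNAL = "internal"

STATUS_BY_CATEGORY = {
    ErrorCategory.VALIDATION: 422,
    ErrorCategory.AUTH: 401,        # auth failures -> Unauthorized
    ErrorCategory.RATE_LIMIT: 429,  # throttling -> Too Many Requests
    ErrorCategory.INTERNAL: 500,
}

def status_for(category: ErrorCategory) -> int:
    # Unknown categories fall back to a generic server error.
    return STATUS_BY_CATEGORY.get(category, 500)
```

A table like this keeps the middleware's status selection declarative, so adding a new error type is a one-line change.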
deploy/docker/core/result_handler.py (2)
**40-45**: Ensure dict keys are JSON-safe; coerce to strings.

Non-string keys in nested dicts will cause serialization failure and drop entire fields later.

```diff
-        if isinstance(val, dict):
-            return {k: ResultHandler.normalize_value(v) for k, v in val.items()}
+        if isinstance(val, dict):
+            return {str(k): ResultHandler.normalize_value(v) for k, v in val.items()}
```
**37-39**: Guard against huge byte payloads.

Base64-encoding very large bytes may blow memory. Consider size caps or streaming.
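A minimal sketch of such a cap (the 1 MiB threshold and the truncation payload are assumptions; the review elsewhere notes that `INLINE_BINARY_MAX_BYTES` exists in server.py but is currently unused):

```python
# Sketch: refuse to inline-encode oversized binary values; report metadata
# instead so the caller can fetch the payload another way (e.g. file mode).
import base64

INLINE_BINARY_MAX_BYTES = 1 * 1024 * 1024  # assumed 1 MiB cap before encoding

def encode_inline(val: bytes) -> dict:
    if len(val) > INLINE_BINARY_MAX_BYTES:
        # Base64 would grow this by ~33%; return a marker instead of the data.
        return {"truncated": True, "size_bytes": len(val)}
    return {"truncated": False, "base64": base64.b64encode(val).decode("ascii")}
```

Callers can check `truncated` and fall back to file-based export for large screenshots or PDFs.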
deploy/docker/app/api.py (5)
**393-395**: Drop redundant re-import; use the module-level datetime_handler.

Fixes Ruff F811 and avoids shadowing.

```diff
-    import json
-    from app.utils import datetime_handler
+    import json
```
**427-429**: Use Optional[str] for output_path type hint.

Aligns with PEP 484 and Ruff RUF013.

```diff
-    config: dict,
-    output_path: str = None
+    config: dict,
+    output_path: Optional[str] = None
```
**542-580**: Minor: improve error context on file write failures.

Use exception chaining and explicit conversion flags; consider ErrorContext for consistency.

```diff
-        except (OSError, TypeError) as write_error:
-            logger.warning("crawl file write failed: %s", write_error, exc_info=True)
-            raise HTTPException(
-                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                detail=f"Failed to write to {abs_path}: {str(write_error)}"
-            )
+        except (OSError, TypeError) as write_error:
+            logger.warning("crawl file write failed: %s", write_error, exc_info=True)
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail=f"Failed to write to {abs_path}: {write_error!s}",
+            ) from write_error
```
**529-541**: Consider returning a typed StandardizedResponse for consistency.

You already import StandardizedResponse; using it clarifies schema and keeps responses uniform.

```diff
-        response_data = {
-            "success": True,
-            "results": processed_results,
-            "server_processing_time_s": end_time - start_time,
-            "server_memory_delta_mb": mem_delta_mb,
-            "server_peak_memory_mb": peak_mem_mb
-        }
+        response_data = StandardizedResponse.success_response(
+            results=processed_results,
+            processing_time=end_time - start_time,
+            memory_delta=mem_delta_mb,
+            memory_peak=peak_mem_mb,
+        ).model_dump()
```
**482-489**: Optional: close async generator when breaking early to free resources.

If results_gen supports aclose(), call it after breaking on limit.

```diff
         async for result in results_gen:
             results.append(result)
             if limit and len(results) >= limit:
-                break
+                try:
+                    await results_gen.aclose()  # best effort
+                except Exception:
+                    pass
+                break
```
13-18
: Skip pytest collection for this interactive smoke testThese
test_*
coroutines with aclient
parameter will be collected by pytest and fail due to a missing fixture. Mark the module as skipped so it’s runnable viapython
but not part of CI.import anyio +import pytest from fastmcp.client import Client MCP_URL = os.getenv("MCP_URL", f"http://localhost:{os.getenv('HOST_PORT', '11235')}/mcp") +pytestmark = pytest.mark.skip(reason="Interactive MCP HTTP smoke test; run with `python tests/mcp/test_mcp_socket.py`")
91-96
: Align JS snippets with server-side contract (nopage
symbol)Server’s
/execute_js
docs state snippets execute in page scope without apage
variable. These examples usepage.click
/waitForTimeout
and may fail. Use DOM APIs instead.- "scripts": [ - "await page.click('a.morelink')", - "await page.waitForTimeout(1000)", - ], + "scripts": [ + "(async () => { document.querySelector('a.morelink')?.click(); })()", + "(async () => { await new Promise(r => setTimeout(r, 1000)); })()", + ],
20-27
: Harden JSON parsing in the smoke printsIf a tool returns non‑JSON text,
json.loads(...)
will raise and stop the run. Consider a tiny helper that returnsdict(text=<first 200 chars>)
on decode failure, so the script keeps going.Also applies to: 39-55, 66-71, 80-85, 98-100, 109-111, 122-124
deploy/docker/app/schemas.py (2)
**9-10**: Tighten types for configs

Prefer `Dict[str, Any]` over bare `Dict` for clarity and better editor support.

```diff
-    browser_config: Optional[Dict] = Field(default_factory=dict)
-    crawler_config: Optional[Dict] = Field(default_factory=dict)
+    browser_config: Optional[Dict[str, Any]] = Field(default_factory=dict)
+    crawler_config: Optional[Dict[str, Any]] = Field(default_factory=dict)
```
**49-53**: Enforce at least one JS snippet

Add `min_length=1` so empty lists are rejected at validation time.

```diff
-    scripts: List[str] = Field(
-        ...,
-        description="List of separated JavaScript snippets to execute"
-    )
+    scripts: List[str] = Field(
+        ...,
+        min_length=1,
+        description="List of separated JavaScript snippets to execute"
+    )
```
286-290
: Chain exceptions and use explicit conversion flagImprove traceability and satisfy linters by chaining and using
!s
in f‑strings.- except ValueError as e: - raise HTTPException(400, f"Configuration validation error: {str(e)}") - except Exception as e: - raise HTTPException(400, f"Failed to parse configuration: {str(e)}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.") + except ValueError as e: + raise HTTPException(400, f"Configuration validation error: {e!s}") from e + except Exception as e: + raise HTTPException( + 400, + f"Failed to parse configuration: {e!s}. Ensure you're using valid CrawlerRunConfig() or BrowserConfig() syntax.", + ) from e
**613-617**: Avoid logging potentially large JS results at INFO

`js_execution_result` can be large. Log a summary at DEBUG to keep logs lean.

```diff
-        logger.info(f"JS execution results: {data['js_execution_result']}")
+        logger.debug("JS execution results keys=%s", list(data['js_execution_result'].keys()))
```
**129-146**: Simplify combined lifespan logic

Checks against `crawler_lifespan is None` are never true and add noise. Streamline the combinator.

```diff
 @asynccontextmanager
 async def combined_lifespan(app):
-    if mcp_app.lifespan is None and crawler_lifespan is None:
-        yield
-        return
-    if mcp_app.lifespan is None:
-        async with crawler_lifespan(app):
-            yield
-        return
-    if crawler_lifespan is None:
-        async with mcp_app.lifespan(app):
-            yield
-        return
-
-    async with mcp_app.lifespan(app):
-        async with crawler_lifespan(app):
-            yield
+    if mcp_app.lifespan:
+        async with mcp_app.lifespan(app):
+            async with crawler_lifespan(app):
+                yield
+    else:
+        async with crawler_lifespan(app):
+            yield
```
888-893: Mount MCP app under `/mcp` and use logger instead of prints

Mounting at `/` works but is surprising; mounting at `/mcp` is clearer and avoids any routing ambiguity. Also prefer `logger` over `print`.

```diff
-# Mount MCP app at root level but preserve our FastAPI routes
-# Known limitation: MCP tools don't forward JWT Authorization headers when security.jwt_enabled=true
-# Internal proxy calls will fail authentication. Disable JWT or implement header forwarding for MCP usage.
-app.mount("/", app=mcp_app, name="mcp")
-print("Mounted MCP app at / (MCP endpoint at /mcp)")
-print(f"MCP schema available at http://{config['app']['host']}:{config['app']['port']}/mcp/schema")
+# Mount MCP app under /mcp (routes remain intact)
+# Known limitation: MCP tools don't forward JWT Authorization headers when security.jwt_enabled=true
+# Disable JWT or implement header forwarding for MCP usage.
+app.mount("/mcp", app=mcp_app, name="mcp")
+logger.info("Mounted MCP app at /mcp")
+logger.info("MCP schema available at http://%s:%s/mcp/schema", config['app']['host'], config['app']['port'])
```
83-85: Remove unused inline-binary limits or apply them

`INLINE_BINARY_MAX_BYTES` / `INLINE_BINARY_MAX_BASE64_CHARS` are defined but unused. Either enforce truncation/auto-file mode using them or drop the constants.
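If the constants are kept, the enforcement path could look like the following sketch. The constant values, function name, and return shape are all illustrative assumptions, not the PR's actual API; only the two constant names come from the review above.

```python
import base64

# Hypothetical caps; the real values in server.py may differ.
INLINE_BINARY_MAX_BYTES = 4 * 1024 * 1024          # 4 MiB of raw payload
INLINE_BINARY_MAX_BASE64_CHARS = 6 * 1024 * 1024   # ~4 MiB once base64-encoded

def inline_or_truncate(payload: bytes) -> dict:
    """Return an inline base64 body, or a marker telling the caller to fall back to file mode."""
    if len(payload) > INLINE_BINARY_MAX_BYTES:
        return {"inline": False, "reason": "payload exceeds INLINE_BINARY_MAX_BYTES"}
    encoded = base64.b64encode(payload).decode("ascii")
    if len(encoded) > INLINE_BINARY_MAX_BASE64_CHARS:
        return {"inline": False, "reason": "encoded size exceeds INLINE_BINARY_MAX_BASE64_CHARS"}
    return {"inline": True, "data": encoded}

print(inline_or_truncate(b"tiny")["inline"])  # small payloads stay inline -> True
```

Checking both the raw and encoded sizes matters because base64 inflates payloads by roughly 4/3, so a payload can pass the byte cap yet blow past a character budget downstream.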
194-204: Minor: middleware import locality

Importing `ErrorHandlingMiddleware` inside the function scope is fine, but a top-level import keeps imports consistent and avoids surprises during app startup errors.
858-861: Tool registration timing

Tool extraction/registration runs at import time. If registration depends on dynamic config (e.g., JWT enabled) you may want to guard or delay it until after app startup to avoid registering unusable tools under JWT.
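One way to express that guard is a registration function called from the lifespan hook instead of at import time. This is a hypothetical sketch: the function name, config shape, and tool names are placeholders, not the PR's actual `MCPService` API.

```python
# Hypothetical: defer tool registration so it can consult runtime config
# (e.g. security.jwt_enabled) rather than running at import time.
REGISTERED_TOOLS: list[str] = []

def register_tools(config: dict) -> list[str]:
    # Placeholder tool names; the PR advertises 7 working tools.
    candidate_tools = ["crawl", "md", "screenshot", "pdf", "execute_js", "html", "ask"]
    if config.get("security", {}).get("jwt_enabled"):
        # Internal proxy calls would fail auth, so don't register unusable tools.
        return []
    REGISTERED_TOOLS.extend(candidate_tools)
    return REGISTERED_TOOLS

print(len(register_tools({"security": {"jwt_enabled": False}})))  # 7
```

Calling this from the combined lifespan (after config is loaded) keeps the import side-effect-free and makes the JWT limitation explicit instead of silently broken.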
deploy/docker/core/error_context.py (3)
311-314: Propagate correlation ID via HTTP headers

Include `X-Correlation-ID` on raised HTTP exceptions for cross-service tracing. Apply this diff:

```diff
 return HTTPException(
     status_code=status_code,
-    detail=self.to_error_response().model_dump()
+    detail=self.to_error_response().model_dump(),
+    headers={"X-Correlation-ID": self.correlation_id},
 )
```
191-206: Expand HTTP status mapping (401/403/429) for clearer client guidance

Add common statuses and suggestions to `status_map`. Apply this diff:

```diff
 status_map = {
+    401: ("unauthorized", "Provide valid credentials (Authorization header)"),
+    403: ("forbidden", "Ensure your credentials have sufficient permissions"),
+    429: ("rate_limited", "Reduce request rate and retry after a delay"),
 }
```
1-1: Minor: header comment path mismatch

File header says `deploy/docker/error_context.py`; the actual path is `deploy/docker/core/error_context.py`. Consider updating for consistency.

deploy/docker/core/mcp_service.py (5)
71-76: Use lstrip for OpenAPI $ref resolution

`strip("#/")` can remove characters from the end; `lstrip` is safer for prefix removal. Apply this diff:

```diff
 def resolve_ref(ref):
-    parts = ref.strip("#/").split("/")
+    parts = ref.lstrip("#/").split("/")
     d = openapi_schema
     for part in parts:
         d = d[part]
     return d
```
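Worth noting: both `strip()` and `lstrip()` treat their argument as a *character set*, not a literal prefix, so `lstrip("#/")` would also eat extra leading `#` or `/` characters. On Python 3.9+, `str.removeprefix` expresses the intent exactly. A small demonstration (the `$ref` value is a typical example, not taken from the PR):

```python
ref = "#/components/schemas/CrawlRequest"

# lstrip("#/") strips any run of '#' and '/' characters, not the literal "#/":
assert ref.lstrip("#/") == "components/schemas/CrawlRequest"
assert "##//components".lstrip("#/") == "components"  # over-eager, by design

# removeprefix() removes exactly one literal "#/", matching how JSON Pointer
# fragments in OpenAPI $ref values are formed.
parts = ref.removeprefix("#/").split("/")
print(parts)  # ['components', 'schemas', 'CrawlRequest']
```

For well-formed refs the two behave identically, so the suggested `lstrip` fix is safe; `removeprefix` just makes the prefix intent unambiguous.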
97-105: Avoid mutating the OpenAPI schema object when composing input schemas

Deep-copy referenced/requestBody schemas before augmentation. Apply this diff:

```diff
-            if "requestBody" in operation:
+            if "requestBody" in operation:
                 request_body = operation.get("requestBody", {})
                 if "content" in request_body and "$ref" in request_body.get("content", {}).get("application/json", {}).get("schema", {}):
                     schema_ref = request_body["content"]["application/json"]["schema"]["$ref"]
-                    input_schema = resolve_ref(schema_ref)
+                    input_schema = resolve_ref(schema_ref).copy()
                 elif "content" in request_body:
-                    input_schema = request_body["content"].get("application/json", {}).get("schema", {})
+                    input_schema = request_body["content"].get("application/json", {}).get("schema", {}).copy()
```

Additional change outside this range (import at top): `import copy` if you choose `copy.deepcopy()` instead of `.copy()`.
174-176: Deterministic HTTP method selection

Prefer GET if present; otherwise pick the first sorted method for stability. Apply this diff:

```diff
- methods = route.methods - {"HEAD", "OPTIONS"}
- http_method = next(iter(methods)) if methods else "GET"
+ methods = (route.methods or set()) - {"HEAD", "OPTIONS"}
+ http_method = "GET" if "GET" in methods else (sorted(methods)[0] if methods else "GET")
```
135-140: Connection pooling and resource reuse for proxies

Create and reuse a single `httpx.AsyncClient` per `MCPService` with tuned limits/timeouts; expose an `.aclose()` for shutdown. This reduces latency and GC overhead versus constructing a client per call. Based on learnings.

I can provide a patch plus a small integration note to close the client at app shutdown if you confirm where lifecycle events are registered (e.g., in `server.py`).
125-129: Optional: include required flags in inputSchema

When merging OpenAPI parameters, set `inputSchema["required"]` based on `param.get("required")` for better client validation in MCP.
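A minimal sketch of that merge, assuming the same `input_schema` dict shape used in `mcp_service.py`; the helper name and the example parameters are hypothetical:

```python
# Hypothetical helper: merge OpenAPI query/path parameters into an MCP
# inputSchema, carrying each parameter's "required" flag along.
def merge_parameters(input_schema: dict, parameters: list[dict]) -> dict:
    properties = input_schema.setdefault("properties", {})
    required = set(input_schema.get("required", []))
    for param in parameters:
        if param.get("in") not in ("query", "path"):
            continue
        name = param["name"]
        properties[name] = param.get("schema", {})
        if param.get("required"):  # path params are always required in OpenAPI
            required.add(name)
    if required:
        input_schema["required"] = sorted(required)
    return input_schema

schema = merge_parameters(
    {"type": "object", "properties": {}},
    [
        {"name": "url", "in": "path", "required": True, "schema": {"type": "string"}},
        {"name": "cache", "in": "query", "schema": {"type": "boolean"}},
    ],
)
print(schema["required"])  # ['url']
```

MCP clients that validate arguments against `inputSchema` can then reject a missing path parameter before the proxy call is ever made.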
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (36)
- Dockerfile (1 hunks)
- crawl4ai/async_configs.py (3 hunks)
- deploy/docker/.env.example (1 hunks)
- deploy/docker/.llm.env.example (0 hunks)
- deploy/docker/README.md (12 hunks)
- deploy/docker/app/__init__.py (1 hunks)
- deploy/docker/app/api.py (14 hunks)
- deploy/docker/app/crawler_pool.py (1 hunks)
- deploy/docker/app/schemas.py (1 hunks)
- deploy/docker/app/utils.py (3 hunks)
- deploy/docker/config.yml (2 hunks)
- deploy/docker/core/__init__.py (1 hunks)
- deploy/docker/core/error_context.py (1 hunks)
- deploy/docker/core/job.py (3 hunks)
- deploy/docker/core/mcp_service.py (1 hunks)
- deploy/docker/core/result_handler.py (1 hunks)
- deploy/docker/mcp_bridge.py (0 hunks)
- deploy/docker/middleware/__init__.py (1 hunks)
- deploy/docker/middleware/error_middleware.py (1 hunks)
- deploy/docker/requirements.txt (1 hunks)
- deploy/docker/schemas.py (0 hunks)
- deploy/docker/server.py (17 hunks)
- deploy/docker/static/playground/index.html (2 hunks)
- docker-compose.yml (2 hunks)
- docs/md_v2/assets/llm.txt/txt/docker.txt (1 hunks)
- docs/md_v2/core/docker-deployment.md (1 hunks)
- pyproject.toml (1 hunks)
- tests/docker/test_config_object.py (3 hunks)
- tests/mcp/conftest.py (1 hunks)
- tests/mcp/test_backward_compatibility.py (1 hunks)
- tests/mcp/test_e2e_api.py (1 hunks)
- tests/mcp/test_e2e_mcp.py (1 hunks)
- tests/mcp/test_mcp_http_listing.py (1 hunks)
- tests/mcp/test_mcp_js_exec.py (1 hunks)
- tests/mcp/test_mcp_socket.py (1 hunks)
- tests/mcp/test_mcp_sse.py (0 hunks)
💤 Files with no reviewable changes (4)
- deploy/docker/.llm.env.example
- deploy/docker/schemas.py
- tests/mcp/test_mcp_sse.py
- deploy/docker/mcp_bridge.py
🧰 Additional context used
🧬 Code graph analysis (17)
tests/mcp/test_backward_compatibility.py (2)
- tests/mcp/test_e2e_api.py (1): `client` (23-29)
- tests/mcp/test_mcp_http_listing.py (1): `main` (17-25)

tests/mcp/test_mcp_http_listing.py (1)
- tests/mcp/test_mcp_socket.py (1): `main` (126-137)

tests/docker/test_config_object.py (2)
- crawl4ai/deep_crawling/filters.py (3): `FilterChain` (69-116), `ContentTypeFilter` (258-421), `DomainFilter` (424-498)
- crawl4ai/deep_crawling/scorers.py (1): `KeywordRelevanceScorer` (160-188)

deploy/docker/core/result_handler.py (1)
- crawl4ai/models.py (1): `model_dump` (240-258)

tests/mcp/test_e2e_mcp.py (1)
- tests/mcp/test_e2e_api.py (1): `client` (23-29)

tests/mcp/conftest.py (1)
- tests/mcp/test_e2e_api.py (3): `client` (23-29), `make_export_path` (33-41), `factory` (38-39)

tests/mcp/test_e2e_api.py (1)
- tests/mcp/conftest.py (2): `make_export_path` (308-322), `factory` (316-320)

tests/mcp/test_mcp_socket.py (1)
- tests/mcp/test_mcp_http_listing.py (1): `main` (17-25)

deploy/docker/core/error_context.py (1)
- deploy/docker/app/schemas.py (6): `ErrorResponse` (96-199), `network_error` (107-121), `validation_error` (124-141), `security_error` (144-161), `configuration_error` (164-179), `processing_error` (182-199)

deploy/docker/middleware/error_middleware.py (1)
- deploy/docker/core/error_context.py (5): `ErrorContext` (38-335), `set_correlation_id` (32-34), `from_exception` (163-186), `log` (316-335), `to_error_response` (259-295)

crawl4ai/async_configs.py (1)
- tests/docker/test_serialization.py (1): `from_serializable_dict` (67-106)

deploy/docker/app/crawler_pool.py (1)
- deploy/docker/app/utils.py (1): `load_config` (22-40)

deploy/docker/core/mcp_service.py (1)
- deploy/docker/server.py (1): `get_mcp_schema` (865-883)

deploy/docker/app/utils.py (2)
- deploy/docker/core/error_context.py (3): `ErrorContext` (38-335), `security_error` (105-124), `to_http_exception` (297-314)
- deploy/docker/app/schemas.py (1): `security_error` (144-161)

deploy/docker/app/schemas.py (2)
- deploy/docker/app/utils.py (1): `FilterType` (16-20)
- deploy/docker/core/error_context.py (5): `network_error` (67-81), `validation_error` (84-102), `security_error` (105-124), `configuration_error` (127-141), `processing_error` (144-160)

deploy/docker/app/api.py (4)
- deploy/docker/core/error_context.py (2): `parse_network_error` (338-374), `to_http_exception` (297-314)
- deploy/docker/core/result_handler.py (3): `ResultHandler` (13-169), `normalize_results` (90-133), `normalize_single_result` (49-87)
- deploy/docker/app/schemas.py (1): `StandardizedResponse` (55-93)
- deploy/docker/app/utils.py (4): `get_llm_api_key` (73-93), `validate_llm_provider` (96-116), `datetime_handler` (57-61), `_ensure_within_base_dir` (128-169)

deploy/docker/server.py (7)
- deploy/docker/app/api.py (5): `handle_crawl_request` (423-607), `handle_llm_qa` (68-114), `handle_markdown_request` (181-245), `handle_stream_crawl_request` (609-658), `stream_results` (390-421)
- deploy/docker/app/crawler_pool.py (3): `close_all` (47-50), `get_crawler` (22-46), `janitor` (52-60)
- deploy/docker/core/mcp_service.py (4): `MCPService` (50-487), `extract_tools_from_routes` (141-166), `register_tools` (221-230), `get_mcp_schema` (60-133)
- deploy/docker/app/schemas.py (6): `CrawlRequest` (7-11), `MarkdownRequest` (13-28), `ErrorResponse` (96-199), `PDFRequest` (42-44), `RawCode` (31-32), `ScreenshotRequest` (37-40)
- deploy/docker/app/utils.py (2): `FilterType` (16-20), `_ensure_within_base_dir` (128-169)
- deploy/docker/core/error_context.py (4): `ErrorContext` (38-335), `parse_network_error` (338-374), `parse_security_error` (377-392), `to_http_exception` (297-314)
- deploy/docker/middleware/error_middleware.py (1): `ErrorHandlingMiddleware` (18-76)
🪛 Ruff (0.13.1)
tests/mcp/test_backward_compatibility.py
1-1: Shebang is present but file is not executable (EXE001)
138-138: Do not catch blind exception: `Exception` (BLE001)
tests/mcp/test_mcp_js_exec.py
1-1: Shebang is present but file is not executable (EXE001)
27-27: Probable use of `requests` call without timeout (S113)
62-62: Use explicit conversion flag; replace with conversion flag (RUF010)
93-93: Use explicit conversion flag; replace with conversion flag (RUF010)
101-101: Use explicit conversion flag; replace with conversion flag (RUF010)
118-118: Probable use of `requests` call without timeout (S113)
deploy/docker/core/result_handler.py
167-167: Consider moving this statement to an `else` block (TRY300)
tests/mcp/test_e2e_mcp.py
1-1: Shebang is present but file is not executable (EXE001)
54-54: f-string without any placeholders; remove extraneous `f` prefix (F541)
tests/mcp/conftest.py
37-37: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports" (S108)
46-47: `try`-`except`-`pass` detected, consider logging the exception (S110)
46-46: Do not catch blind exception: `Exception` (BLE001)
64-64: Do not catch blind exception: `Exception` (BLE001)
65-65: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
65-65: Avoid specifying long messages outside the exception class (TRY003)
74-75: `try`-`except`-`pass` detected, consider logging the exception (S110)
74-74: Do not catch blind exception: `Exception` (BLE001)
77-77: Avoid specifying long messages outside the exception class (TRY003)
113-114: `try`-`except`-`pass` detected, consider logging the exception (S110)
113-113: Do not catch blind exception: `Exception` (BLE001)
139-139: Do not catch blind exception: `Exception` (BLE001)
160-161: `try`-`except`-`pass` detected, consider logging the exception (S110)
160-160: Do not catch blind exception: `Exception` (BLE001)
163-164: `try`-`except`-`pass` detected, consider logging the exception (S110)
163-163: Do not catch blind exception: `Exception` (BLE001)
183-184: `try`-`except`-`pass` detected, consider logging the exception (S110)
183-183: Do not catch blind exception: `Exception` (BLE001)
203-203: Do not catch blind exception: `Exception` (BLE001)
203-203: Local variable `e` is assigned to but never used; remove assignment to unused variable `e` (F841)
216-216: Do not catch blind exception: `Exception` (BLE001)
262-263: `try`-`except`-`pass` detected, consider logging the exception (S110)
262-262: Do not catch blind exception: `Exception` (BLE001)
313-313: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports" (S108)
354-355: `try`-`except`-`pass` detected, consider logging the exception (S110)
354-354: Do not catch blind exception: `Exception` (BLE001)
358-359: `try`-`except`-`pass` detected, consider logging the exception (S110)
358-358: Do not catch blind exception: `Exception` (BLE001)
tests/mcp/test_e2e_api.py
1-1: Shebang is present but file is not executable (EXE001)
35-35: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports" (S108)
tests/mcp/test_mcp_socket.py
26-26: Avoid specifying long messages outside the exception class (TRY003)
deploy/docker/core/error_context.py
182-182: Use explicit conversion flag; replace with conversion flag (RUF010)
deploy/docker/middleware/error_middleware.py
40-40: Consider moving this statement to an `else` block (TRY300)
crawl4ai/async_configs.py
1431-1431: Avoid specifying long messages outside the exception class (TRY003)
deploy/docker/core/mcp_service.py
80-80: Do not call `getattr` with a constant attribute value; it is not any safer than normal property access. Replace `getattr` with attribute access (B009)
207-207: Avoid specifying long messages outside the exception class (TRY003)
deploy/docker/app/api.py
88-88: Abstract `raise` to an inner function (TRY301)
234-234: Abstract `raise` to an inner function (TRY301)
393-393: Redefinition of unused `datetime_handler` from line 51; remove the duplicate definition (F811)
428-428: PEP 484 prohibits implicit `Optional`; convert to `Optional[T]` (RUF013)
440-444: Abstract `raise` to an inner function (TRY301)
493-493: Do not catch blind exception: `Exception` (BLE001)
508-508: Do not catch blind exception: `Exception` (BLE001)
544-544: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports" (S108)
576-579: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
578-578: Use explicit conversion flag; replace with conversion flag (RUF010)
581-581: Consider moving this statement to an `else` block (TRY300)
deploy/docker/server.py
109-109: Possible binding to all interfaces (S104)
226-226: Probable insecure usage of temporary file or directory: "/tmp/crawl4ai-exports" (S108)
287-287: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` (B904)
287-287: Use explicit conversion flag (RUF010)
288-288: Do not catch blind exception: `Exception` (BLE001)
289-289: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` (B904)
289-289: Use explicit conversion flag (RUF010)
324-324: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` (B904)
875-875: Do not catch blind exception: `Exception` (BLE001)
877-877: Use `logging.exception` instead of `logging.error` (TRY400)
879-879: Use explicit conversion flag (RUF010)
🪛 dotenv-linter (3.3.0)
deploy/docker/.env.example
[warning] 16-16: [UnorderedKey] The DEEPSEEK_API_KEY key should go before the OPENAI_API_KEY key
[warning] 17-17: [UnorderedKey] The ANTHROPIC_API_KEY key should go before the DEEPSEEK_API_KEY key
[warning] 18-18: [UnorderedKey] The GROQ_API_KEY key should go before the OPENAI_API_KEY key
[warning] 20-20: [UnorderedKey] The MISTRAL_API_KEY key should go before the OPENAI_API_KEY key
[warning] 21-21: [UnorderedKey] The GEMINI_API_TOKEN key should go before the GROQ_API_KEY key
[warning] 62-62: [EndingBlankLine] No blank line at the end of the file
🔇 Additional comments (7)
deploy/docker/middleware/__init__.py (1)
1-2: Module docstring looks good

Clean placeholder docstring accurately describes the package purpose. No further changes needed.
deploy/docker/requirements.txt (1)
14-14: Container Python version (3.11.2) satisfies FastMCP's Python ≥3.10 requirement.

crawl4ai/async_configs.py (1)

1430-1431: String-only enforcement for `wait_for` is safe

A search across all async code, tests, and examples shows `wait_for` is only ever passed a string (or left as `None`), so this validation won't break existing async use cases.

deploy/docker/config.yml (1)
87-88: LGTM: unified health endpoint

`/health` aligns with Dockerfile/compose checks. Confirm server.py actually mounts `/health` on 11235 in this refactor.

docker-compose.yml (1)
4-7: Nice: parameterized host port and unified env file

This improves DX and portability.
deploy/docker/README.md (1)
75-95: Ignore `.llm.env` deprecation suggestion

No occurrences of `.llm.env` were found in the repository; docs already reference only `.env`. Likely an incorrect or invalid review comment.
deploy/docker/core/mcp_service.py (1)
243-293: Typed-path params and multi-value query cases aren't used by any current MCP tools; existing tests cover body+query behavior for GET/POST. Add unit tests for `{param:path}`, repeated path params, and empty-string or list-valued query params only when you introduce endpoints requiring those edge cases.
```python
class StandardizedResponse(BaseModel):
    """Standardized response format for all crawl operations."""
    success: bool = Field(..., description="Whether the operation succeeded")
    results: List[Dict[str, Any]] = Field(default_factory=list, description="List of crawl results")
    server_processing_time_s: Optional[float] = Field(None, description="Server processing time in seconds")
    server_memory_delta_mb: Optional[float] = Field(None, description="Memory usage delta in MB")
    server_peak_memory_mb: Optional[float] = Field(None, description="Peak memory usage in MB")
    error_message: Optional[str] = Field(None, description="Error message if operation failed")
```
🛠️ Refactor suggestion | 🟠 Major
Add correlation IDs to standard responses

PR notes mention "unified ErrorContext with correlation IDs", but neither `StandardizedResponse` nor `ErrorResponse` exposes it. Surface `correlation_id` for tracing, and accept it in factories.
```diff
 class StandardizedResponse(BaseModel):
     """Standardized response format for all crawl operations."""
     success: bool = Field(..., description="Whether the operation succeeded")
     results: List[Dict[str, Any]] = Field(default_factory=list, description="List of crawl results")
     server_processing_time_s: Optional[float] = Field(None, description="Server processing time in seconds")
     server_memory_delta_mb: Optional[float] = Field(None, description="Memory usage delta in MB")
     server_peak_memory_mb: Optional[float] = Field(None, description="Peak memory usage in MB")
     error_message: Optional[str] = Field(None, description="Error message if operation failed")
+    correlation_id: Optional[str] = Field(None, description="Correlation ID for request tracing")

     @classmethod
     def success_response(
         cls,
         results: List[Dict[str, Any]],
         processing_time: Optional[float] = None,
         memory_delta: Optional[float] = None,
-        memory_peak: Optional[float] = None
+        memory_peak: Optional[float] = None,
+        correlation_id: Optional[str] = None,
     ) -> "StandardizedResponse":
         """Create a successful response."""
         return cls(
             success=True,
             results=results,
             server_processing_time_s=processing_time,
             server_memory_delta_mb=memory_delta,
-            server_peak_memory_mb=memory_peak
+            server_peak_memory_mb=memory_peak,
+            correlation_id=correlation_id,
         )

     @classmethod
     def error_response(
         cls,
         error_message: str,
-        processing_time: Optional[float] = None
+        processing_time: Optional[float] = None,
+        correlation_id: Optional[str] = None,
     ) -> "StandardizedResponse":
         """Create an error response."""
         return cls(
             success=False,
             results=[],
             error_message=error_message,
-            server_processing_time_s=processing_time
+            server_processing_time_s=processing_time,
+            correlation_id=correlation_id,
         )
```
Also applies to: 64-80, 81-93
🤖 Prompt for AI Agents
In deploy/docker/app/schemas.py around lines 55-63 (and similarly apply to 64-80
and 81-93), the standardized response models do not expose a correlation_id for
tracing; add an optional correlation_id: Optional[str] = Field(None,
description="Correlation ID for request tracing") to StandardizedResponse and
ErrorResponse datamodels, update their constructors/factory methods to accept a
correlation_id parameter and populate the field, and ensure type hints, Field
descriptions, and any response-building helper functions include and propagate
this value so correlation IDs are surfaced in all generated responses.
```python
class ErrorResponse(BaseModel):
    """Standardized error response format."""
    success: bool = Field(False, description="Always false for error responses")
    error_type: str = Field(..., description="Type of error (network, validation, security, etc.)")
    error_message: str = Field(..., description="User-friendly error message")
    error_code: Optional[str] = Field(None, description="Specific error code for programmatic handling")
    details: Optional[Dict[str, Any]] = Field(None, description="Additional error context")
    suggestions: Optional[List[str]] = Field(None, description="Suggested actions to resolve the error")
    operation: Optional[str] = Field(None, description="Operation that failed")
```
🛠️ Refactor suggestion | 🟠 Major
Expose correlation IDs on errors

Mirror middleware/ErrorContext by adding `correlation_id` to `ErrorResponse`.
```diff
 class ErrorResponse(BaseModel):
     """Standardized error response format."""
     success: bool = Field(False, description="Always false for error responses")
     error_type: str = Field(..., description="Type of error (network, validation, security, etc.)")
     error_message: str = Field(..., description="User-friendly error message")
     error_code: Optional[str] = Field(None, description="Specific error code for programmatic handling")
     details: Optional[Dict[str, Any]] = Field(None, description="Additional error context")
     suggestions: Optional[List[str]] = Field(None, description="Suggested actions to resolve the error")
     operation: Optional[str] = Field(None, description="Operation that failed")
+    correlation_id: Optional[str] = Field(None, description="Correlation ID for request tracing")
```
Also applies to: 106-121
🤖 Prompt for AI Agents
In deploy/docker/app/schemas.py around lines 96-105 (and also apply the same
change to 106-121), the ErrorResponse model is missing the correlation_id field
used by middleware/ErrorContext; add a new field correlation_id: Optional[str] =
Field(None, description="Correlation ID from middleware/ErrorContext") to the
ErrorResponse Pydantic model so error payloads include the request correlation
ID for tracing and ensure the same field is added to the other error/response
model in lines 106-121.
```python
def to_error_response(self) -> ErrorResponse:
    """Convert to ErrorResponse schema."""
    kwargs = {
        "error_type": self.error_type,
        "error_message": self.message,
        "operation": self.operation,
        "suggestions": self.suggestions or None,
    }

    if self.error_code:
        kwargs["error_code"] = self.error_code

    if self.field_name:
        kwargs["field_name"] = self.field_name
    if self.invalid_value is not None:
        kwargs["invalid_value"] = self.invalid_value
    if self.expected_format:
        kwargs["expected_format"] = self.expected_format

    if self.violation_type:
        kwargs["violation_type"] = self.violation_type
    if self.attempted_action:
        kwargs["attempted_action"] = self.attempted_action
    if self.allowed_scope:
        kwargs["allowed_scope"] = self.allowed_scope

    if self.config_key:
        kwargs["config_key"] = self.config_key
    if self.config_value is not None:
        kwargs["config_value"] = self.config_value

    if self.processing_stage:
        kwargs["processing_stage"] = self.processing_stage
    if self.data_type:
        kwargs["data_type"] = self.data_type

    return ErrorResponse(**kwargs)
```
Fix: to_error_response passes fields not defined in ErrorResponse (will raise Pydantic validation error).
Map non-core attributes into the details
dict (and include correlation_id
+ context
) instead of passing them as top-level fields.
Apply this diff:
def to_error_response(self) -> ErrorResponse:
- """Convert to ErrorResponse schema."""
- kwargs = {
- "error_type": self.error_type,
- "error_message": self.message,
- "operation": self.operation,
- "suggestions": self.suggestions or None,
- }
-
- if self.error_code:
- kwargs["error_code"] = self.error_code
-
- if self.field_name:
- kwargs["field_name"] = self.field_name
- if self.invalid_value is not None:
- kwargs["invalid_value"] = self.invalid_value
- if self.expected_format:
- kwargs["expected_format"] = self.expected_format
-
- if self.violation_type:
- kwargs["violation_type"] = self.violation_type
- if self.attempted_action:
- kwargs["attempted_action"] = self.attempted_action
- if self.allowed_scope:
- kwargs["allowed_scope"] = self.allowed_scope
-
- if self.config_key:
- kwargs["config_key"] = self.config_key
- if self.config_value is not None:
- kwargs["config_value"] = self.config_value
-
- if self.processing_stage:
- kwargs["processing_stage"] = self.processing_stage
- if self.data_type:
- kwargs["data_type"] = self.data_type
-
- return ErrorResponse(**kwargs)
+ """Convert to ErrorResponse schema."""
+ details: Dict[str, Any] = {}
+ for k, v in [
+ ("field_name", self.field_name),
+ ("invalid_value", self.invalid_value),
+ ("expected_format", self.expected_format),
+ ("violation_type", self.violation_type),
+ ("attempted_action", self.attempted_action),
+ ("allowed_scope", self.allowed_scope),
+ ("config_key", self.config_key),
+ ("config_value", self.config_value),
+ ("processing_stage", self.processing_stage),
+ ("data_type", self.data_type),
+ ("correlation_id", self.correlation_id),
+ ]:
+ if v not in (None, "", []):
+ details[k] = v
+ if self.context:
+ details["context"] = self.context
+
+ return ErrorResponse(
+ error_type=self.error_type,
+ error_message=self.message,
+ error_code=self.error_code,
+ operation=self.operation,
+ suggestions=self.suggestions or None,
+ details=details or None,
+ )
🤖 Prompt for AI Agents
In deploy/docker/core/error_context.py around lines 259-296, to_error_response
is currently passing many attributes as top-level kwargs which ErrorResponse
doesn't define; instead build a core kwargs with error_type, error_message,
operation, suggestions (and error_code if present) and collect all non-core
attributes (field_name, invalid_value, expected_format, violation_type,
attempted_action, allowed_scope, config_key, config_value, processing_stage,
data_type, plus correlation_id and context) into a details dict (only include
keys with non-None values), then set kwargs["details"] = details and return
ErrorResponse(**kwargs).
"ERR_SSL_PROTOCOL_ERROR": [ | ||
"Check if the website has a valid SSL certificate", | ||
"Try using http:// instead of https:// if appropriate", | ||
"Contact the website administrator about certificate issues" | ||
], |
Avoid encouraging TLS downgrade.
Suggest fixing TLS/certificates rather than using plain HTTP.
Apply this diff:
- "ERR_SSL_PROTOCOL_ERROR": [
- "Check if the website has a valid SSL certificate",
- "Try using http:// instead of https:// if appropriate",
- "Contact the website administrator about certificate issues"
- ],
+ "ERR_SSL_PROTOCOL_ERROR": [
+ "Check that the site presents a valid TLS certificate (correct hostname, not expired, full chain)",
+ "Review server/client TLS settings (protocols/ciphers) and trust store",
+ "Contact the website administrator about certificate issues"
+ ],
🤖 Prompt for AI Agents
In deploy/docker/core/error_context.py around lines 413 to 417, the
ERR_SSL_PROTOCOL_ERROR entry currently suggests using http:// which encourages
TLS downgrade; remove the plain-HTTP suggestion and replace it with remediation
steps focused on fixing TLS/certificate issues (e.g., verify certificate
validity and chain, ensure server supports up-to-date TLS versions and ciphers,
check for misconfigured intermediates/HSTS, and contact the site administrator
or host to renew/reinstall certificates); keep the other user-facing guidance
concise and security-first.
tool_name = getattr(route.endpoint, "__mcp_name__", None) or operation.get("summary") or route.name

input_schema = {"type": "object", "properties": {}}

if "requestBody" in operation:
    request_body = operation.get("requestBody", {})
    if "content" in request_body and "$ref" in request_body.get("content", {}).get("application/json", {}).get("schema", {}):
        schema_ref = request_body["content"]["application/json"]["schema"]["$ref"]
        input_schema = resolve_ref(schema_ref)
    elif "content" in request_body:
        input_schema = request_body["content"].get("application/json", {}).get("schema", {})

if "parameters" in operation:
    if "properties" not in input_schema:
        input_schema["properties"] = {}

    properties = input_schema["properties"]
    for param in operation["parameters"]:
        if "$ref" in param:
            param = resolve_ref(param["$ref"])

        if param.get("in") not in ["query", "path"]:
            continue

        param_name = param["name"]
        param_schema = param.get("schema", {})

        if "description" in param:
            param_schema["description"] = param["description"]

        properties[param_name] = param_schema

tool_schema = {
    "name": tool_name,
    "description": operation.get("description", "") or inspect.getdoc(route.endpoint) or "",
    "inputSchema": input_schema
}
Sanitize tool names in the generated MCP schema.
`get_mcp_schema` currently emits raw names; use the same sanitizer as registration to avoid invalid identifiers.
Apply this diff:
- tool_name = getattr(route.endpoint, "__mcp_name__", None) or operation.get("summary") or route.name
+ raw_name = getattr(route.endpoint, "__mcp_name__", None) or operation.get("summary") or route.name
+ tool_name = self._sanitize_tool_name(raw_name)
...
- tool_schema = {
- "name": tool_name,
+ tool_schema = {
+ "name": tool_name,
"description": operation.get("description", "") or inspect.getdoc(route.endpoint) or "",
"inputSchema": input_schema
}
🤖 Prompt for AI Agents
In deploy/docker/core/mcp_service.py around lines 93 to 129, the generated
tool_name for the MCP schema uses raw route/operation names which can produce
invalid identifiers; update this block to run the resolved tool_name through the
same sanitizer used during tool registration (the registration
name-normalization function) before assigning to tool_schema["name"]. Ensure you
import or reference the sanitizer function used elsewhere, call it on the
computed tool_name (after falling back to __mcp_name__, summary, or route.name),
and preserve existing behavior for description and inputSchema.
result = await proxy_fn(**prepared)
return json.dumps(result, default=str)
Avoid double‑encoding strings in non‑model wrapper.
Return strings as‑is; JSON‑encode only non‑string values.
Apply this diff:
- result = await proxy_fn(**prepared)
- return json.dumps(result, default=str)
+ result = await proxy_fn(**prepared)
+ return result if isinstance(result, str) else json.dumps(result, default=str)
🤖 Prompt for AI Agents
In deploy/docker/core/mcp_service.py around lines 444 to 446, the current code
JSON-encodes every result which causes already-string responses to be
double-encoded; change the return logic to detect string results and return them
directly (and optionally decode bytes to str) and only call json.dumps(result,
default=str) for non-string values so strings are not wrapped in additional
quotes.
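The double-encoding the comment warns about is easy to reproduce in isolation; `wrap_result` below is a stand-in for the wrapper's return path with the suggested guard applied:

```python
import json

def wrap_result(result):
    """Return strings as-is; JSON-encode only non-string values."""
    return result if isinstance(result, str) else json.dumps(result, default=str)

# Without the guard, a string that already holds JSON gets wrapped again:
already_json = json.dumps({"ok": True})        # '{"ok": true}'
double = json.dumps(already_json)              # '"{\\"ok\\": true}"'
assert json.loads(double) == already_json      # decodes to a str, not a dict

# With the guard, one round-trip recovers the original object:
assert json.loads(wrap_result(already_json)) == {"ok": True}
assert wrap_result(42) == "42"                 # non-strings still get encoded
```
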
# Use query params as fallback for default values
filter_param = body.filter
if body.filter == FilterType.FIT and f is not None:
    try:
        filter_param = FilterType(f)
    except ValueError:
        raise HTTPException(400, f"Invalid filter value: {f}. Must be one of: raw, fit, bm25, llm")

query_param = q if body.query is None and q is not None else body.query
cache_param = c if body.cache == "0" and c is not None else body.cache
provider_param = body.provider
Serialize enum value in JSON response
`filter_param` is an Enum; JSONResponse may not serialize it as you expect. Return the underlying string.
return JSONResponse({
"url": body.url,
- "filter": filter_param,
+ "filter": filter_param.value,
"query": query_param,
"cache": cache_param,
"markdown": markdown,
"success": True
})
Also applies to: 334-343
🧰 Tools
🪛 Ruff (0.13.1)
324-324: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 318-329 (and likewise adjust lines
~334-343), filter_param is an Enum and may not JSON-serialize to its string
value; convert Enum members to their underlying string before returning or
including in JSON responses. Update the code to map filter_param =
filter_param.value (or str(filter_param) if using string values) after
determining it, and do the same for any other Enum fields returned in these
blocks so the response contains plain strings instead of Enum objects.
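The risk can be shown with the stdlib encoder. This `FilterType` is a plain-Enum stand-in; if the real one mixes in `str`, stdlib `json` would happen to serialize it, but returning `.value` is unambiguous either way:

```python
import json
from enum import Enum

class FilterType(Enum):  # stand-in; the real enum lives in the server's schemas
    RAW = "raw"
    FIT = "fit"

filter_param = FilterType.FIT

# A plain Enum is not JSON-serializable at all:
try:
    json.dumps({"filter": filter_param})
except TypeError as exc:
    print(f"plain Enum fails: {exc}")

# Returning the underlying value serializes cleanly:
payload = json.dumps({"filter": filter_param.value})
assert payload == '{"filter": "fit"}'
```
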
Capture a full-page PNG screenshot of the specified URL, waiting an optional delay before capture, returning base64 data if output_path is omitted.
Use when you need an image snapshot of the rendered page. It is strongly recommended to provide `output_path` to save the screenshot to disk.
This avoids returning large base64 payloads that can exceed token/context limits. When `output_path` is provided, the response returns the saved file path instead of inline data.

Note on screenshot_wait_for:
- Positive values: Wait the specified number of seconds before capturing
- Zero or negative values: Accepted but treated as no wait (immediate capture)
- Default: 0.0 seconds (immediate capture after page load)
"""
Doc default mismatch for screenshot_wait_for
Schema default is 2.0s, docstring says 0.0. Align the text to avoid confusion.
- - Default: 0.0 seconds (immediate capture after page load)
+ - Default: 2.0 seconds (see ScreenshotRequest.screenshot_wait_for)
🤖 Prompt for AI Agents
In deploy/docker/server.py around lines 392 to 400, the docstring for
screenshot_wait_for states the default is 0.0s while the schema default is 2.0s;
update the docstring to match the schema by changing the Default line to "2.0
seconds (wait before capture)" and keep the notes that positive values wait the
specified seconds and zero/negative are treated as no wait, so the documentation
and code defaults are aligned.
def test_js_execution_returns_values():
    """Test that JavaScript execution returns actual values"""

    # Test with a simple HTML page (using httpbin which returns HTML)
    url = "https://httpbin.org/html"

    # Scripts that should return actual values
    scripts = [
        "return document.title || 'No title found'",  # Handle empty title case
        "return document.querySelectorAll('h1').length",
        "return Array.from(document.querySelectorAll('p')).map(p => p.textContent.substring(0, 50))",
        "return 'test string'",  # Simple string test
        "return 42"  # Simple number test
    ]

    response = requests.post("http://localhost:11235/execute_js", json={
        "url": url,
        "scripts": scripts
    })

    if response.status_code != 200:
        print(f"❌ Request failed with status {response.status_code}")
        print(f"Response: {response.text}")
        return False

    result = response.json()

    # Check if js_execution_result exists
    js_result = result.get('js_execution_result')
    if not js_result:
        print("❌ No js_execution_result in response")
        return False

    # Check if return_values exists (our fix)
    return_values = js_result.get('return_values')
    if not return_values:
        print("❌ No return_values in js_execution_result")
        print(f"js_execution_result keys: {list(js_result.keys())}")
        return False

    print("✅ return_values found in response!")
    print(f"Number of return values: {len(return_values)}")

    # Verify we have the expected number of return values
    if len(return_values) != len(scripts):
        print(f"❌ Expected {len(scripts)} return values, got {len(return_values)}")
        return False

    # Print the actual return values
    for i, value in enumerate(return_values):
        print(f"Script {i+1} returned: {repr(value)}")

    # Verify the first script (document.title) returns a string
    title = return_values[0]
    if title is None:
        print("⚠️ Title is None (empty title on this page)")
    elif not isinstance(title, str):
        print(f"❌ Expected title to be string, got {type(title)}")
        return False
    else:
        print(f"✅ Page title: '{title}'")

    # Verify the second script (h1 count) returns a number
    h1_count = return_values[1]
    if not isinstance(h1_count, (int, float)):
        print(f"❌ Expected h1_count to be number, got {type(h1_count)}")
        return False

    print(f"✅ H1 count: {h1_count}")

    # Verify the third script returns an array
    paragraphs = return_values[2]
    if not isinstance(paragraphs, list):
        print(f"❌ Expected paragraphs to be list, got {type(paragraphs)}")
        return False

    print(f"✅ Found {len(paragraphs)} paragraph excerpts")

    # Verify the fourth script (simple string) returns a string
    test_str = return_values[3]
    if test_str != 'test string':
        print(f"❌ Expected 'test string', got {repr(test_str)}")
        return False

    print(f"✅ Simple string test: '{test_str}'")

    # Verify the fifth script (simple number) returns a number
    test_num = return_values[4]
    if test_num != 42:
        print(f"❌ Expected 42, got {repr(test_num)}")
        return False

    print(f"✅ Simple number test: {test_num}")

    print("🎉 All tests passed! JavaScript execution now returns actual values.")
    return True
Convert print/return style to real pytest assertions (tests won’t fail as written).
The functions return True/False and rely on prints. Pytest ignores return values; failures won’t be detected. Replace with asserts and use response.raise_for_status().
Apply this diff:
@@
-import requests
-import json
-import time
+import requests
+import json
+import pytest
@@
-def test_js_execution_returns_values():
+@pytest.mark.integration
+def test_js_execution_returns_values():
@@
- response = requests.post("http://localhost:11235/execute_js", json={
- "url": url,
- "scripts": scripts
- })
-
- if response.status_code != 200:
- print(f"❌ Request failed with status {response.status_code}")
- print(f"Response: {response.text}")
- return False
+ response = requests.post(
+ "http://localhost:11235/execute_js",
+ json={"url": url, "scripts": scripts},
+ timeout=30,
+ )
+ response.raise_for_status()
@@
- js_result = result.get('js_execution_result')
- if not js_result:
- print("❌ No js_execution_result in response")
- return False
+ js_result = result.get('js_execution_result')
+ assert js_result is not None, "js_execution_result missing"
@@
- return_values = js_result.get('return_values')
- if not return_values:
- print("❌ No return_values in js_execution_result")
- print(f"js_execution_result keys: {list(js_result.keys())}")
- return False
+ return_values = js_result.get('return_values')
+ assert return_values is not None, f"return_values missing; keys={list(js_result.keys())}"
@@
- if len(return_values) != len(scripts):
- print(f"❌ Expected {len(scripts)} return values, got {len(return_values)}")
- return False
+ assert len(return_values) == len(scripts), f"Expected {len(scripts)} return values, got {len(return_values)}"
@@
- for i, value in enumerate(return_values):
- print(f"Script {i+1} returned: {repr(value)}")
+ for i, value in enumerate(return_values):
+ print(f"Script {i+1} returned: {value!r}")
@@
- if title is None:
- print("⚠️ Title is None (empty title on this page)")
- elif not isinstance(title, str):
- print(f"❌ Expected title to be string, got {type(title)}")
- return False
- else:
- print(f"✅ Page title: '{title}'")
+ if title is None:
+ pytest.skip("Title is None on this page (acceptable)")
+ assert isinstance(title, str), f"Expected title to be string, got {type(title)}"
@@
- if not isinstance(h1_count, (int, float)):
- print(f"❌ Expected h1_count to be number, got {type(h1_count)}")
- return False
+ assert isinstance(h1_count, (int, float)), f"h1_count type: {type(h1_count)}"
@@
- if not isinstance(paragraphs, list):
- print(f"❌ Expected paragraphs to be list, got {type(paragraphs)}")
- return False
+ assert isinstance(paragraphs, list), f"paragraphs type: {type(paragraphs)}"
@@
- if test_str != 'test string':
- print(f"❌ Expected 'test string', got {repr(test_str)}")
- return False
+ assert test_str == 'test string', f"Expected 'test string', got {test_str!r}"
@@
- if test_num != 42:
- print(f"❌ Expected 42, got {repr(test_num)}")
- return False
+ assert test_num == 42, f"Expected 42, got {test_num!r}"
@@
- print("🎉 All tests passed! JavaScript execution now returns actual values.")
- return True
+ print("🎉 JavaScript execution returns expected values.")
@@
-def test_error_handling():
+@pytest.mark.integration
+def test_error_handling():
@@
- response = requests.post("http://localhost:11235/execute_js", json={
- "url": "https://httpbin.org/html",
- "scripts": scripts
- })
-
- if response.status_code != 200:
- print(f"❌ Error test request failed with status {response.status_code}")
- return False
+ response = requests.post(
+ "http://localhost:11235/execute_js",
+ json={"url": "https://httpbin.org/html", "scripts": scripts},
+ timeout=30,
+ )
+ response.raise_for_status()
@@
- if len(return_values) != 3:
- print(f"❌ Expected 3 return values in error test, got {len(return_values)}")
- return False
+ assert len(return_values) == 3, f"Expected 3 return values, got {len(return_values)}"
@@
- if not isinstance(return_values[0], str):
- print(f"❌ First value should be title string, got {type(return_values[0])}")
- return False
+ assert isinstance(return_values[0], str), f"First value type: {type(return_values[0])}"
@@
- if not isinstance(return_values[1], dict) or 'error' not in str(return_values[1]).lower():
- print(f"❌ Second value should be error object, got {return_values[1]}")
- return False
+ assert isinstance(return_values[1], dict) and 'error' in str(return_values[1]).lower(), f"Unexpected error object: {return_values[1]}"
@@
- if return_values[2] != 'success after error':
- print(f"❌ Third value should be 'success after error', got {return_values[2]}")
- return False
+ assert return_values[2] == 'success after error', f"Got {return_values[2]}"
@@
- print("✅ Error handling test passed!")
- return True
+ print("✅ Error handling path returns mixed success/error as expected.")
@@
-if __name__ == "__main__":
- print("Testing JavaScript execution fix...")
- print("Make sure the Docker server is running on localhost:11235")
- print()
-
- # Test basic functionality
- success1 = test_js_execution_returns_values()
- print()
-
- # Test error handling
- success2 = test_error_handling()
- print()
-
- if success1 and success2:
- print("🎉 All tests passed! The fix is working correctly.")
- else:
- print("❌ Some tests failed. Check the output above.")
+if __name__ == "__main__":
+ pytest.main([__file__, "-q"])
Also applies to: 109-155
🧰 Tools
🪛 Ruff (0.13.1)
27-27: Probable use of `requests` call without timeout (S113)
62-62: Use explicit conversion flag; replace with conversion flag (RUF010)
93-93: Use explicit conversion flag; replace with conversion flag (RUF010)
101-101: Use explicit conversion flag; replace with conversion flag (RUF010)
🤖 Prompt for AI Agents
In tests/mcp/test_mcp_js_exec.py around lines 12 to 107 (and similarly lines 109
to 155), the test uses prints and True/False returns so pytest won’t fail on
errors; replace the procedural checks with pytest assertions and call
response.raise_for_status() after the POST to surface HTTP errors. Specifically:
after the requests.post call call response.raise_for_status(); replace checks
like "if response.status_code != 200: print(...); return False" with assert
response.status_code == 200 (or simply response.raise_for_status()); replace
existence checks and type checks (js_result, return_values, lengths, types, and
expected values) with assert statements including helpful messages (e.g. assert
'js_execution_result' in result, assert isinstance(title, str), assert
len(return_values) == len(scripts), assert return_values[3] == 'test string',
etc.), and remove return True/False and prints that indicate pass/fail so pytest
will correctly mark failures.
response = requests.post("http://localhost:11235/execute_js", json={
    "url": url,
    "scripts": scripts
})
🛠️ Refactor suggestion | 🟠 Major
Add request timeouts to avoid hanging tests.
Both HTTP calls lack timeouts; add a reasonable timeout.
-response = requests.post("http://localhost:11235/execute_js", json={
+response = requests.post("http://localhost:11235/execute_js", json={
"url": url,
"scripts": scripts
-})
+}, timeout=30)
@@
-response = requests.post("http://localhost:11235/execute_js", json={
+response = requests.post("http://localhost:11235/execute_js", json={
"url": "https://httpbin.org/html",
"scripts": scripts
-})
+}, timeout=30)
Also applies to: 118-121
🧰 Tools
🪛 Ruff (0.13.1)
27-27: Probable use of `requests` call without timeout (S113)
🤖 Prompt for AI Agents
In tests/mcp/test_mcp_js_exec.py around lines 27-31 (and likewise at lines
118-121), the requests.post calls lack timeouts which can cause tests to hang;
update these calls to include a reasonable timeout parameter (e.g., timeout=5 or
another appropriate value) passed to requests.post so the test fails fast on
network issues; ensure both HTTP calls are modified consistently and choose a
timeout that balances CI network variability and responsiveness.
🚀 MCP Transport Modernization
Migrates from deprecated SSE to standard HTTP transport with production-ready architecture
🏗️ Architectural Improvements
Service-Oriented Structure
Key Technical Wins
🧪 New Testing Infrastructure
Coverage for all seven MCP tools: md, html, screenshot, pdf, execute_js, crawl, ask
🔒 Security Enhancements
Path Traversal Protection
- Output paths resolved with os.path.realpath() to prevent directory escape
- Exports confined to /tmp/crawl4ai-exports (from config.yml)
- Overridable via the CRAWL4AI_EXPORT_DIR environment variable
Security Implementation
🔄 Migration Notes
Both legacy (f/q/c) and modern (filter/query/cache) parameter formats supported
🔄 Clean Implementation
This represents a complete architectural refactor with 6 focused commits vs the 22 iterative commits in PR #1519:
🚀 Impact
Ready for Review: Architecture and implementation assessment requested.