
Conversation

@ntohidi (Collaborator) commented on Sep 29, 2025

🚀 Crawl4AI v0.7.5: The Docker Hooks & Security Update

This PR introduces Crawl4AI v0.7.5 with major new features focused on extensibility and security.

🎯 What's New

🔧 Docker Hooks System

  • Complete pipeline customization with user-provided Python functions
  • Eight hook points: on_browser_created, on_page_context_created, before_goto, after_goto, on_user_agent_updated, on_execution_started, before_retrieve_html, before_return_html
  • Safe execution with AST validation, timeout protection, and error isolation
  • Real working examples with authentication, performance optimization, and content processing
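
A minimal request sketch for attaching a hook through the Docker API, assuming the payload carries a hooks mapping of hook-point names to HookConfig-style objects (code plus optional timeout, per the commit notes further down); the endpoint port and exact field names are illustrative, not the confirmed schema.

import requests

API_BASE_URL = "http://localhost:11234"  # adjust to your deployment

# User-provided hook source: an async function run before navigation
BEFORE_GOTO_HOOK = """
async def hook(page, context, url, **kwargs):
    # Add a custom header before the page loads
    await page.set_extra_http_headers({"X-Demo-Header": "crawl4ai-hooks"})
    return page
"""

payload = {
    "urls": ["https://httpbin.org/headers"],
    # Assumed shape: hook point -> {code, timeout}; check the HookConfig schema
    "hooks": {
        "before_goto": {"code": BEFORE_GOTO_HOOK, "timeout": 30},
    },
}

resp = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json().get("hooks_info"))  # execution stats/log, if present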

🤖 Enhanced LLM Integration

  • Custom provider support (OpenAI, Anthropic, Gemini, local models)
  • Temperature and base_url configuration for fine-tuned control
  • Multi-provider environment variable support
  • Docker API integration with enhanced LLM parameters
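
A hedged sketch of a per-request override against the Docker /md endpoint, following the provider/temperature/base_url flow shown in the review walkthrough below; the request body keys are assumptions rather than the confirmed schema.

import requests

API_BASE_URL = "http://localhost:11234"  # illustrative deployment address

payload = {
    "url": "https://example.com",
    "provider": "openai/gpt-4o-mini",            # litellm-style provider id (illustrative)
    "temperature": 0.2,                           # 0.0-2.0 per the release notes
    "base_url": "https://my-proxy.internal/v1",   # optional proxy or alternative endpoint
}

resp = requests.post(f"{API_BASE_URL}/md", json=payload, timeout=60)
print(resp.status_code, resp.json())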

🔒 HTTPS Preservation

  • New preserve_https_for_internal_links=True flag
  • Maintains secure protocols throughout crawling
  • Supports modern web security requirements
  • Prevents authentication cookie loss and security warnings
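
A short usage sketch from Python, assuming the flag is exposed on CrawlerRunConfig as the config changes in this PR suggest:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Keep same-domain links on https:// even when the server answers with http URLs
    config = CrawlerRunConfig(preserve_https_for_internal_links=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.links)  # internal links retain the https:// scheme

asyncio.run(main())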

🐍 Python 3.10+ Support

  • Dropped Python 3.9 support to enable modern language features; documentation updated accordingly

🛠️ Bug Fixes & Improvements

Major Fixes

Community-Reported Issues

  • Multiple GitHub issues and Discord feedback addressed
  • Enhanced proxy configuration and error handling
  • Improved dependency management and compatibility

🔄 Breaking Changes

  • Python 3.10+ Required: Upgrade from Python 3.9
  • Proxy Parameter Deprecated: Use new proxy_config structure
  • New Dependency: Added cssselect for better CSS handling
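
A migration sketch for the proxy change, mirroring the proxy_config shape suggested in the documentation fix later in this review; treat the exact keys as illustrative.

from crawl4ai import BrowserConfig

# Before (deprecated): a plain proxy string
# browser_cfg = BrowserConfig(proxy="http://user:pass@proxy:8080")

# After: structured proxy_config
browser_cfg = BrowserConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass",
    },
)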

📁 Files Changed

New Files

  • docs/releases_review/demo_v0.7.5.py - Working demo showcasing all new features
  • docs/blog/release-v0.7.5.md - Complete release notes

Updated Files

  • README.md - Added v0.7.5 highlights and updated version references
  • docs/md_v2/blog/index.md - Updated blog index with latest release
  • crawl4ai/__version__.py - Version bump to 0.7.5

🧪 Testing

  • ✅ Working demo created with real examples
  • ✅ All new features tested with live URLs (httpbin.org, quotes.toscrape.com)
  • ✅ Docker hooks system validated with actual API calls
  • ✅ HTTPS preservation tested with real sites
  • ✅ LLM integration verified with multiple providers

📚 Documentation

  • Complete release notes with real examples
  • Working demo file that users can run
  • Updated README with version highlights
  • Blog index updated for visibility

Summary by CodeRabbit

  • New Features

    • Docker Hooks System (8 hook points) and per-request hooks support; Marketplace and website-to-API example projects; enhanced LLM controls (per-request provider, temperature, base_url).
  • Improvements

    • Option to preserve HTTPS for internal links; stealth browsing refinements; streaming/unified crawl flows; proxy_config-first handling; timezone-aware timestamps.
  • Bug Fixes

    • URL normalization, serialization, and API error-handling fixes.
  • Documentation

    • v0.7.5 release notes, tutorials, examples, demos, and playground updates.
  • Deprecations / Breaking Changes

    • Python >=3.10 required; proxy string deprecated (use proxy_config); added cssselect dependency.

emmanuel-ferdman and others added 30 commits May 13, 2025 00:04
Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377
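
As a usage sketch, hook points can be discovered through the /hooks/info endpoint mentioned above; the response shape is not documented here, so only the raw JSON is printed.

import requests

API_BASE_URL = "http://localhost:11234"  # illustrative deployment address

resp = requests.get(f"{API_BASE_URL}/hooks/info", timeout=30)
resp.raise_for_status()
print(resp.json())  # available hook points and expected signatures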
…ncation. ref #1253

  Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error
Update URL seeding examples to use proper async context managers
Fix examples in README.md
fix(docker-api): migrate to modern datetime library API
Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291
fix(docker): Fix LLM API key handling for multi-provider support
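
In practice this means only the provider-specific environment variable needs to be exported before starting the Docker server; a minimal sketch using the variable names from the commit message (values are placeholders):

import os

# litellm resolves the right key per provider automatically; no api_key_env mapping needed
os.environ["OPENAI_API_KEY"] = "sk-..."          # for OpenAI providers
os.environ["GEMINI_API_TOKEN"] = "your-key"      # for Gemini providers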
…ted examples (#1330)

- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED
This commit adds a complete web scraping API example that demonstrates how to get structured data from any website and use it like an API, built on the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library, which was removed by PR #1368 (merged via 437395e).

Integration tested against the 0.7.4 Docker container. Reintroducing the cssselect package eliminated the errors seen in logs, and excluded_selector functionality was restored.

Refs: #1405
… deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly
…ration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.
feat(docker): Add temperature and base_url parameters for LLM configuration
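
A hedged illustration of the 4-tier priority using the environment variable names listed above; the request-level override shape follows the /md flow in the walkthrough and is an assumption.

import os
import requests

# Tier 3: global env default
os.environ["LLM_TEMPERATURE"] = "0.7"
# Tier 2: provider-specific env beats the global default
os.environ["OPENAI_TEMPERATURE"] = "0.3"

# Tier 1: explicit request parameters win over any environment setting
payload = {
    "url": "https://example.com",
    "provider": "openai/gpt-4o-mini",           # illustrative provider id
    "temperature": 0.1,                          # overrides both env values above
    "base_url": "https://my-proxy.internal/v1",  # optional custom endpoint
}
print(requests.post("http://localhost:11234/md", json=payload, timeout=60).status_code)
# Tier 4: library defaults apply when none of the above are set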
…ptive-strategies-docs

Update Quickstart and Adaptive Strategies documentation
- Return comprehensive error messages along with status codes for API internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.
fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy
…and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.
…ate .gitignore to include test_scripts directory.
…ring crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
ntohidi and others added 9 commits September 16, 2025 15:45
fix(deep-crawl): BestFirst priority inversion
feat(StealthAdapter): fix stealth features for Playwright integration
#1505 fix(api): update config handling to only set base config if not provided by user
- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation
@ntohidi requested a review from unclecode on September 29, 2025 at 10:11
coderabbitai bot (Contributor) commented Sep 29, 2025

Walkthrough

Bumps release to v0.7.5 and introduces: Docker user hooks system, per-request LLM temperature/base_url propagation, HTTPS preservation for internal links, Playwright stealth adapter, proxy deprecation/autoconversion, serialization/logging fixes, many docs/examples/tests, and Python 3.10+ requirement.

Changes

  • Version & packaging (crawl4ai/__version__.py, pyproject.toml, setup.py, requirements.txt, README.md, .gitignore): Bump version to 0.7.5; require Python >=3.10; add cssselect; update docs/README release headings; expand .gitignore (add *.db, .env*, test_scripts/, scripts/).
  • HTTPS preservation (crawl4ai/utils.py, crawl4ai/async_webcrawler.py, crawl4ai/content_scraping_strategy.py, crawl4ai/async_configs.py, CHANGELOG.md, docs/md_v2/core/deep-crawling.md, tests/test_preserve_https_for_internal_links.py): Add preserve_https_for_internal_links flag across configs; propagate original_scheme; extend normalize_url* APIs to optionally preserve HTTPS for internal links; docs and tests added.
  • Playwright stealth (crawl4ai/browser_adapter.py, crawl4ai/browser_manager.py): Add StealthAdapter, apply stealth per page, simplify startup/teardown, and deprecate legacy stealth context handling.
  • Proxy deprecation & parsing (crawl4ai/async_configs.py, crawl4ai/browser_manager.py, docs/md_v2/advanced/proxy-security.md, tests/proxy/test_proxy_deprecation.py, tests/async/test_0.4.2_browser_manager.py, tests/memory/test_docker_config_gen.py): Deprecate string proxy, prefer proxy_config; add warnings and automatic conversion; enhance proxy parsing; update docs and tests.
  • Docker Hooks system (deploy/docker/hook_manager.py, deploy/docker/server.py, deploy/docker/api.py, deploy/docker/static/playground/index.html, docs/examples/docker_hooks_examples.py, docs/releases_review/demo_v0.7.5.py, tests/docker/test_hooks_client.py, tests/docker/test_hooks_comprehensive.py): New hook manager module and integration: validate/compile/execute user hooks with timeouts/isolation; server and API accept hooks_config, return hooks_info; UI/examples/tests added.
  • LLM parameter propagation, Docker (deploy/docker/api.py, deploy/docker/job.py, deploy/docker/schemas.py, deploy/docker/utils.py, deploy/docker/.llm.env.example, deploy/docker/config.yml, deploy/docker/README.md, docs/md_v2/core/docker-deployment.md, tests/docker/test_llm_params.py): Add per-request/provider temperature and base_url overrides; propagate through API/job flows; update schema models, env examples, utils resolution and tests.
  • Adaptive / embedding config (crawl4ai/adaptive_crawler.py, crawl4ai/async_configs.py, docs/md_v2/core/adaptive-crawling.md, docs/examples/adaptive_crawling/llm_config_example.py, tests/adaptive/test_llm_embedding.py): Accept LLMConfig instances or dicts for embedding config; add compatibility helpers; update examples and tests.
  • Deep-crawling scoring & logging (crawl4ai/deep_crawling/bff_strategy.py, crawl4ai/deep_crawling/bfs_strategy.py, tests/general/test_bff_scoring.py): Ensure real logging.Logger instances, invert score sign for priority queue semantics, adjust metadata scoring and tests.
  • Filters & serialization (crawl4ai/deep_crawling/filters.py, tests/docker/test_filter_deep_crawl.py): Expose constructor inputs (patterns, use_glob, reverse) on URLPatternFilter for serialization; add E2E tests.
  • Serialization & model output fixes (crawl4ai/async_configs.py, crawl4ai/models.py, crawl4ai/async_crawler_strategy.back.py): Avoid serializing private slots, include markdown in CrawlResult serialization, coerce deprecated props, and fix verbose flag reference.
  • Docker server: timezones & streaming (deploy/docker/api.py, deploy/docker/server.py, deploy/docker/static/playground/index.html, tests/docker/test_server_requests.py): Use timezone-aware timestamps, centralize streaming flow, include hooks_info in responses, adjust streaming headers and UI toggles, add tests.
  • New examples & website-to-api (docs/examples/website-to-api/*, docs/examples/adaptive_crawling/*, docs/releases_review/demo_v0.7.5.py): Add full Web Scraper API example (FastAPI + UI + library + tests), adaptive LLM examples, and release demo scripts.
  • Docs, release notes & website (README.md, CHANGELOG.md, docs/**, mkdocs.yml): Add v0.7.5 release notes and blog, document hooks, LLM config, HTTPS flag, proxy deprecation; add marketplace docs, assets, and nav updates.
  • Tests & CI coverage (tests/**): Many new/updated tests covering hooks, LLM params, proxy deprecation, deep crawl scoring, serialization, streaming, adaptive embedding demos.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant AsyncWebCrawler
  participant ContentStrategy as ContentScrapingStrategy
  participant Utils as utils.normalize_url

  Client->>AsyncWebCrawler: arun(url, **kwargs)
  AsyncWebCrawler->>AsyncWebCrawler: original_scheme = urlparse(url).scheme
  AsyncWebCrawler->>ContentStrategy: aprocess_html(..., original_scheme, preserve_https)
  ContentStrategy->>Utils: normalize_url(href, base_url, preserve_https, original_scheme)
  Utils-->>ContentStrategy: normalized URL (HTTPS preserved for internal links)
  ContentStrategy-->>AsyncWebCrawler: processed content
  AsyncWebCrawler-->>Client: results
sequenceDiagram
  participant Client
  participant Server
  participant HookMgr as UserHookManager
  participant Crawler
  participant Strategy

  Client->>Server: POST /crawl {urls, hooks.code, ...}
  Server->>HookMgr: validate + compile hooks
  HookMgr-->>Server: compiled hooks, errors
  Server->>Crawler: create + attach hooks (wrapped)
  Crawler->>Strategy: run crawl (hooks invoked around lifecycle)
  Strategy-->>Crawler: results + hook logs
  Server-->>Client: response {results, hooks_info, status}
sequenceDiagram
  participant Manager as BrowserManager
  participant Playwright
  participant Page
  participant Stealth as StealthAdapter

  Manager->>Playwright: start()
  Manager->>Playwright: browser.new_page()
  Playwright-->>Manager: Page
  Manager->>Stealth: apply_stealth(Page)
  Stealth-->>Manager: ok / warn
  Manager-->>Caller: Page ready (stealth if enabled)
sequenceDiagram
  participant Client
  participant Server
  participant API as deploy/docker/api.py
  participant Utils as deploy/docker/utils.py
  participant LLM

  Client->>Server: POST /md {provider, temperature, base_url}
  Server->>API: handle_markdown_request(...)
  API->>Utils: get_llm_api_key/provider/temp/base_url
  Utils-->>API: resolved params (env or payload)
  API->>LLM: request with provider + overrides
  LLM-->>API: content
  API-->>Client: Markdown JSON

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • 2025 MAY Bug Fixes #1175 — Modifies normalize_url in crawl4ai/utils.py; likely adjacent to the HTTPS-preservation changes here.

Suggested reviewers

  • unclecode

Poem

A rabbit taps its keys with cheer,
New hooks hop in, the signals clear.
HTTPS keeps its hat of blue,
Stealthy pages slip right through.
LLMs warm their tea—v0.7.5, hooray! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Description Check (⚠️ Warning): The description provides a thorough account of new features, bug fixes, breaking changes, files changed, testing, and documentation updates, but it does not follow the repository’s required template headings and omits the mandatory checklist section. Resolution: Please update the description to include the exact template headings—“## Summary”, “## List of files changed and why”, “## How Has This Been Tested?”, and “## Checklist:”—and complete the checklist to confirm adherence to style guidelines, documentation updates, and testing requirements.

✅ Passed checks (2 passed)

  • Title Check (✅ Passed): The pull request title “Release/v0.7.5” directly indicates the primary intent of the changeset, which is bumping the project to version 0.7.5. It succinctly conveys the main update without extraneous details.
  • Docstring Coverage (✅ Passed): Docstring coverage is 83.72%, which is sufficient. The required threshold is 80.00%.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 34

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
deploy/docker/c4ai-doc-context.md (1)

8723-8735: Keep the example aligned with the current arun API

This snippet passes extraction_strategy=strategy directly to crawler.arun() while also creating a CrawlerRunConfig() that doesn’t carry the strategy. In v0.7.x the recommended (and documented earlier in this same file) pattern is to put the strategy inside the run config; the top-level arun() no longer accepts extraction_strategy and will raise a TypeError. Please move the strategy into CrawlerRunConfig to keep the example executable and consistent with the guidance above. Suggested fix:

-    config = CrawlerRunConfig()
+    config = CrawlerRunConfig(extraction_strategy=strategy)
...
-        result = await crawler.arun(
-            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
-            extraction_strategy=strategy,
-            config=config
-        )
+        result = await crawler.arun(
+            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
+            config=config
+        )
deploy/docker/utils.py (1)

22-41: Harden load_config() against missing llm key.

Avoid KeyError when config.yml lacks an llm section.

-    with open(config_path, "r") as config_file:
-        config = yaml.safe_load(config_file)
+    with open(config_path, "r") as config_file:
+        config = yaml.safe_load(config_file) or {}
+    config.setdefault("llm", {})
crawl4ai/async_configs.py (1)

550-586: Round‑trip gaps in BrowserConfig.from_kwargs.

viewport and sleep_on_close are emitted in to_dict() but not accepted in from_kwargs(), breaking round‑trip via dump/load/clone.

         return BrowserConfig(
@@
-            viewport_width=kwargs.get("viewport_width", 1080),
-            viewport_height=kwargs.get("viewport_height", 600),
+            viewport_width=kwargs.get("viewport_width", 1080),
+            viewport_height=kwargs.get("viewport_height", 600),
+            viewport=kwargs.get("viewport"),
@@
-            extra_args=kwargs.get("extra_args", []),
+            extra_args=kwargs.get("extra_args", []),
+            sleep_on_close=kwargs.get("sleep_on_close", False),

Also applies to: 588-624

deploy/docker/api.py (1)

456-466: Force non‑stream mode for this endpoint.

If caller sets stream=True in crawler_config, arun_many will return an async generator and await partial_func() will fail. Explicitly disable streaming here.

         browser_config = BrowserConfig.load(browser_config)
-        crawler_config = CrawlerRunConfig.load(crawler_config)
+        crawler_config = CrawlerRunConfig.load(crawler_config)
+        # Non-streaming endpoint must return materialized results
+        if getattr(crawler_config, "stream", False):
+            crawler_config.stream = False
docs/md_v2/core/docker-deployment.md (1)

61-71: Fix release version references to 0.7.5.

This section still instructs users to pull the 0.7.3 image and states that latest points to 0.7.3. For the 0.7.5 release this is incorrect and will cause users to run the stale build. Please update the version numbers and accompanying note to 0.7.5 before publishing these docs.

crawl4ai/utils.py (2)

1791-1812: Last-attempt raise introduces behavior change; dead branch below.

In RateLimitError handling you now re-raise on the last attempt (Line 1793), but the later else branch (Lines 1804–1811) is unreachable. This both changes the prior contract (callers likely expect an error payload) and leaves dead code. Prefer one clear policy: either return a structured error or re-raise consistently.

Proposed fix: keep non-throwing contract and remove the dead branch.

@@
-        except RateLimitError as e:
-            print("Rate limit error:", str(e))
-            if attempt == max_attempts - 1:
-                # Last attempt failed, raise the error.
-                raise
-
-            # Check if we have exhausted our max attempts
-            if attempt < max_attempts - 1:
-                # Calculate the delay and wait
-                delay = base_delay * (2**attempt)  # Exponential backoff formula
-                print(f"Waiting for {delay} seconds before retrying...")
-                time.sleep(delay)
-            else:
-                # Return an error response after exhausting all retries
-                return [
-                    {
-                        "index": 0,
-                        "tags": ["error"],
-                        "content": ["Rate limit error. Please try again later."],
-                    }
-                ]
+        except RateLimitError as e:
+            print("Rate limit error:", str(e))
+            if attempt == max_attempts - 1:
+                # Exhausted: return structured error (preserve prior behavior)
+                return [{
+                    "index": 0,
+                    "tags": ["error"],
+                    "content": ["Rate limit error. Please try again later."]
+                }]
+            # Exponential backoff
+            delay = base_delay * (2**attempt)
+            print(f"Waiting for {delay} seconds before retrying...")
+            time.sleep(delay)

2126-2142: Duplicate normalize_url definition causes confusion.

There are two normalize_url functions; Python will keep the latter one (Lines 2146+). The earlier simple version (Lines 2126–2142) is dead code and misleading.

-def normalize_url(href, base_url):
-    """Normalize URLs to ensure consistent format"""
-    from urllib.parse import urljoin, urlparse
-
-    # Parse base URL to get components
-    parsed_base = urlparse(base_url)
-    if not parsed_base.scheme or not parsed_base.netloc:
-        raise ValueError(f"Invalid base URL format: {base_url}")
-    
-    if  parsed_base.scheme.lower() not in ["http", "https"]:
-        # Handle special protocols
-        raise ValueError(f"Invalid base URL format: {base_url}")
-    cleaned_href = href.strip()
-
-    # Use urljoin to handle all cases
-    return urljoin(base_url, cleaned_href)
+# (Removed duplicate normalize_url; single extended version lives below)
crawl4ai/adaptive_crawler.py (1)

1476-1480: Links mutation assumes dict; breaks when Links object is returned.

Here you subscript result.links as a dict unconditionally. Elsewhere you correctly branch by type. This will raise at runtime if links is a model object.

-            # Filter our all links do not have head_date
-            if hasattr(result, 'links') and result.links:
-                result.links['internal'] = [link for link in result.links['internal'] if link.get('head_data')]
-                # For now let's ignore external links without head_data
-                # result.links['external'] = [link for link in result.links['external'] if link.get('head_data')]
+            # Filter out links without head_data
+            if hasattr(result, 'links') and result.links:
+                if isinstance(result.links, dict):
+                    internal = [l for l in result.links.get('internal', []) if l.get('head_data')]
+                    result.links['internal'] = internal
+                    # Optionally filter external similarly
+                    # result.links['external'] = [l for l in result.links.get('external', []) if l.get('head_data')]
+                else:
+                    # Links object with .internal/.external lists of Link models
+                    result.links.internal = [l for l in result.links.internal if getattr(l, 'head_data', None)]
+                    # Optionally: result.links.external = [l for l in result.links.external if getattr(l, 'head_data', None)]
🧹 Nitpick comments (61)
docs/examples/website-to-api/requirements.txt (1)

1-5: Pin example dependencies to tested versions.

Leaving these requirements unconstrained makes the demo brittle—any future major release of FastAPI, Uvicorn, Pydantic, or LiteLLM (or even crawl4ai itself) can introduce breaking changes and silently break the example. Please lock each dependency to the exact versions you validated for v0.7.5 (or at least constrain to compatible ranges) so users can reproduce the documented behavior.

CHANGELOG.md (1)

8-17: Align changelog entry with the 0.7.5 release

This branch is publishing v0.7.5, but the new note lives under a fresh “Unreleased” section, while another “Unreleased” still exists further down. Please fold this into a ## [0.7.5] - 2025-09-29 section (or move the flag under the upcoming release heading) so the changelog isn’t left with duplicate Unreleased buckets.

tests/docker/test_filter_deep_crawl.py (5)

11-11: Parameterize BASE_URL (avoid hard-coded port).

Read from env with a sane default so CI/dev runs don't depend on port 11234 being used.

-BASE_URL = "http://localhost:11234/"  # Adjust port as needed
+import os
+BASE_URL = os.getenv("C4A_BASE_URL", "http://localhost:8000/")

21-74: Use try/except/else for clearer success path.

Move the success print/return into an else: block to satisfy TRY300 and clarify flow.

-    try:
+    try:
         async with Crawl4aiDockerClient(
             base_url=BASE_URL,
             verbose=True,
         ) as client:
             ...
-        print("\n✅ Docker client test completed successfully!")
-        return True
+    except httpx.HTTPError as e:
+        print(f"❌ Docker client test failed (HTTP): {e}")
+        import traceback; traceback.print_exc()
+        return False
+    except Exception as e:
+        print(f"❌ Docker client test failed: {e}")
+        import traceback; traceback.print_exc()
+        return False
+    else:
+        print("\n✅ Docker client test completed successfully!")
+        return True

75-79: Avoid blind except Exception; catch httpx.HTTPError first.

Narrowing exceptions improves debuggability and addresses BLE001.

-    except Exception as e:
+    except httpx.HTTPError as e:
+        print(f"❌ REST API test failed (HTTP): {e}")
+        import traceback; traceback.print_exc()
+        return False
+    except Exception as e:
         print(f"❌ REST API test failed: {e}")
         import traceback
         traceback.print_exc()
         return False

Also applies to: 151-155


45-51: Prefer assertions over prints if this is meant for CI.

If this file lives under tests/, replace prints with asserts (pytest + pytest-asyncio). If it is a demo, consider moving it to docs/ or scripts/ to avoid CI discovery confusion.

Would you like a pytest-asyncio version with proper assertions and markers?

Also applies to: 124-131


52-70: Clarify result typing (list vs object vs stream).

Instead of hasattr checks, branch on known types: list, AsyncGenerator, or CrawlResult. This avoids surprising paths.

Also applies to: 132-146

deploy/docker/utils.py (5)

74-90: Silence ARG001 and document intent.

provider is unused by design; rename to _provider and note that litellm resolves env vars.

-def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> Optional[str]:
+def get_llm_api_key(config: Dict, _provider: Optional[str] = None) -> Optional[str]:
     """Get the appropriate API key based on the LLM provider.
@@
-    # Return None - litellm will automatically find the right environment variable
+    # Return None - litellm will automatically find the right environment variable
     return None

92-109: Validation always returns True — consider a debug log.

Returning (True, "") defers validation to litellm; add a debug log when no direct key is present to aid ops.


57-62: Use typing.Any in datetime_handler signature.

Minor typing nit.

-from typing import Dict, Optional
+from typing import Any, Dict, Optional
@@
-def datetime_handler(obj: any) -> Optional[str]:
+def datetime_handler(obj: Any) -> Optional[str]:

111-146: Consider config-based fallback for temperature/base_url.

Optional: if env vars are absent, read config.get("llm", {}).get("temperature"/"base_url") before returning None.

Do you want me to wire this in and update deploy/docker/api.py callers accordingly?

Also applies to: 148-172


174-181: Catch specific DNS errors (avoid blind except).

Narrow exception scope and avoid unused variable.

-    except Exception as e:
-        return False
+    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer, dns.resolver.NoNameservers, dns.exception.DNSException):
+        return False
crawl4ai/async_configs.py (5)

261-294: Proxy string parsing: add IPv6 safety and tests.

The split-on-":" logic will misparse IPv6 literals. Consider using urllib.parse for URL forms and a regex for colon-forms, or require scheme for IPv6. Add tests for:

  • http(s)://user:pass@host:port
  • socks5://host:port
  • ip:port and ip:port:user:pass
  • IPv6 [::1]:8080 with/without scheme

Typo: fa_user_agenr_generator → fa_user_agent_generator.

Minor readability fix.

-        fa_user_agenr_generator = ValidUAGenerator()
+        fa_user_agent_generator = ValidUAGenerator()
         if self.user_agent_mode == "random":
-            self.user_agent = fa_user_agenr_generator.generate(
+            self.user_agent = fa_user_agent_generator.generate(
                 **(self.user_agent_generator_config or {})
             )

1454-1462: Don’t raise on deprecated attrs; warn instead.

Raising in __setattr__ breaks backward compatibility whenever callers pass these kwargs. Emit a deprecation warning and keep setting (or translate to cache_mode) for one release.

-        if name in self._UNWANTED_PROPS and value is not all_params[name].default:
-            raise AttributeError(f"Setting '{name}' is deprecated. {self._UNWANTED_PROPS[name]}")
+        if name in self._UNWANTED_PROPS and value is not all_params[name].default:
+            warnings.warn(f"'{name}' is deprecated. {self._UNWANTED_PROPS[name]}", UserWarning, stacklevel=2)

Also applies to: 1069-1074


1502-1503: Use PAGE_TIMEOUT constant in from_kwargs.

Keep defaults consistent with __init__.

-            page_timeout=kwargs.get("page_timeout", 60000),
+            page_timeout=kwargs.get("page_timeout", PAGE_TIMEOUT),

223-347: Consolidate ProxyConfig definitions
Remove the duplicate ProxyConfig class in crawl4ai/proxy_strategy.py and have that module re-export ProxyConfig from crawl4ai/async_configs.py (e.g. via from .async_configs import ProxyConfig) with a deprecation warning stub, so there’s a single source-of-truth and no drift.

deploy/docker/hook_manager.py (6)

19-31: Annotate mutable class attributes as ClassVar to satisfy linters and intent.

Prevents them from being treated as instance attributes.

-from typing import Dict, Callable, Optional, Tuple, List, Any
+from typing import Dict, Callable, Optional, Tuple, List, Any, ClassVar
@@
-    HOOK_SIGNATURES = {
+    HOOK_SIGNATURES: ClassVar[Dict[str, List[str]]] = {
@@
-    DEFAULT_TIMEOUT = 30
+    DEFAULT_TIMEOUT: ClassVar[int] = 30

68-71: Nit: remove unnecessary f-string.

No placeholders present.

-                return False, f"Hook function must be async (use 'async def' instead of 'def')"
+                return False, "Hook function must be async (use 'async def' instead of 'def')"

168-176: Prefer logger.exception to capture stack traces for compilation failures.

Improves diagnostics.

-            logger.error(f"Hook compilation failed for {hook_point}: {str(e)}")
+            logger.exception("Hook compilation failed for %s: %s", hook_point, e)

322-327: Prefer logger.exception in unexpected wrapper errors.

Captures traceback for postmortem.

-                logger.error(f"Unexpected error in hook wrapper for {hook_point}: {e}")
+                logger.exception("Unexpected error in hook wrapper for %s: %s", hook_point, e)

494-499: Prefer logger.exception on attach failures; include exception context.

-        except Exception as e:
-            logger.error(f"Failed to attach hook to {hook_point}: {e}")
+        except Exception as e:
+            logger.exception("Failed to attach hook to %s: %s", hook_point, e)
             validation_errors.append({
                 'hook_point': hook_point,
                 'error': f'Failed to attach hook: {str(e)}'
             })

118-141: Security posture note: import allows escaping builtin restrictions.

Even with curated builtins, users can import builtins and regain open, etc. If this is intended (trusted hooks inside Docker), document clearly. If not, add an AST gate on Import/ImportFrom names (allowlist) or a custom import stub.

Would you like a follow-up patch to allowlist e.g. {"asyncio","json","re"} and block builtins, os, subprocess at AST-validation time?

Also applies to: 150-166

deploy/docker/api.py (4)

640-649: Normalize URL schemes in streaming path too.

Parity with non‑streaming handler avoids accidental scheme-less inputs breaking streaming.

-        # Attach hooks if provided
+        # Normalize URLs (add https:// when missing)
+        urls = [('https://' + u) if not u.startswith(('http://','https://')) and not u.startswith(("raw:", "raw://")) else u for u in urls]
+
+        # Attach hooks if provided

545-546: Use logger.exception for serialization/hook data errors.

Captures stack traces; easier ops triage.

-                logger.error(f"Error processing result: {e}")
+                logger.exception("Error processing result: %s", e)
@@
-                    logger.error(f"Hook data not JSON serializable: {e}")
+                    logger.exception("Hook data not JSON serializable: %s", e)
@@
-                logger.error(f"Serialization error: {e}")
+                logger.exception("Serialization error: %s", e)

Also applies to: 577-578, 426-427


352-359: Standardize timestamps to UTC ISO‑8601 with timezone.

Two paths write different timestamp flavors (local naive vs UTC-naive). Prefer UTC with offset or 'Z' for consistency.

-    from datetime import datetime
-    task_id = f"llm_{int(datetime.now().timestamp())}_{id(background_tasks)}"
+    from uuid import uuid4
+    task_id = f"llm_{uuid4().hex[:8]}"
@@
-        "created_at": datetime.now().isoformat(),
+        "created_at": datetime.now(timezone.utc).isoformat().replace("+00:00","Z"),
@@
-        "created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
+        "created_at": datetime.now(timezone.utc).isoformat().replace("+00:00","Z"),

If external clients parse created_at, confirm they accept the trailing 'Z'. I can adjust to keep the offset form if preferred.

Also applies to: 693-695


476-483: Import path robustness for in‑package modules.

from hook_manager import ... relies on working CWD/sys.path. Consider relative import (from .hook_manager import ...) for package safety, mirroring how deploy/docker is laid out.

Do we install deploy/docker as a package (so relative imports are valid), or run it as a script with PYTHONPATH hacks? I can adjust to match your runtime.

Also applies to: 642-649

crawl4ai/async_webcrawler.py (1)

357-357: Optional: move import to module scope.

Minor micro‑perf/readability.

-                    from urllib.parse import urlparse
+from urllib.parse import urlparse
docs/md_v2/api/parameters.md (1)

8-15: Keep the example aligned with the proxy deprecation

The table now marks proxy as deprecated, but the code snippet immediately above still demonstrates proxy= usage. Please switch the example to proxy_config so readers don’t copy a deprecated pattern.

 browser_cfg = BrowserConfig(
     browser_type="chromium",
     headless=True,
     viewport_width=1280,
     viewport_height=720,
-    proxy="http://user:pass@proxy:8080",
+    proxy_config={
+        "server": "http://proxy:8080",
+        "username": "user",
+        "password": "pass",
+    },
     user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
 )
docs/examples/website-to-api/static/styles.css (4)

56-66: Add visible focus states for keyboard users.

Interactive elements lack explicit focus-visible styling, which is an accessibility blocker for keyboard navigation. Add clear focus indicators to links and buttons.

+.nav-link:focus-visible {
+  outline: 2px solid #09b5a5;
+  outline-offset: 2px;
+}
+
+.extract-btn:focus-visible,
+.copy-btn:focus-visible,
+.save-btn:focus-visible,
+.btn-danger:focus-visible {
+  outline: 2px solid #09b5a5;
+  outline-offset: 2px;
+}

Also applies to: 210-229, 338-355, 581-599, 649-666


259-264: Avoid removing outlines on focus; replace with accessible outline.

Using outline: none removes a critical focus cue. Provide a replacement outline for accessibility.

-select:focus,
-.input-group select:focus,
-.option-group select:focus {
-    outline: none !important;
-    border-color: #09b5a5 !important;
-}
+select:focus,
+.input-group select:focus,
+.option-group select:focus {
+    outline: 2px solid #09b5a5 !important;
+    outline-offset: 2px !important;
+    border-color: #09b5a5 !important;
+}

299-311: Improve long-text wrapping in code blocks to preserve readability.

word-break: break-all splits tokens mid-word; prefer overflow-wrap:anywhere and word-break:normal.

 .api-request-box pre,
 .json-response-box pre {
   font-family: 'Courier New', monospace;
   font-size: 0.85rem;
   line-height: 1.5;
   color: #FFFFFF;
   background: #1A1A1A;
   padding: 1rem;
   border-radius: 4px;
   overflow-x: auto;
-  white-space: pre-wrap;
-  word-break: break-all;
+  white-space: pre-wrap;
+  word-break: normal;
+  overflow-wrap: anywhere;
 }
@@
 .request-curl pre {
   color: #CCCCCC;
   font-size: 0.8rem;
   line-height: 1.4;
   overflow-x: auto;
-  white-space: pre-wrap;
-  word-break: break-all;
+  white-space: pre-wrap;
+  word-break: normal;
+  overflow-wrap: anywhere;
   background: #111111;
   padding: 0.75rem;
   border-radius: 4px;
   border: 1px solid #333;
 }

Also applies to: 536-547


412-425: Respect reduced-motion preferences.

Offer a non-animated spinner for users with motion sensitivity.

 @keyframes spin {
     0% { transform: rotate(0deg); }
     100% { transform: rotate(360deg); }
 }
+
+@media (prefers-reduced-motion: reduce) {
+  .spinner {
+    animation: none;
+    border-top-color: #09b5a5; /* static indicator */
+  }
+}
docs/examples/website-to-api/api_server.py (4)

165-171: Align response serialization with Pydantic v2.

Serialize ScrapeResponse using model_dump().

-            response=response_data.dict()
+            response=response_data.model_dump() if hasattr(response_data, "model_dump") else response_data.dict()

Also applies to: 223-229


140-174: Slim down blanket exception handling and preserve cause.

Catching Exception broadly is noisy; where you keep it, chain the cause and move success return to an else: block for clarity.

-    try:
+    try:
         # Save the API request
         headers = {"Content-Type": "application/json"}
         body = {
             "url": str(request.url),
             "query": request.query,
             "model_name": request.model_name
         }
         
         result = await scraper_agent.scrape_data(
             url=str(request.url),
             query=request.query,
             model_name=request.model_name
         )
         
         response_data = ScrapeResponse(
             success=True,
             url=result["url"],
             query=result["query"],
             extracted_data=result["extracted_data"],
             schema_used=result["schema_used"],
             timestamp=result["timestamp"]
         )
-        
-        # Save the request with response
-        save_api_request(
-            endpoint="/scrape",
-            method="POST",
-            headers=headers,
-            body=body,
-            response=response_data.dict()
-        )
-        
-        return response_data
-    
-    except Exception as e:
+    except Exception as e:
         # Save the failed request
         headers = {"Content-Type": "application/json"}
         body = {
             "url": str(request.url),
             "query": request.query,
             "model_name": request.model_name
         }
         
         save_api_request(
             endpoint="/scrape",
             method="POST",
             headers=headers,
             body=body,
             response={"error": str(e)}
         )
-        
-        raise HTTPException(status_code=500, detail=f"Scraping failed: {str(e)}")
+        raise HTTPException(status_code=500, detail=f"Scraping failed: {e}") from e
+    else:
+        save_api_request(
+            endpoint="/scrape",
+            method="POST",
+            headers=headers,
+            body=body,
+            response=response_data.model_dump() if hasattr(response_data, "model_dump") else response_data.dict()
+        )
+        return response_data

Also applies to: 175-193


92-109: Avoid blocking the event loop with file I/O in request paths.

get_saved_requests() does directory scans and file reads on the event loop. Consider offloading to a thread or using aiofiles to keep the API responsive under load.


13-26: Consider enabling CORS for the example app.

If the static UI is served from a different origin, add CORS to simplify local demos.

 from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
@@
 app = FastAPI(
@@
 )
+
+# CORS (demo-friendly; tighten in prod)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
docs/examples/adaptive_crawling/llm_config_example.py (1)

12-14: AsyncWebCrawler(verbose=...) is likely ignored.

Constructor typically expects config via BrowserConfig; the verbose kwarg may be dropped. Either pass a config or omit it.

-    async with AsyncWebCrawler(verbose=False) as crawler:
+    async with AsyncWebCrawler() as crawler:
@@
-    async with AsyncWebCrawler(verbose=True) as crawler:
+    async with AsyncWebCrawler() as crawler:

If you want explicit verbosity:

from crawl4ai import BrowserConfig
async with AsyncWebCrawler(config=BrowserConfig(verbose=False)) as crawler:
    ...

Also applies to: 96-99

tests/docker/test_hooks_comprehensive.py (6)

1-1: Remove shebang or make file executable.

Tests don’t need a shebang; drop it for cleanliness.

-#!/usr/bin/env python3

7-13: Add a global timeout for HTTP calls.

Avoid hanging tests; use a single TIMEOUT constant.

 import requests
 import json
 import time
+import os
 from typing import Dict, Any
 
-API_BASE_URL = "http://localhost:11234"
+API_BASE_URL = "http://localhost:11234"
+TIMEOUT = int(os.getenv("HOOK_TEST_TIMEOUT", "30"))

168-171: Pass an explicit timeout to requests.post.

Prevents indefinite hangs and aligns with best practices.

-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)

Also applies to: 281-283, 375-379, 465-469


189-195: Remove pointless f-strings.

These f-strings have no placeholders.

-                print(f"\n📈 Execution Statistics:")
+                print("\n📈 Execution Statistics:")
@@
-                print(f"\n📝 Execution Log:")
+                print("\n📝 Execution Log:")
@@
-            print(f"\n📄 Crawl Results:")
+            print("\n📄 Crawl Results:")

Also applies to: 197-205


213-219: Avoid bare except; catch JSON decode errors precisely.

Catching everything hides real issues.

-        except:
-            print(f"Error text: {response.text[:500]}")
+        except (ValueError, json.JSONDecodeError):
+            print(f"Error text: {response.text[:500]}")

501-505: Narrow the exception in the test runner.

Catching Exception is fine for a top-level test harness, but prefer logging the type and message.

-        except Exception as e:
-            print(f"❌ {name} failed: {e}")
+        except Exception as e:
+            print(f"❌ {name} failed: {type(e).__name__}: {e}")
             import traceback
             traceback.print_exc()
crawl4ai/utils.py (1)

2187-2196: HTTPS preservation: compare hostnames (ignore default ports) and deduplicate logic.

Current check uses parsed_full.netloc == parsed_base.netloc. This fails when one side includes default ports (e.g., example.com vs example.com:443) and repeats across three functions.

  • Compare hostname fields to ignore ports.
  • Factor the preserve-HTTPS snippet into a helper to avoid divergence.
-        if (parsed_full.scheme == 'http' and 
-            parsed_full.netloc == parsed_base.netloc and
+        if (parsed_full.scheme == 'http' and 
+            parsed_full.hostname == parsed_base.hostname and
             not href.strip().startswith('//')):
             full_url = full_url.replace('http://', 'https://', 1)

Apply the same change in normalize_url_for_deep_crawl and efficient_normalize_url_for_deep_crawl.

Optional: extract helper (outside these ranges):

def _preserve_https_if_internal(full_url, href, base_url, preserve_https, original_scheme):
    if preserve_https and original_scheme == 'https':
        pf, pb = urlparse(full_url), urlparse(base_url)
        if pf.scheme == 'http' and pf.hostname == pb.hostname and not href.strip().startswith('//'):
            return full_url.replace('http://', 'https://', 1)
    return full_url

Then call it in all three places.

Also applies to: 2260-2268, 2317-2325

crawl4ai/browser_adapter.py (2)

173-185: Use callable() and iscoroutinefunction; avoid blanket except.

  • Replace hasattr(self._stealth_function, '__call__') with callable(...).
  • Use inspect.iscoroutinefunction to decide await.
  • Avoid silent except Exception: pass to preserve diagnosability (log at least).
+import inspect
@@
-        if self._stealth_available and self._stealth_function:
-            try:
-                if hasattr(self._stealth_function, '__call__'):
-                    if 'async' in getattr(self._stealth_function, '__name__', ''):
-                        await self._stealth_function(page)
-                    else:
-                        self._stealth_function(page)
-            except Exception as e:
-                # Fail silently or log error depending on requirements
-                pass
+        if self._stealth_available and callable(self._stealth_function):
+            try:
+                if inspect.iscoroutinefunction(self._stealth_function):
+                    await self._stealth_function(page)
+                else:
+                    self._stealth_function(page)
+            except Exception as e:
+                # Log at debug level or collect in a diagnostics sink
+                # print(f"stealth apply failed: {e}")
+                return

261-264: Unused parameter naming to silence linters.

retrieve_console_messages(self, page) ignores page. Consider _page to reflect intentional unused parameter.

-    async def retrieve_console_messages(self, page: Page) -> List[Dict]:
+    async def retrieve_console_messages(self, _page: Page) -> List[Dict]:
         """Not needed for Playwright - messages are captured via events"""
         return []
docs/examples/docker_hooks_examples.py (3)

169-173: Add request timeouts to prevent hangs.

All requests.post(...) calls lack a timeout. Add a module-level TIMEOUT and pass timeout=TIMEOUT.

+TIMEOUT = 30
@@
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)
@@
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)
@@
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)
@@
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)

Also applies to: 281-283, 376-379, 466-470


190-206: Remove f-strings without placeholders.

These are plain strings; drop the f prefix.

-                print(f"\n📈 Execution Statistics:")
+                print("\n📈 Execution Statistics:")
-                print(f"\n📝 Execution Log:")
+                print("\n📝 Execution Log:")
-            print(f"\n📄 Crawl Results:")
+            print("\n📄 Crawl Results:")

214-220: Narrow exception type when parsing JSON error body.

Use ValueError (or requests.JSONDecodeError) instead of bare except.

-        except:
+        except ValueError:
             print(f"Error text: {response.text[:500]}")
crawl4ai/adaptive_crawler.py (2)

620-624: Type annotation should include None explicitly.

Defaulting llm_config=None with Union[...] triggers type-checker warnings. Use Optional[Union[LLMConfig, Dict]].

-    def __init__(self, embedding_model: str = None, llm_config: Union[LLMConfig, Dict] = None):
+    def __init__(self, embedding_model: str = None, llm_config: Optional[Union[LLMConfig, Dict]] = None):

988-993: Store learning_score metric for downstream consumers.

get_quality_confidence and stats expect learning_score, but calculate_confidence no longer sets it. Persist it in state.metrics.

         score = float((best >= tau).mean()) if tau is not None else float(best.mean())
 
         # Store quick metrics
         state.metrics['coverage_score'] = score
+        state.metrics['learning_score'] = score
         state.metrics['avg_best_similarity'] = float(best.mean())
         state.metrics['median_best_similarity'] = float(np.median(best))
deploy/docker/server.py (2)

575-579: All-results-failed branch can IndexError on empty list.

Guard against empty results before indexing.

-    if all(not result["success"] for result in results["results"]):
-        raise HTTPException(500, f"Crawl request failed: {results['results'][0]['error_message']}")
+    if not results["results"] or all(not r.get("success") for r in results["results"]):
+        first_err = next((r.get("error_message") for r in results.get("results", []) if r.get("error_message")), "Crawl failed")
+        raise HTTPException(500, f"Crawl request failed: {first_err}")

618-626: Header value should be plain string, not JSON-quoted.

hooks_info['status']['status'] is already a string; avoid json.dumps to prevent adding quotes in header value.

-        headers["X-Hooks-Status"] = json.dumps(hooks_info['status']['status'])
+        headers["X-Hooks-Status"] = str(hooks_info['status']['status'])
docs/releases_review/demo_v0.7.5.py (4)

63-93: Hook config: brace-glob won’t match; add marker hook and align summary

  • Playwright’s glob matcher doesn’t support the {png,jpg,...} brace pattern; routes won’t be registered as intended.
  • You check for “Crawl4AI v0.7.5 Docker Hook” in HTML and print summaries for hooks you didn’t configure.

Replace the image-blocking route and add a before_return_html hook to inject a marker. Align the summary with configured hooks.

         "on_page_context_created": """
 async def hook(page, context, **kwargs):
     print("Hook: Setting up page context")
-    # Block images to speed up crawling
-    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
+    # Block images to speed up crawling (register per-extension)
+    for ext in ["png", "jpg", "jpeg", "gif", "webp"]:
+        await context.route(f"**/*.{ext}", lambda route: route.abort())
     print("Hook: Images blocked")
     return page
 """,
@@
         "before_goto": """
 async def hook(page, context, url, **kwargs):
     print(f"Hook: About to navigate to {url}")
     # Add custom headers
     await page.set_extra_http_headers({
         'X-Test-Header': 'crawl4ai-hooks-test'
     })
     return page
 """,
+        "before_return_html": """
+async def hook(page, context, **kwargs):
+    print("Hook: before_return_html - injecting marker")
+    await page.evaluate(\"\"\"document.body.insertAdjacentHTML('beforeend', '<!-- Crawl4AI v0.7.5 Docker Hook -->')\"\"\" )
+    return page
+""",
     }
@@
-            print("\nHook Execution Summary:")
-            print("🔗 before_goto: URL modified with tracking parameter")
-            print("✅ after_goto: Page navigation completed")
-            print("📝 before_return_html: Content processed and marked")
+            print("\nHook Execution Summary:")
+            print("🔗 before_goto: Custom header set")
+            print("🧹 on_page_context_created: Images blocked")
+            print("📝 before_return_html: Marker injected")

Also applies to: 141-151


47-49: Avoid bare except; catch specific exceptions

Catching everything hides useful diagnostics. Use narrower exceptions for requests and JSON parsing.

-        except:
+        except requests.RequestException:
             return False
@@
-            except:
+            except ValueError:
                 print(f"Raw response: {response.text[:500]}")

Also applies to: 155-158


160-163: Use direct f-string interpolation and keep trace context

Prefer f"{e}" (or {e!s}) to str(e). Optionally print a short traceback when debugging.

-    except requests.exceptions.Timeout:
+    except requests.exceptions.Timeout:
         print("⏰ Request timed out after 60 seconds")
-    except Exception as e:
-        print(f"❌ Error: {str(e)}")
+    except Exception as e:
+        print(f"❌ Error: {e}")
@@
-        except Exception as e:
-            print(f"❌ Demo {i} error: {str(e)}")
+        except Exception as e:
+            print(f"❌ Demo {i} error: {e}")
             print("Continuing to next demo...")
@@
-    except Exception as e:
-        print(f"\n❌ Demo error: {str(e)}")
+    except Exception as e:
+        print(f"\n❌ Demo error: {e}")
         print("Make sure you have the required dependencies installed.")

Also applies to: 276-279, 307-309


194-196: Remove f-strings without placeholders

These strings don’t interpolate values; drop the “f” prefix.

-            print(f"  - Note: Actual LLM call may fail without valid API key")
+            print("  - Note: Actual LLM call may fail without valid API key")
@@
-        import requests
-        print(f"  - Requests library: ✓")
+        import requests
+        print("  - Requests library: ✓")
     except ImportError:
-        print(f"  - Requests library: ❌")
+        print("  - Requests library: ❌")
@@
-            print(f"\n⏹️  Demo interrupted by user")
+            print("\n⏹️  Demo interrupted by user")

Also applies to: 253-256, 275-276

docs/examples/website-to-api/web_scraper_lib.py (3)

53-57: Avoid MD5 for cache keys; switch to SHA-256

MD5 is flagged as insecure even for non-crypto uses. Use SHA-256 for low collision risk at similar cost.

-    def _generate_schema_key(self, url: str, query: str) -> str:
+    def _generate_schema_key(self, url: str, query: str) -> str:
         """Generate a unique key for schema caching based on URL and query."""
-        content = f"{url}:{query}"
-        return hashlib.md5(content.encode()).hexdigest()
+        content = f"{url}:{query}".encode("utf-8")
+        return hashlib.sha256(content).hexdigest()

135-172: Crawl result access and file IO robustness

  • Prefer explicit encodings when caching schema.
  • If CrawlResultContainer does not proxy attributes, accessing result.fit_html could fail. Consider accessing the underlying result safely.
-            html = result.fit_html
+            html = getattr(result, "fit_html", None) or getattr(getattr(result, "result", None), "fit_html", "")
@@
-        with open(schema_path, "w") as f:
+        with open(schema_path, "w", encoding="utf-8") as f:
             json.dump(schema, f, indent=2)
@@
-            result = await crawler.arun(url=url, config=run_config)
+            result = await crawler.arun(url=url, config=run_config)
             # Parse extracted_content if it's a JSON string
             extracted_data = result.extracted_content

Optionally, guard extracted_content access similarly to fit_html if Container may not proxy.
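
For example, a similarly defensive lookup could look like this (a sketch only; it mirrors the fit_html fallback above and assumes the container exposes the wrapped result as .result):

    # Fall back to the wrapped result object if the container does not proxy the attribute
    extracted_data = getattr(result, "extracted_content", None) \
        or getattr(getattr(result, "result", None), "extracted_content", None)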

Also applies to: 199-203, 328-335


354-370: Nit: use encoding and handle non-existent dir cleanly

Use explicit encoding and ignore missing dir errors.

-        import shutil
-        if os.path.exists(self.schemas_dir):
-            shutil.rmtree(self.schemas_dir)
+        import shutil
+        if os.path.exists(self.schemas_dir):
+            shutil.rmtree(self.schemas_dir)
         os.makedirs(self.schemas_dir, exist_ok=True)
         print("Schema cache cleared")

Optionally wrap rmtree in try/except OSError to report permission issues.
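
A minimal sketch of that guard (assuming the same self.schemas_dir attribute and that os is already imported in this module):

    import shutil

    try:
        shutil.rmtree(self.schemas_dir)
    except FileNotFoundError:
        pass  # nothing to clear yet
    except OSError as exc:
        print(f"Could not clear schema cache: {exc}")
    os.makedirs(self.schemas_dir, exist_ok=True)
    print("Schema cache cleared")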

Comment on lines +728 to 730
provider = llm_config_dict.get('provider', 'openai/gpt-4o-mini') if llm_config_dict else 'openai/gpt-4o-mini'
api_token = llm_config_dict.get('api_token') if llm_config_dict else None


⚠️ Potential issue | 🔴 Critical

Hardcoded query variations (fried rice) are a blocker.

map_query_semantic_space returns a static, irrelevant set of queries. This breaks adaptive crawling for any real query.

Immediate fix: call the LLM when configured; fall back to simple perturbations otherwise.

-        # response = perform_completion_with_backoff(
-        #     provider=provider,
-        #     prompt_with_variables=prompt,
-        #     api_token=api_token,
-        #     json_response=True
-        # )
-        # variations = json.loads(response.choices[0].message.content)
-        # # Mock data with more variations for split
-        variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
+        variations = None
+        if llm_config_dict:
+            try:
+                response = perform_completion_with_backoff(
+                    provider=llm_config_dict.get('provider', 'openai/gpt-4o-mini'),
+                    prompt_with_variables=prompt,
+                    api_token=llm_config_dict.get('api_token'),
+                    json_response=True
+                )
+                variations = json.loads(response.choices[0].message.content)
+            except Exception:
+                variations = None
+        if not variations or 'queries' not in variations:
+            # Lightweight deterministic fallback
+            base = query.strip().rstrip("?")
+            variations = {'queries': [
+                base,
+                f"{base} overview",
+                f"{base} tutorial",
+                f"{base} examples",
+                f"{base} best practices",
+                f"{base} troubleshooting",
+                f"{base} advanced",
+                f"{base} quick start",
+                f"{base} guide",
+                f"{base} faq",
+                f"{base} tips"
+            ]}
@@
-        other_queries = variations['queries'].copy()
+        other_queries = [q for q in variations['queries'] if q != query]

Also applies to: 741-744, 765-781

🧰 Tools
🪛 Ruff (0.13.1)

728-728: Local variable provider is assigned to but never used

Remove assignment to unused variable provider

(F841)


729-729: Local variable api_token is assigned to but never used

Remove assignment to unused variable api_token

(F841)

Comment on lines +471 to 486
warnings.warn("The 'proxy' parameter is deprecated and will be removed in a future release. Use 'proxy_config' instead.", UserWarning)
self.proxy = proxy
self.proxy_config = proxy_config
if isinstance(self.proxy_config, dict):
self.proxy_config = ProxyConfig.from_dict(self.proxy_config)
if isinstance(self.proxy_config, str):
self.proxy_config = ProxyConfig.from_string(self.proxy_config)


if self.proxy and self.proxy_config:
warnings.warn("Both 'proxy' and 'proxy_config' are provided. 'proxy_config' will take precedence.", UserWarning)
self.proxy = None
elif self.proxy:
# Convert proxy string to ProxyConfig if proxy_config is not provided
self.proxy_config = ProxyConfig.from_string(self.proxy)
self.proxy = None


🛠️ Refactor suggestion | 🟠 Major

Deprecation warnings: add stacklevel=2.

So the warning points to the caller.

-            warnings.warn("The 'proxy' parameter is deprecated and will be removed in a future release. Use 'proxy_config' instead.", UserWarning)
+            warnings.warn(
+                "The 'proxy' parameter is deprecated and will be removed in a future release. Use 'proxy_config' instead.",
+                UserWarning,
+                stacklevel=2,
+            )
@@
-            warnings.warn("Both 'proxy' and 'proxy_config' are provided. 'proxy_config' will take precedence.", UserWarning)
+            warnings.warn(
+                "Both 'proxy' and 'proxy_config' are provided. 'proxy_config' will take precedence.",
+                UserWarning,
+                stacklevel=2,
+            )
🧰 Tools
🪛 Ruff (0.13.1)

471-471: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)


480-480: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

🤖 Prompt for AI Agents
In crawl4ai/async_configs.py around lines 471 to 486, the warnings.warn calls
should point to the caller; update each warnings.warn in this block to include
stacklevel=2 (e.g., warnings.warn(..., UserWarning, stacklevel=2)) so the
warning location is the caller rather than inside this module; ensure you add
stacklevel=2 to the deprecation warning and to the "Both 'proxy' and
'proxy_config'" warning (and any other warnings.warn in this vicinity).

Comment on lines +1069 to +1074
    _UNWANTED_PROPS = {
        'disable_cache' : 'Instead, use cache_mode=CacheMode.DISABLED',
        'bypass_cache' : 'Instead, use cache_mode=CacheMode.BYPASS',
        'no_cache_read' : 'Instead, use cache_mode=CacheMode.WRITE_ONLY',
        'no_cache_write' : 'Instead, use cache_mode=CacheMode.READ_ONLY',
    }

🛠️ Refactor suggestion | 🟠 Major

Annotate _UNWANTED_PROPS as ClassVar (RUF012).

Prevents it from being treated as a mutable instance attribute.

-from typing import Any, Dict, Optional
+from typing import Any, Dict, Optional, ClassVar
@@
-    _UNWANTED_PROPS = {
+    _UNWANTED_PROPS: ClassVar[dict[str, str]] = {
         'disable_cache' : 'Instead, use cache_mode=CacheMode.DISABLED',
         'bypass_cache' : 'Instead, use cache_mode=CacheMode.BYPASS',
         'no_cache_read' : 'Instead, use cache_mode=CacheMode.WRITE_ONLY',
         'no_cache_write' : 'Instead, use cache_mode=CacheMode.READ_ONLY',
     }
🧰 Tools
🪛 Ruff (0.13.1)

1069-1074: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents
In crawl4ai/async_configs.py around lines 1069 to 1074, the _UNWANTED_PROPS dict
is currently an un-annotated class-level mutable which can be treated as an
instance attribute; annotate it as a ClassVar to signal it is a class-level
constant. Import ClassVar from typing (if not already imported) and change the
declaration to annotate _UNWANTED_PROPS as ClassVar[dict[str, str]] (or
ClassVar[Mapping[str, str]]), leaving the literal contents unchanged so it
remains a class-level attribute not a mutable instance field.

Comment on lines +43 to +47
        if isinstance(logger, logging.Logger):
            self.logger = logger
        else:
            # Create a new logger if logger is None, dict, or any other non-Logger type
            self.logger = logging.getLogger(__name__)

⚠️ Potential issue | 🟠 Major

Keep support for LoggerAdapter-style loggers

Before this change we could pass logging.LoggerAdapter (or any duck-typed logger with .info/.debug etc.) and it worked. Now anything that isn’t a concrete logging.Logger silently falls back to logging.getLogger(__name__), dropping adapter context and breaking existing integrations (e.g., structlog wrappers, adapters adding request IDs). Please broaden the check (or use duck typing) instead of hard-rejecting adapters.

-        if isinstance(logger, logging.Logger):
-            self.logger = logger
-        else:
-            # Create a new logger if logger is None, dict, or any other non-Logger type
-            self.logger = logging.getLogger(__name__)
+        if logger is None:
+            self.logger = logging.getLogger(__name__)
+        elif isinstance(logger, (logging.Logger, logging.LoggerAdapter)):
+            self.logger = logger
+        elif hasattr(logger, "info") and hasattr(logger, "debug"):
+            # allow duck-typed loggers (structlog, custom wrappers)
+            self.logger = logger
+        else:
+            self.logger = logging.getLogger(__name__)
🤖 Prompt for AI Agents
In crawl4ai/deep_crawling/bfs_strategy.py around lines 43 to 47, the current
code only accepts concrete logging.Logger instances and replaces anything else
(including logging.LoggerAdapter or duck-typed loggers) with
logging.getLogger(__name__); change the check to accept LoggerAdapter or any
object exposing the standard logging methods instead of hard-rejecting them: if
logger is None create a new logger, otherwise if the passed object has the
typical logging callables (e.g., hasattr(logger, "info") and hasattr(logger,
"debug") and hasattr(logger, "warning")) assign it directly to self.logger; fall
back to logging.getLogger(__name__) only when logger is None or does not expose
the logging methods.

Comment on lines +257 to +264
# Remove any property descriptors that might have been included
# These deprecated properties should not be in the serialized output
for key in ['fit_html', 'fit_markdown', 'markdown_v2']:
if key in result and isinstance(result[key], property):
# del result[key]
# Nasrin: I decided to convert it to string instead of removing it.
result[key] = str(result[key])


⚠️ Potential issue | 🟠 Major

Revert to dropping deprecated property placeholders from serialized payloads.

By converting the property descriptors to strings we now emit values like "<property object at 0x...>" in the API payload. Clients that previously relied on the fields being absent (or actual string content) will now receive meaningless placeholders, breaking backward compatibility and defeating the intent of filtering these deprecated keys. Please keep deleting these entries instead of stringifying them.

Apply this diff to restore the previous behavior:

-        for key in ['fit_html', 'fit_markdown', 'markdown_v2']:
-            if key in result and isinstance(result[key], property):
-                # del result[key]
-                # Nasrin: I decided to convert it to string instead of removing it.
-                result[key] = str(result[key])
+        for key in ['fit_html', 'fit_markdown', 'markdown_v2']:
+            if key in result and isinstance(result[key], property):
+                del result[key]
🤖 Prompt for AI Agents
In crawl4ai/models.py around lines 257 to 264, revert the change that
stringifies deprecated property descriptors; instead remove those keys from the
serialized result so clients don’t receive placeholder strings. For each key in
['fit_html','fit_markdown','markdown_v2'] check if the key exists and if its
value is a property (or property-like placeholder) then delete it from result
(e.g., use del result[key] or result.pop(key, None)) rather than converting to
str; keep the rest of the serialization unchanged.

Comment on lines +45 to +49
    console.print(Panel.fit(
        f"[cyan]POST {endpoint}[/cyan]\n{syntax}",
        title=f"[bold blue]{title}[/]",
        border_style="blue"
    ))

⚠️ Potential issue | 🟡 Minor

Restore Rich syntax rendering for request panels

Embedding the Syntax renderable in the f-string coerces it to its repr, so the panel prints something like <rich.syntax.Syntax object …> instead of the highlighted payload. Let Rich handle the renderable directly.

-    console.print(Panel.fit(
-        f"[cyan]POST {endpoint}[/cyan]\n{syntax}",
-        title=f"[bold blue]{title}[/]",
-        border_style="blue"
-    ))
+    console.print(Panel.fit(
+        syntax,
+        title=f"[bold blue]{title}[/]",
+        border_style="blue",
+        subtitle=f"[cyan]POST {endpoint}[/cyan]",
+        subtitle_align="left",
+    ))
🤖 Prompt for AI Agents
In tests/docker/test_llm_params.py around lines 45-49 the Syntax object is being
embedded in an f-string which coerces it to its repr; instead pass renderables
directly to Panel.fit so Rich can render the Syntax. Replace the f-string usage
with a renderable container (e.g., rich.console.Group or a list/tuple) that
includes the POST header string and the Syntax object, and pass that container
as the first argument to Panel.fit while keeping the title and border_style
unchanged.

Comment on lines +167 to +175
        async with async_client.stream("POST", "/crawl", json=payload) as response:
            assert response.status_code == 200
            assert response.headers["content-type"] == "application/x-ndjson"
            assert response.headers.get("x-stream-status") == "active"

            results = await process_streaming_response(response)

            assert len(results) == 1
            result = results[0]

⚠️ Potential issue | 🟠 Major

Loosen the streaming content-type assertion
httpx will happily surface content-type values like application/x-ndjson; charset=utf-8. With the current exact match, this test will fail even though the server is doing the right thing. Please relax the check (e.g., use .startswith("application/x-ndjson") or parse the media type) so we only fail on genuinely incorrect responses.

-            assert response.headers["content-type"] == "application/x-ndjson"
+            content_type = response.headers.get("content-type", "")
+            assert content_type.startswith("application/x-ndjson")
🤖 Prompt for AI Agents
In tests/docker/test_server_requests.py around lines 167 to 175, the test
asserts an exact content-type match "application/x-ndjson" which fails when the
server returns parameters like a charset; change the assertion to accept media
type variants by checking that
response.headers["content-type"].lower().startswith("application/x-ndjson") (or
parse the media type and compare only the type/subtype) so the test only fails
for genuinely wrong content-types.

Comment on lines +10 to +63
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def test_best_first_strategy():
    """Test BestFirstCrawlingStrategy with keyword scoring"""

    print("=" * 70)
    print("Testing BestFirstCrawlingStrategy with Real URL")
    print("=" * 70)
    print("\nThis test will:")
    print("1. Crawl Python.org documentation")
    print("2. Score pages based on keywords: 'tutorial', 'guide', 'reference'")
    print("3. Show that higher-scoring pages are crawled first")
    print("-" * 70)

    # Create a keyword scorer that prioritizes tutorial/guide pages
    scorer = KeywordRelevanceScorer(
        keywords=["tutorial", "guide", "reference", "documentation"],
        weight=1.0,
        case_sensitive=False
    )

    # Create the strategy with scoring
    strategy = BestFirstCrawlingStrategy(
        max_depth=2,              # Crawl 2 levels deep
        max_pages=10,             # Limit to 10 pages total
        url_scorer=scorer,        # Use keyword scoring
        include_external=False    # Only internal links
    )

    # Configure browser and crawler
    browser_config = BrowserConfig(
        headless=True,    # Run in background
        verbose=False     # Reduce output noise
    )

    crawler_config = CrawlerRunConfig(
        deep_crawl_strategy=strategy,
        verbose=False
    )

    print("\nStarting crawl of https://docs.python.org/3/")
    print("Looking for pages with keywords: tutorial, guide, reference, documentation")
    print("-" * 70)

    crawled_urls = []

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Crawl and collect results
        results = await crawler.arun(
            url="https://docs.python.org/3/",
            config=crawler_config
        )

⚠️ Potential issue | 🟠 Major

Avoid external network calls in the test suite

This test hits https://docs.python.org/3/, so it will flake whenever the network, DNS, or the remote site is slow or offline. Please swap in a local fixture (mock server, recorded response, or static file) or mark the test skipped unless an opt-in flag enables network access.
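
For instance, a minimal opt-in gate could look like this (a sketch only; the ENABLE_NETWORK_TESTS flag name is just an illustration, and the rest of the test body stays unchanged):

    import os
    import pytest

    # Only run network-dependent tests when the caller explicitly opts in.
    requires_network = pytest.mark.skipif(
        not os.getenv("ENABLE_NETWORK_TESTS"),
        reason="set ENABLE_NETWORK_TESTS=1 to run tests that reach external sites",
    )

    @requires_network
    async def test_best_first_strategy():
        ...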

🤖 Prompt for AI Agents
In tests/general/test_bff_scoring.py around lines 10 to 63, the test performs an
external network call to https://docs.python.org/3/ which makes the suite flaky;
replace the external request with a deterministic local fixture or opt-in skip:
either (a) add a pytest fixture that starts a local HTTP test server serving a
small snapshot HTML set and point the crawler to that local URL, or (b)
monkeypatch/fixture the AsyncWebCrawler.arun method to return a canned result
set (or use a recorded response), or (c) mark the test with
pytest.mark.skipif(not os.getenv("ENABLE_NETWORK_TESTS")) so it only runs when
an opt-in environment variable is set. Ensure the test no longer relies on
external DNS/network and update setup/teardown fixtures accordingly.

Comment on lines +94 to +110
    # Check if higher scores appear early in the crawl
    scores = [item['score'] for item in crawled_urls[1:]]  # Skip initial URL
    high_score_indices = [i for i, s in enumerate(scores) if s > 0.3]

    if high_score_indices and high_score_indices[0] < len(scores) / 2:
        print("✅ SUCCESS: Higher-scoring pages (with keywords) were crawled early!")
        print("   This confirms the priority queue fix is working.")
    else:
        print("⚠️  Check the crawl order above - higher scores should appear early")

    # Show score distribution
    print(f"\nScore Statistics:")
    print(f"  - Total pages crawled: {len(crawled_urls)}")
    print(f"  - Average score: {sum(item['score'] for item in crawled_urls) / len(crawled_urls):.2f}")
    print(f"  - Max score: {max(item['score'] for item in crawled_urls):.2f}")
    print(f"  - Pages with keywords: {sum(1 for item in crawled_urls if item['score'] > 0.3)}")


⚠️ Potential issue | 🟠 Major

Add assertions so the test actually enforces behavior

Right now the “analysis” block only prints guidance; the test never fails if the priority queue regresses. Convert those checks into real assertions (or explicit pytest.fail) so a scoring regression fails the build instead of silently passing.

-    if high_score_indices and high_score_indices[0] < len(scores) / 2:
-        print("✅ SUCCESS: Higher-scoring pages (with keywords) were crawled early!")
-        print("   This confirms the priority queue fix is working.")
-    else:
-        print("⚠️  Check the crawl order above - higher scores should appear early")
+    assert scores, "No crawl results recorded; check earlier setup to ensure the strategy ran"
+    assert high_score_indices and high_score_indices[0] < len(scores) / 2, (
+        "Higher-scoring pages should be crawled early; inspect priority queue weighting"
+    )
🧰 Tools
🪛 Ruff (0.13.1)

105-105: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In tests/general/test_bff_scoring.py around lines 94 to 110, the test currently
only prints pass/fail guidance for crawl ordering and score stats, so
regressions won’t fail CI; replace the prints with real test assertions (or
pytest.fail) that enforce behavior: assert that there is at least one
high-scoring page (e.g. any score > 0.3) and that the first high-scoring page
index (relative to scores after skipping initial URL) is < len(scores) / 2; also
assert non-zero total pages before computing averages to avoid ZeroDivisionError
and assert expected ranges for average/max scores if desired, failing explicitly
when conditions aren’t met.

Comment on lines +9 to +36
warnings.simplefilter("always", DeprecationWarning)

proxy_str = "23.95.150.145:6114:username:password"
with warnings.catch_warnings(record=True) as caught:
cfg = BrowserConfig(proxy=proxy_str, headless=True)

dep_warnings = [w for w in caught if issubclass(w.category, DeprecationWarning)]
assert dep_warnings, "Expected DeprecationWarning when using BrowserConfig(proxy=...)"

assert cfg.proxy is None, "cfg.proxy should be None after auto-conversion"
assert isinstance(cfg.proxy_config, ProxyConfig), "cfg.proxy_config should be ProxyConfig instance"
assert cfg.proxy_config.username == "username"
assert cfg.proxy_config.password == "password"
assert cfg.proxy_config.server.startswith("http://")
assert cfg.proxy_config.server.endswith(":6114")


def test_browser_config_with_proxy_config_emits_no_deprecation():
warnings.simplefilter("always", DeprecationWarning)

with warnings.catch_warnings(record=True) as caught:
cfg = BrowserConfig(
headless=True,
proxy_config={
"server": "http://127.0.0.1:8080",
"username": "u",
"password": "p",
},

⚠️ Potential issue | 🟠 Major

Avoid leaking global warning filters
Calling warnings.simplefilter("always", DeprecationWarning) outside the catch_warnings context mutates the global filter list for the entire test run. That means every subsequent test (even in other modules) will start emitting DeprecationWarnings, which is exactly the kind of test cross-talk we try to prevent. Please move the filter inside the catch_warnings context (and do the same in the second test) so the override is scoped to this block.

-    warnings.simplefilter("always", DeprecationWarning)
-
-    proxy_str = "23.95.150.145:6114:username:password"
-    with warnings.catch_warnings(record=True) as caught:
+    proxy_str = "23.95.150.145:6114:username:password"
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always", DeprecationWarning)
         cfg = BrowserConfig(proxy=proxy_str, headless=True)
@@
-    warnings.simplefilter("always", DeprecationWarning)
-
-    with warnings.catch_warnings(record=True) as caught:
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always", DeprecationWarning)
         cfg = BrowserConfig(
🧰 Tools
🪛 Ruff (0.13.1)

21-21: Possible hardcoded password assigned to: "password"

(S105)

🤖 Prompt for AI Agents
In tests/proxy/test_proxy_deprecation.py around lines 9 to 36, the call to
warnings.simplefilter("always", DeprecationWarning) is applied globally before
entering warnings.catch_warnings, which mutates global warning filters; move the
simplefilter call inside each with warnings.catch_warnings(record=True) as
caught: block (for both tests) so the filter is scoped to the context manager,
ensuring you set the filter immediately after entering the catch_warnings block
and before creating the BrowserConfig, and remove the external/global
simplefilter call.

ntohidi and others added 10 commits September 30, 2025 11:54
- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design
- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security
…PI endpoints

- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index
Fix: run_urls() returns None, crashing arun_many()
Marketplace and brand book changes
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🧹 Nitpick comments (6)
deploy/docker/api.py (2)

263-265: Consider parameter naming consistency.

The parameter api_base_url is used here, while most other functions in this file use base_url for the same purpose (lines 124, 191). Consider renaming to base_url for consistency across the API surface, unless there's a specific reason for the different name.

Apply this diff to align the parameter name:

 async def handle_llm_request(
     redis: aioredis.Redis,
     background_tasks: BackgroundTasks,
     request: Request,
     input_path: str,
     query: Optional[str] = None,
     schema: Optional[str] = None,
     cache: str = "0",
     config: Optional[dict] = None,
     provider: Optional[str] = None,
     temperature: Optional[float] = None,
-    api_base_url: Optional[str] = None
+    base_url: Optional[str] = None
 ) -> JSONResponse:

And update the call at line 298:

         return await create_new_task(
             redis,
             background_tasks,
             input_path,
             query,
             schema,
             cache,
             base_url,
             config,
             provider,
             temperature,
-            api_base_url
+            base_url
         )

And update the signature at line 345:

 async def create_new_task(
     redis: aioredis.Redis,
     background_tasks: BackgroundTasks,
     input_path: str,
     query: str,
     schema: Optional[str],
     cache: str,
     base_url: str,
     config: dict,
     provider: Optional[str] = None,
     temperature: Optional[float] = None,
-    api_base_url: Optional[str] = None
+    base_url: Optional[str] = None
 ) -> JSONResponse:

And update the call at line 372:

     background_tasks.add_task(
         process_llm_extraction,
         redis,
         config,
         task_id,
         decoded_url,
         query,
         schema,
         cache,
         provider,
         temperature,
-        api_base_url
+        base_url
     )

692-692: Consider simplifying the timestamp creation.

The pattern datetime.now(timezone.utc).replace(tzinfo=None).isoformat() creates a timezone-aware datetime but then immediately removes the timezone information. This has the same effect as datetime.utcnow().isoformat() but is more verbose.

If the goal is to store a UTC timestamp without timezone information (possibly for Redis compatibility), consider using the simpler pattern for clarity:

-        "created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
+        "created_at": datetime.utcnow().isoformat(),

Alternatively, if you want to preserve timezone information in the stored timestamp, remove the .replace(tzinfo=None) call:

-        "created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
+        "created_at": datetime.now(timezone.utc).isoformat(),
docs/md_v2/marketplace/README.md (1)

25-25: Wrap bare URL to satisfy markdownlint.

markdownlint (MD034) flags the bare http://localhost:8100. Enclose it in angle brackets (<http://…>) so the docs pipeline stays green.

docs/md_v2/marketplace/marketplace.css (1)

1-957: Avoid maintaining two divergent copies of the same stylesheet.

This file is byte-for-byte identical to docs/md_v2/marketplace/frontend/marketplace.css. Two full copies will drift almost immediately and double every future CSS change. Please move the shared rules into a single asset (e.g., keep one file and @import it from the other, or factor common pieces into assets/styles.css) so both contexts stay in sync with one source of truth.

docs/md_v2/marketplace/backend/config.py (1)

48-59: Make ALLOWED_ORIGINS immutable.

ALLOWED_ORIGINS is a mutable list at class scope, so any accidental mutation is shared process-wide and Ruff flags it (RUF012). Switch to an immutable tuple (or annotate with ClassVar) to satisfy the linter and avoid shared-state surprises.

-    ALLOWED_ORIGINS = [
+    ALLOWED_ORIGINS = (
         "http://localhost:8000",
         "http://localhost:8080",
         "http://localhost:8100",
         "http://127.0.0.1:8000",
         "http://127.0.0.1:8080",
         "http://127.0.0.1:8100",
         "https://crawl4ai.com",
         "https://www.crawl4ai.com",
         "https://docs.crawl4ai.com",
         "https://market.crawl4ai.com"
-    ]
+    )

Based on static analysis hints.

docs/md_v2/marketplace/frontend/app-detail.js (1)

147-161: Update proxy example to the new proxy_config API.

v0.7.5 deprecates the proxy= kwarg in favor of proxy_config, but the Proxy Services snippet still shows the old signature. Please refresh the example so the docs reinforce the new structure.

-async with AsyncWebCrawler(proxy=proxy_config) as crawler:
+async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 361499d and 9900f63.

⛔ Files ignored due to path filters (1)
  • docs/md_v2/assets/images/logo.png is excluded by !**/*.png
📒 Files selected for processing (29)
  • .gitignore (2 hunks)
  • crawl4ai/async_dispatcher.py (1 hunks)
  • deploy/docker/api.py (19 hunks)
  • docs/md_v2/assets/page_actions.css (1 hunks)
  • docs/md_v2/assets/page_actions.js (1 hunks)
  • docs/md_v2/branding/index.md (1 hunks)
  • docs/md_v2/marketplace/README.md (1 hunks)
  • docs/md_v2/marketplace/admin/admin.css (1 hunks)
  • docs/md_v2/marketplace/admin/admin.js (1 hunks)
  • docs/md_v2/marketplace/admin/index.html (1 hunks)
  • docs/md_v2/marketplace/app-detail.css (1 hunks)
  • docs/md_v2/marketplace/app-detail.js (1 hunks)
  • docs/md_v2/marketplace/backend/.env.example (1 hunks)
  • docs/md_v2/marketplace/backend/config.py (1 hunks)
  • docs/md_v2/marketplace/backend/database.py (1 hunks)
  • docs/md_v2/marketplace/backend/dummy_data.py (1 hunks)
  • docs/md_v2/marketplace/backend/requirements.txt (1 hunks)
  • docs/md_v2/marketplace/backend/schema.yaml (1 hunks)
  • docs/md_v2/marketplace/backend/server.py (1 hunks)
  • docs/md_v2/marketplace/frontend/app-detail.css (1 hunks)
  • docs/md_v2/marketplace/frontend/app-detail.html (1 hunks)
  • docs/md_v2/marketplace/frontend/app-detail.js (1 hunks)
  • docs/md_v2/marketplace/frontend/index.html (1 hunks)
  • docs/md_v2/marketplace/frontend/marketplace.css (1 hunks)
  • docs/md_v2/marketplace/frontend/marketplace.js (1 hunks)
  • docs/md_v2/marketplace/index.html (1 hunks)
  • docs/md_v2/marketplace/marketplace.css (1 hunks)
  • docs/md_v2/marketplace/marketplace.js (1 hunks)
  • mkdocs.yml (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/md_v2/marketplace/backend/schema.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitignore
🧰 Additional context used
🧬 Code graph analysis (9)
docs/md_v2/marketplace/backend/dummy_data.py (1)
docs/md_v2/marketplace/backend/database.py (1)
  • DatabaseManager (7-117)
docs/md_v2/marketplace/frontend/app-detail.js (4)
docs/md_v2/marketplace/admin/admin.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/marketplace.js (4)
docs/md_v2/marketplace/admin/admin.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (3)
  • API_BASE (2-2)
  • CACHE_TTL (3-3)
  • marketplace (392-392)
docs/md_v2/marketplace/app-detail.js (4)
docs/md_v2/marketplace/admin/admin.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/admin/admin.js (1)
docs/md_v2/marketplace/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (3)
docs/md_v2/marketplace/admin/admin.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
deploy/docker/api.py (4)
deploy/docker/utils.py (4)
  • validate_llm_provider (92-108)
  • get_llm_temperature (111-145)
  • get_llm_base_url (148-171)
  • get_llm_api_key (74-89)
crawl4ai/async_configs.py (1)
  • LLMConfig (1703-1785)
deploy/docker/hook_manager.py (3)
  • attach_user_hooks_to_crawler (455-512)
  • UserHookManager (15-270)
  • get_summary (256-270)
crawl4ai/models.py (1)
  • model_dump (240-268)
docs/md_v2/marketplace/backend/server.py (2)
docs/md_v2/marketplace/backend/database.py (3)
  • DatabaseManager (7-117)
  • get_all (80-89)
  • search (91-113)
docs/md_v2/marketplace/backend/config.py (1)
  • Config (30-59)
docs/md_v2/marketplace/backend/database.py (1)
docs/md_v2/marketplace/backend/server.py (1)
  • search (146-164)
🪛 dotenv-linter (3.3.0)
docs/md_v2/marketplace/backend/.env.example

[warning] 14-14: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)

🪛 markdownlint-cli2 (0.18.1)
docs/md_v2/branding/index.md

755-755: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1046-1046: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1057-1057: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1068-1068: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1218-1218: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1230-1230: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1248-1248: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)

docs/md_v2/marketplace/README.md

25-25: Bare URL used

(MD034, no-bare-urls)

🪛 Ruff (0.13.2)
docs/md_v2/marketplace/backend/config.py

48-59: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

docs/md_v2/marketplace/backend/dummy_data.py

17-17: Possible SQL injection vector through string-based query construction

(S608)


136-136: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


136-136: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


137-137: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


137-137: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


149-149: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


230-230: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


247-247: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


248-248: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

deploy/docker/api.py

544-544: Do not catch blind exception: Exception

(BLE001)


545-545: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


576-576: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


584-584: Consider moving this statement to an else block

(TRY300)


659-659: Consider moving this statement to an else block

(TRY300)

docs/md_v2/marketplace/backend/server.py

179-179: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


239-239: Possible SQL injection vector through string-based query construction

(S608)


242-242: Consider moving this statement to an else block

(TRY300)


243-243: Do not catch blind exception: Exception

(BLE001)


244-244: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


257-257: Possible SQL injection vector through string-based query construction

(S608)


258-258: Consider [*list(app_data.values()), app_id] instead of concatenation

Replace with [*list(app_data.values()), app_id]

(RUF005)


260-260: Consider moving this statement to an else block

(TRY300)


261-261: Do not catch blind exception: Exception

(BLE001)


262-262: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


284-284: Possible SQL injection vector through string-based query construction

(S608)


287-287: Consider moving this statement to an else block

(TRY300)


288-288: Do not catch blind exception: Exception

(BLE001)


289-289: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


301-301: Possible SQL injection vector through string-based query construction

(S608)


302-302: Consider [*list(article_data.values()), article_id] instead of concatenation

Replace with [*list(article_data.values()), article_id]

(RUF005)


304-304: Consider moving this statement to an else block

(TRY300)


305-305: Do not catch blind exception: Exception

(BLE001)


306-306: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


324-324: Possible SQL injection vector through string-based query construction

(S608)


327-327: Consider moving this statement to an else block

(TRY300)


328-328: Do not catch blind exception: Exception

(BLE001)


329-329: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


337-337: Possible SQL injection vector through string-based query construction

(S608)


338-338: Consider [*list(category_data.values()), cat_id] instead of concatenation

Replace with [*list(category_data.values()), cat_id]

(RUF005)


340-340: Consider moving this statement to an else block

(TRY300)


341-341: Do not catch blind exception: Exception

(BLE001)


342-342: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


352-352: Possible SQL injection vector through string-based query construction

(S608)


355-355: Consider moving this statement to an else block

(TRY300)


356-356: Do not catch blind exception: Exception

(BLE001)


357-357: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


365-365: Possible SQL injection vector through string-based query construction

(S608)


366-366: Consider [*list(sponsor_data.values()), sponsor_id] instead of concatenation

Replace with [*list(sponsor_data.values()), sponsor_id]

(RUF005)


368-368: Consider moving this statement to an else block

(TRY300)


369-369: Do not catch blind exception: Exception

(BLE001)


370-370: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

docs/md_v2/marketplace/backend/database.py

31-31: f-string without any placeholders

Remove extraneous f prefix

(F541)


80-80: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


82-82: Possible SQL injection vector through string-based query construction

(S608)


91-91: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


108-108: Possible SQL injection vector through string-based query construction

(S608)

crawl4ai/async_dispatcher.py

468-468: return inside finally blocks cause exceptions to be silenced

(B012)

🔇 Additional comments (18)
deploy/docker/api.py (18)

7-7: LGTM!

Adding timezone support for task creation timestamps is a good practice for consistent time handling across deployments.


45-47: LGTM!

The new utility functions properly centralize LLM configuration logic with provider-specific environment variable resolution.


101-103: LGTM!

The migration to centralized helper functions is correct, and the approach of returning None to let litellm handle defaults is appropriate.


122-124: LGTM!

Adding per-request LLM configuration parameters aligns with the PR objectives for enhanced LLM integration with multi-provider support.


136-142: LGTM!

The configuration priority (request parameters → provider-specific env vars → global defaults) is correctly implemented and aligns with the multi-provider environment variable support described in PR objectives.


189-191: LGTM!

Consistent parameter additions for LLM configuration propagation across the API surface.


216-218: LGTM!

LLM configuration follows the established pattern consistently.


416-418: LGTM!

Proper serialization guard for fit_html. The defensive check prevents serialization errors while maintaining backward compatibility by setting non-serializable values to None.


446-447: LGTM!

Adding hooks support to the crawl request handler aligns with the Docker Hooks System feature described in PR objectives.


473-485: LGTM!

Hooks attachment logic is well-structured with proper error isolation and configurable timeouts. The local imports keep the dependency optional.


488-493: LGTM!

The improved base config merging logic now correctly handles both None and empty string cases, ensuring user-provided values always take precedence.


503-505: LGTM!

Normalizing results to always be a list improves API contract consistency and simplifies downstream processing.
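In practice this normalization is a one-liner, sketched here with a hypothetical helper name:

def normalize_results(results):
    # Wrap a single result so downstream code can always iterate over a list.
    return results if isinstance(results, list) else [results]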


520-550: LGTM!

The enhanced result processing significantly improves robustness by handling multiple result types and serialization edge cases. The broad exception catching (flagged by static analysis at line 544) is appropriate here for defensive error handling to maintain API stability.


535-537: LGTM!

Consistent serialization handling for fit_html across streaming and non-streaming code paths.


560-583: LGTM!

The hooks information integration is well-designed with defensive serialization testing and graceful error handling. The static analysis suggestion (TRY300 at line 584) about moving the return to an else block is a minor style preference that doesn't affect correctness.


639-651: LGTM!

Hooks attachment for streaming follows the same pattern as non-streaming requests, ensuring consistency across both code paths.


659-659: LGTM!

The return statement correctly includes hooks_info to match the updated function signature. The static analysis suggestion (TRY300) is a minor style preference.


614-615: All callers handle the new 3-tuple return signature correctly.

            memory_monitor.cancel()
            if self.monitor:
                self.monitor.stop()
            return results

⚠️ Potential issue | 🟡 Minor

Return in finally block can silence exceptions during cleanup.

Moving the return statement into the finally block guarantees a single exit point but introduces a risk: if an exception occurs during cleanup (e.g., memory_monitor.cancel() at line 465 or self.monitor.stop() at line 467), it will be silenced by the return statement. Additionally, the current exception handling at line 459 catches but does not re-raise exceptions, meaning errors in the main processing loop are already swallowed.

While this pattern may be intentional for graceful degradation (collecting partial results even on errors), it can make debugging difficult if cleanup failures go unnoticed.
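To make the hazard concrete, a tiny standalone example (unrelated to the Crawl4AI code) shows how a return inside finally swallows an in-flight exception:

def risky():
    try:
        raise RuntimeError("boom")
    finally:
        return "ok"   # the RuntimeError is discarded right here (Ruff B012)

print(risky())  # prints "ok"; the exception never propagates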

Consider one of these approaches:

  1. Preferred: Move the return outside the finally block and explicitly handle the return in both success and exception paths:
         except Exception as e:
             if self.monitor:
                 self.monitor.update_memory_status(f"QUEUE_ERROR: {str(e)}")                
         
         finally:
             # Clean up
             memory_monitor.cancel()
             if self.monitor:
                 self.monitor.stop()
-            return results
+        
+        return results
  2. Alternative: If you intend to suppress cleanup exceptions, document this behavior with a comment explaining why exceptions are swallowed and that partial results are intentionally returned.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return results
except Exception as e:
if self.monitor:
self.monitor.update_memory_status(f"QUEUE_ERROR: {str(e)}")
finally:
# Clean up
memory_monitor.cancel()
if self.monitor:
self.monitor.stop()
return results
🧰 Tools
🪛 Ruff (0.13.2)

468-468: return inside finally blocks cause exceptions to be silenced

(B012)

🤖 Prompt for AI Agents
In crawl4ai/async_dispatcher.py around line 468, the function returns from
inside the finally block which can silently swallow exceptions raised during
cleanup (memory_monitor.cancel() at ~465 or self.monitor.stop() at ~467) and
exceptions from the main processing loop are already being caught without
re-raising; move the return statement out of the finally block so the function
explicitly returns in the normal-success path and in the except path (or
re-raise after logging) instead of from finally, and ensure cleanup exceptions
are either logged and re-raised or explicitly documented if intentionally
suppressed (add a comment explaining suppression behavior if you choose to
swallow cleanup errors).

Comment on lines +317 to +339
.page-actions-overlay {
display: none;
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background: rgba(0, 0, 0, 0.5);
z-index: 998;
opacity: 0;
transition: opacity 0.2s ease;
}

.page-actions-overlay.active {
display: block;
opacity: 1;
}

@media (max-width: 768px) {
.page-actions-overlay {
display: block;
}
}

⚠️ Potential issue | 🟠 Major

Prevent the overlay from blocking all taps on mobile

On mobile, the media query forces .page-actions-overlay to display: block even when it’s not active. Because pointer events stay enabled, the invisible overlay sits above the page and swallows every tap, so users can’t interact with underlying content. Please disable pointer events while the overlay is inactive and re-enable them only for the active state.

 .page-actions-overlay {
     display: none;
     position: fixed;
     top: 0;
     left: 0;
     right: 0;
     bottom: 0;
     background: rgba(0, 0, 0, 0.5);
     z-index: 998;
     opacity: 0;
     transition: opacity 0.2s ease;
+    pointer-events: none;
 }
 
 .page-actions-overlay.active {
     display: block;
     opacity: 1;
+    pointer-events: auto;
 }
🤖 Prompt for AI Agents
In docs/md_v2/assets/page_actions.css around lines 317 to 339, the overlay is
forced to display:block on small screens but still blocks taps when inactive;
update the CSS so the default/inactive .page-actions-overlay has pointer-events:
none (so it won’t swallow touches) and .page-actions-overlay.active has
pointer-events: auto to restore interaction, making sure the media query does
not override the inactive pointer-events setting.

Comment on lines 146 to 163
async apiCall(endpoint, options = {}) {
const response = await fetch(`${API_BASE}${endpoint}`, {
...options,
headers: {
'Authorization': `Bearer ${this.token}`,
'Content-Type': 'application/json',
...options.headers
}
});

if (response.status === 401) {
this.logout();
throw { status: 401 };
}

if (!response.ok) throw new Error(`API Error: ${response.status}`);
return response.json();
}

⚠️ Potential issue | 🟠 Major

Handle 204/empty responses before calling .json().

apiCall unconditionally parses JSON; any admin DELETE (or other 204 responses) will throw Unexpected end of JSON input, preventing deletions. Guard against empty bodies before calling response.json().

-        if (!response.ok) throw new Error(`API Error: ${response.status}`);
-        return response.json();
+        if (!response.ok) throw new Error(`API Error: ${response.status}`);
+
+        if (response.status === 204) {
+            return null;
+        }
+
+        const text = await response.text();
+        return text ? JSON.parse(text) : null;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async apiCall(endpoint, options = {}) {
const response = await fetch(`${API_BASE}${endpoint}`, {
...options,
headers: {
'Authorization': `Bearer ${this.token}`,
'Content-Type': 'application/json',
...options.headers
}
});
if (response.status === 401) {
this.logout();
throw { status: 401 };
}
if (!response.ok) throw new Error(`API Error: ${response.status}`);
return response.json();
}
async apiCall(endpoint, options = {}) {
const response = await fetch(`${API_BASE}${endpoint}`, {
...options,
headers: {
'Authorization': `Bearer ${this.token}`,
'Content-Type': 'application/json',
...options.headers
}
});
if (response.status === 401) {
this.logout();
throw { status: 401 };
}
if (!response.ok) throw new Error(`API Error: ${response.status}`);
if (response.status === 204) {
return null;
}
const text = await response.text();
return text ? JSON.parse(text) : null;
}
🤖 Prompt for AI Agents
In docs/md_v2/marketplace/admin/admin.js around lines 146 to 163, apiCall
currently always calls response.json(), which fails on 204/empty responses;
update it to detect empty responses before parsing: after checking response.ok,
if response.status === 204 or response.headers.get('Content-Length') === '0' (or
the content-type header is missing/non-JSON) return null (or an appropriate
empty value) instead of calling response.json(); otherwise call and return
response.json(); keep existing 401 handling and error throw behavior.

Comment on lines +16 to +26
async init() {
if (!this.appSlug) {
window.location.href = 'index.html';
return;
}

await this.loadAppDetails();
this.setupEventListeners();
await this.loadRelatedApps();
}


⚠️ Potential issue | 🔴 Critical

Stop init flow when app data fails to load.

If loadAppDetails() cannot resolve an app, it redirects but leaves this.appData as null. Execution continues and loadRelatedApps() immediately dereferences this.appData.category, throwing TypeError: Cannot read properties of null. Add an early return after loading (and/or guard inside loadRelatedApps) so the rest of the pipeline only runs when this.appData is populated.

     async init() {
         if (!this.appSlug) {
             window.location.href = 'index.html';
             return;
         }

-        await this.loadAppDetails();
+        await this.loadAppDetails();
+        if (!this.appData) {
+            return;
+        }
         this.setupEventListeners();
         await this.loadRelatedApps();
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async init() {
if (!this.appSlug) {
window.location.href = 'index.html';
return;
}
await this.loadAppDetails();
this.setupEventListeners();
await this.loadRelatedApps();
}
async init() {
if (!this.appSlug) {
window.location.href = 'index.html';
return;
}
await this.loadAppDetails();
if (!this.appData) {
return;
}
this.setupEventListeners();
await this.loadRelatedApps();
}
🤖 Prompt for AI Agents
In docs/md_v2/marketplace/app-detail.js around lines 16 to 26, the init flow
continues after loadAppDetails even when this.appData is null, causing
loadRelatedApps to dereference this.appData.category and throw; after awaiting
this.loadAppDetails() add a guard that returns early if this.appData is falsy
(or alternatively update loadRelatedApps to first check for this.appData and
return/no-op if missing) so setupEventListeners and loadRelatedApps only run
when this.appData is populated.

Comment on lines +8 to +26
def __init__(self, db_path=None, schema_path='schema.yaml'):
self.schema = self._load_schema(schema_path)
# Use provided path or fallback to schema default
self.db_path = db_path or self.schema['database']['name']
self.conn = None
self._init_database()

def _load_schema(self, path: str) -> Dict:
with open(path, 'r') as f:
return yaml.safe_load(f)

def _init_database(self):
"""Auto-create/migrate database from schema"""
self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
self.conn.row_factory = sqlite3.Row

for table_name, table_def in self.schema['tables'].items():
self._create_or_update_table(table_name, table_def['columns'])


⚠️ Potential issue | 🔴 Critical

Make schema and database paths location-agnostic

DatabaseManager() uses whatever the current working directory happens to be. Running python docs/md_v2/marketplace/backend/dummy_data.py from the project root raises FileNotFoundError: 'schema.yaml', so the brand-new seeder/demo can’t even start. Resolve both paths relative to this module so callers (including FastAPI and the CLI seeder) don’t have to cd into the backend folder first.

     def __init__(self, db_path=None, schema_path='schema.yaml'):
-        self.schema = self._load_schema(schema_path)
-        # Use provided path or fallback to schema default
-        self.db_path = db_path or self.schema['database']['name']
+        schema_path = Path(schema_path)
+        if not schema_path.is_absolute():
+            schema_path = Path(__file__).resolve().parent / schema_path
+        self.schema = self._load_schema(schema_path)
+
+        default_db = db_path or self.schema['database']['name']
+        db_path = Path(default_db)
+        if not db_path.is_absolute():
+            db_path = schema_path.parent / db_path
+        db_path.parent.mkdir(parents=True, exist_ok=True)
+        self.db_path = str(db_path)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def __init__(self, db_path=None, schema_path='schema.yaml'):
        self.schema = self._load_schema(schema_path)
        # Use provided path or fallback to schema default
        self.db_path = db_path or self.schema['database']['name']
        self.conn = None
        self._init_database()

    def _load_schema(self, path: str) -> Dict:
        with open(path, 'r') as f:
            return yaml.safe_load(f)

    def _init_database(self):
        """Auto-create/migrate database from schema"""
        self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
        self.conn.row_factory = sqlite3.Row

        for table_name, table_def in self.schema['tables'].items():
            self._create_or_update_table(table_name, table_def['columns'])

    def __init__(self, db_path=None, schema_path='schema.yaml'):
        # Resolve schema_path relative to this module if not absolute
        schema_path = Path(schema_path)
        if not schema_path.is_absolute():
            schema_path = Path(__file__).resolve().parent / schema_path
        self.schema = self._load_schema(schema_path)
        # Determine db_path (override or from schema) and resolve it
        default_db = db_path or self.schema['database']['name']
        db_path = Path(default_db)
        if not db_path.is_absolute():
            db_path = schema_path.parent / db_path
        # Ensure parent dirs exist
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self.db_path = str(db_path)
        self.conn = None
        self._init_database()
🤖 Prompt for AI Agents
In docs/md_v2/marketplace/backend/database.py around lines 8 to 26, the class
currently resolves schema and database files relative to the current working
directory which causes FileNotFoundError when run from project root; change
resolution to be module-location-agnostic by resolving schema_path against the
module directory (e.g. Path(__file__).parent / schema_path) before opening it,
and if db_path is not provided, resolve the schema's database name relative to
the same module directory; ensure the code uses the resolved Path (or its
string) when opening the schema and when connecting with sqlite3 so absolute
paths from callers remain untouched and relative names become relative to the
backend module.

Comment on lines 56 to 143
where_clauses = []
if category:
where_clauses.append(f"category = '{category}'")
if type:
where_clauses.append(f"type = '{type}'")
if featured is not None:
where_clauses.append(f"featured = {1 if featured else 0}")
if sponsored is not None:
where_clauses.append(f"sponsored = {1 if sponsored else 0}")

where = " AND ".join(where_clauses) if where_clauses else None
apps = db.get_all('apps', limit=limit, offset=offset, where=where)

# Parse JSON fields
for app in apps:
if app.get('screenshots'):
app['screenshots'] = json.loads(app['screenshots'])

return json_response(apps)

@app.get("/api/apps/{slug}")
async def get_app(slug: str):
"""Get single app by slug"""
apps = db.get_all('apps', where=f"slug = '{slug}'", limit=1)
if not apps:
raise HTTPException(status_code=404, detail="App not found")

app = apps[0]
if app.get('screenshots'):
app['screenshots'] = json.loads(app['screenshots'])

return json_response(app)

@app.get("/api/articles")
async def get_articles(
category: Optional[str] = None,
limit: int = Query(default=20, le=10000),
offset: int = Query(default=0)
):
"""Get articles with optional category filter"""
where = f"category = '{category}'" if category else None
articles = db.get_all('articles', limit=limit, offset=offset, where=where)

# Parse JSON fields
for article in articles:
if article.get('related_apps'):
article['related_apps'] = json.loads(article['related_apps'])
if article.get('tags'):
article['tags'] = json.loads(article['tags'])

return json_response(articles)

@app.get("/api/articles/{slug}")
async def get_article(slug: str):
"""Get single article by slug"""
articles = db.get_all('articles', where=f"slug = '{slug}'", limit=1)
if not articles:
raise HTTPException(status_code=404, detail="Article not found")

article = articles[0]
if article.get('related_apps'):
article['related_apps'] = json.loads(article['related_apps'])
if article.get('tags'):
article['tags'] = json.loads(article['tags'])

return json_response(article)

@app.get("/api/categories")
async def get_categories():
"""Get all categories ordered by index"""
categories = db.get_all('categories', limit=50)
categories.sort(key=lambda x: x.get('order_index', 0))
return json_response(categories, cache_time=7200)

@app.get("/api/sponsors")
async def get_sponsors(active: Optional[bool] = True):
"""Get sponsors, default active only"""
where = f"active = {1 if active else 0}" if active is not None else None
sponsors = db.get_all('sponsors', where=where, limit=20)

# Filter by date if active
if active:
now = datetime.now().isoformat()
sponsors = [s for s in sponsors
if (not s.get('start_date') or s['start_date'] <= now) and
(not s.get('end_date') or s['end_date'] >= now)]

return json_response(sponsors)

⚠️ Potential issue | 🔴 Critical

Stop interpolating request data into SQL

get_apps, get_app, get_articles, get_article, and get_sponsors build the WHERE clause by f-stringing user-supplied values directly into SQL (category, type, slug, etc.). That enables trivial SQL injection on every public endpoint. Please switch to parameterized queries (e.g., add a params argument to DatabaseManager.get_all and bind values with ? placeholders) before releasing.

🤖 Prompt for AI Agents
In docs/md_v2/marketplace/backend/server.py around lines 56-143 the handlers
build SQL WHERE clauses by interpolating request values (category, type, slug,
featured, sponsored, active) with f-strings which allows SQL injection; change
all callers to use parameterized queries by updating DatabaseManager.get_all to
accept a params list and use ? placeholders in the WHERE clause instead of
string interpolation, then in each endpoint build the WHERE string with
placeholders (e.g. "category = ?" etc.) and pass the corresponding values in
order (convert booleans to 1/0 or use proper typed params), ensure slug/other
values are passed as params rather than concatenated, and adjust any
JSON/boolean handling to remain the same while ensuring no user data is directly
injected into SQL.

Comment on lines +214 to +275
async loadMainContent() {
// Load apps column
const apps = await this.api.getApps({ limit: 8 });
if (apps && apps.length) {
const appsGrid = document.getElementById('apps-grid');
appsGrid.innerHTML = apps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>★ ${app.rating}/5</span>
</div>
<div class="app-compact-title">${app.name}</div>
<div class="app-compact-desc">${app.description}</div>
</div>
`).join('');
}

// Load articles column
const articles = await this.api.getArticles({ limit: 6 });
if (articles && articles.length) {
const articlesList = document.getElementById('articles-list');
articlesList.innerHTML = articles.map(article => `
<div class="article-compact" onclick="marketplace.showArticle('${article.id}')">
<div class="article-meta">
<span>${article.category}</span> · <span>${new Date(article.published_at).toLocaleDateString()}</span>
</div>
<div class="article-title">${article.title}</div>
<div class="article-author">by ${article.author}</div>
</div>
`).join('');
}

// Load trending
if (apps && apps.length) {
const trending = apps.slice(0, 5);
const trendingList = document.getElementById('trending-list');
trendingList.innerHTML = trending.map((app, i) => `
<div class="trending-item" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="trending-rank">${i + 1}</div>
<div class="trending-info">
<div class="trending-name">${app.name}</div>
<div class="trending-stats">${app.downloads} downloads</div>
</div>
</div>
`).join('');
}

// Load more apps grid
const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
if (moreApps && moreApps.length) {
const moreGrid = document.getElementById('more-apps-grid');
moreGrid.innerHTML = moreApps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>${app.type}</span>
</div>
<div class="app-compact-title">${app.name}</div>
</div>
`).join('');
}
}

⚠️ Potential issue | 🟠 Major

Wire category/type filters into the data fetches.

filterByCategory and the type dropdown update this.currentCategory/this.currentType, but loadMainContent (and loadMoreApps) still call /apps with only limit/offset. The UI shows active filters, yet the results never change. Please pass the selected filters into every apps fetch so the listings honor user choices.

-        const apps = await this.api.getApps({ limit: 8 });
+        const baseParams = {
+            limit: 8,
+            ...(this.currentCategory && this.currentCategory !== 'all' ? { category: this.currentCategory } : {}),
+            ...(this.currentType ? { type: this.currentType } : {}),
+        };
+        const apps = await this.api.getApps(baseParams);
...
-        const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
+        const moreApps = await this.api.getApps({
+            offset: 8,
+            limit: 12,
+            ...(this.currentCategory && this.currentCategory !== 'all' ? { category: this.currentCategory } : {}),
+            ...(this.currentType ? { type: this.currentType } : {}),
+        });

Be sure to apply the same filtered params in loadMoreApps() as well.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async loadMainContent() {
// Load apps column
const apps = await this.api.getApps({ limit: 8 });
if (apps && apps.length) {
const appsGrid = document.getElementById('apps-grid');
appsGrid.innerHTML = apps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>★ ${app.rating}/5</span>
</div>
<div class="app-compact-title">${app.name}</div>
<div class="app-compact-desc">${app.description}</div>
</div>
`).join('');
}
// Load articles column
const articles = await this.api.getArticles({ limit: 6 });
if (articles && articles.length) {
const articlesList = document.getElementById('articles-list');
articlesList.innerHTML = articles.map(article => `
<div class="article-compact" onclick="marketplace.showArticle('${article.id}')">
<div class="article-meta">
<span>${article.category}</span> · <span>${new Date(article.published_at).toLocaleDateString()}</span>
</div>
<div class="article-title">${article.title}</div>
<div class="article-author">by ${article.author}</div>
</div>
`).join('');
}
// Load trending
if (apps && apps.length) {
const trending = apps.slice(0, 5);
const trendingList = document.getElementById('trending-list');
trendingList.innerHTML = trending.map((app, i) => `
<div class="trending-item" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="trending-rank">${i + 1}</div>
<div class="trending-info">
<div class="trending-name">${app.name}</div>
<div class="trending-stats">${app.downloads} downloads</div>
</div>
</div>
`).join('');
}
// Load more apps grid
const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
if (moreApps && moreApps.length) {
const moreGrid = document.getElementById('more-apps-grid');
moreGrid.innerHTML = moreApps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>${app.type}</span>
</div>
<div class="app-compact-title">${app.name}</div>
</div>
`).join('');
}
}
async loadMainContent() {
// Load apps column
const baseParams = {
limit: 8,
...(this.currentCategory && this.currentCategory !== 'all' ? { category: this.currentCategory } : {}),
...(this.currentType ? { type: this.currentType } : {}),
};
const apps = await this.api.getApps(baseParams);
if (apps && apps.length) {
const appsGrid = document.getElementById('apps-grid');
appsGrid.innerHTML = apps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>★ ${app.rating}/5</span>
</div>
<div class="app-compact-title">${app.name}</div>
<div class="app-compact-desc">${app.description}</div>
</div>
`).join('');
}
// Load articles column
const articles = await this.api.getArticles({ limit: 6 });
if (articles && articles.length) {
const articlesList = document.getElementById('articles-list');
articlesList.innerHTML = articles.map(article => `
<div class="article-compact" onclick="marketplace.showArticle('${article.id}')">
<div class="article-meta">
<span>${article.category}</span> · <span>${new Date(article.published_at).toLocaleDateString()}</span>
</div>
<div class="article-title">${article.title}</div>
<div class="article-author">by ${article.author}</div>
</div>
`).join('');
}
// Load trending
if (apps && apps.length) {
const trending = apps.slice(0, 5);
const trendingList = document.getElementById('trending-list');
trendingList.innerHTML = trending.map((app, i) => `
<div class="trending-item" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="trending-rank">${i + 1}</div>
<div class="trending-info">
<div class="trending-name">${app.name}</div>
<div class="trending-stats">${app.downloads} downloads</div>
</div>
</div>
`).join('');
}
// Load more apps grid
const moreApps = await this.api.getApps({
offset: 8,
limit: 12,
...(this.currentCategory && this.currentCategory !== 'all' ? { category: this.currentCategory } : {}),
...(this.currentType ? { type: this.currentType } : {}),
});
if (moreApps && moreApps.length) {
const moreGrid = document.getElementById('more-apps-grid');
moreGrid.innerHTML = moreApps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>${app.type}</span>
</div>
<div class="app-compact-title">${app.name}</div>
</div>
`).join('');
}
}
🤖 Prompt for AI Agents
In docs/md_v2/marketplace/frontend/marketplace.js around lines 214 to 275,
loadMainContent is calling this.api.getApps without passing the current filters
so UI-selected category/type are ignored; update every this.api.getApps call in
this file (including the later loadMoreApps) to include filter params built from
this.currentCategory and this.currentType (e.g. add category and type fields to
the options object only when they are set) so that the initial apps, trending
slice, and the more-apps fetch all respect the active filters.

Comment on lines 145 to 211
const imageUrl = hero.image || '';
heroCard.innerHTML = `
<div class="hero-image" ${imageUrl ? `style="background-image: url('${imageUrl}')"` : ''}>
${!imageUrl ? `[${hero.category || 'APP'}]` : ''}
</div>
<div class="hero-content">
<span class="hero-badge">${hero.type || 'PAID'}</span>
<h2 class="hero-title">${hero.name}</h2>
<p class="hero-description">${hero.description}</p>
<div class="hero-meta">
<span>★ ${hero.rating || 0}/5</span>
<span>${hero.downloads || 0} downloads</span>
</div>
</div>
`;
heroCard.onclick = () => this.showAppDetail(hero);
}

// Secondary featured cards
const secondary = document.getElementById('featured-secondary');
secondary.innerHTML = '';
if (featured.length > 1) {
featured.slice(1, 4).forEach(app => {
const card = document.createElement('div');
card.className = 'secondary-card';
const imageUrl = app.image || '';
card.innerHTML = `
<div class="secondary-image" ${imageUrl ? `style="background-image: url('${imageUrl}')"` : ''}>
${!imageUrl ? `[${app.category || 'APP'}]` : ''}
</div>
<div class="secondary-content">
<h3 class="secondary-title">${app.name}</h3>
<p class="secondary-desc">${(app.description || '').substring(0, 100)}...</p>
<div class="secondary-meta">
<span>${app.type || 'Open Source'}</span> · <span>★ ${app.rating || 0}/5</span>
</div>
</div>
`;
card.onclick = () => this.showAppDetail(app);
secondary.appendChild(card);
});
}
}

async loadSponsors() {
const sponsors = await this.api.getSponsors();
if (!sponsors || !sponsors.length) {
// Show placeholder if no sponsors
const container = document.getElementById('sponsored-content');
container.innerHTML = `
<div class="sponsor-card">
<h4>Become a Sponsor</h4>
<p>Reach thousands of developers using Crawl4AI</p>
<a href="mailto:[email protected]">Contact Us →</a>
</div>
`;
return;
}

const container = document.getElementById('sponsored-content');
container.innerHTML = sponsors.slice(0, 5).map(sponsor => `
<div class="sponsor-card">
<h4>${sponsor.company_name}</h4>
<p>${sponsor.tier} Sponsor - Premium Solutions</p>
<a href="${sponsor.landing_url}" target="_blank">Learn More →</a>
</div>
`).join('');

⚠️ Potential issue | 🔴 Critical

Sanitize API data before injecting with innerHTML

Multiple sections write API-provided strings (e.g., hero.description, app.description, article titles) straight into innerHTML. If any marketplace entry contains <script> or similar markup, this becomes an XSS vector. Please render these fields with textContent/DOM builders or run them through a trusted sanitizer before inserting into the DOM.

Also applies to: 214-259, 319-355

Comment on lines +214 to +275
async loadMainContent() {
// Load apps column
const apps = await this.api.getApps({ limit: 8 });
if (apps && apps.length) {
const appsGrid = document.getElementById('apps-grid');
appsGrid.innerHTML = apps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>★ ${app.rating}/5</span>
</div>
<div class="app-compact-title">${app.name}</div>
<div class="app-compact-desc">${app.description}</div>
</div>
`).join('');
}

// Load articles column
const articles = await this.api.getArticles({ limit: 6 });
if (articles && articles.length) {
const articlesList = document.getElementById('articles-list');
articlesList.innerHTML = articles.map(article => `
<div class="article-compact" onclick="marketplace.showArticle('${article.id}')">
<div class="article-meta">
<span>${article.category}</span> · <span>${new Date(article.published_at).toLocaleDateString()}</span>
</div>
<div class="article-title">${article.title}</div>
<div class="article-author">by ${article.author}</div>
</div>
`).join('');
}

// Load trending
if (apps && apps.length) {
const trending = apps.slice(0, 5);
const trendingList = document.getElementById('trending-list');
trendingList.innerHTML = trending.map((app, i) => `
<div class="trending-item" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="trending-rank">${i + 1}</div>
<div class="trending-info">
<div class="trending-name">${app.name}</div>
<div class="trending-stats">${app.downloads} downloads</div>
</div>
</div>
`).join('');
}

// Load more apps grid
const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
if (moreApps && moreApps.length) {
const moreGrid = document.getElementById('more-apps-grid');
moreGrid.innerHTML = moreApps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>${app.type}</span>
</div>
<div class="app-compact-title">${app.name}</div>
</div>
`).join('');
}
}

⚠️ Potential issue | 🟠 Major

Category/type filters never applied

loadMainContent and loadMoreApps ignore this.currentCategory/this.currentType, so clicking any filter or changing the type dropdown does nothing—the UI always shows the unfiltered feed. Please pass the active filters into every getApps call (initial grid, “more apps”, and load-more pagination) and reset paging counters when the filter changes.

-        const apps = await this.api.getApps({ limit: 8 });
+        const appParams = { limit: 8 };
+        if (this.currentCategory && this.currentCategory !== 'all') {
+            appParams.category = this.currentCategory;
+        }
+        if (this.currentType) {
+            appParams.type = this.currentType;
+        }
+        const apps = await this.api.getApps(appParams);

-        const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
+        const moreAppsParams = { offset: 8, limit: 12 };
+        if (this.currentCategory && this.currentCategory !== 'all') {
+            moreAppsParams.category = this.currentCategory;
+        }
+        if (this.currentType) {
+            moreAppsParams.type = this.currentType;
+        }
+        const moreApps = await this.api.getApps(moreAppsParams);

-        const moreApps = await this.api.getApps({ offset: this.loadedApps, limit: 12 });
+        const params = { offset: this.loadedApps, limit: 12 };
+        if (this.currentCategory && this.currentCategory !== 'all') {
+            params.category = this.currentCategory;
+        }
+        if (this.currentType) {
+            params.type = this.currentType;
+        }
+        const moreApps = await this.api.getApps(params);

Also applies to: 358-376

🤖 Prompt for AI Agents
In docs/md_v2/marketplace/marketplace.js around lines 214-275 (and also apply
same fix at 358-376), loadMainContent and load-more logic call this.api.getApps
without respect to this.currentCategory and this.currentType and never reset
paging; update every getApps call to include the active filters (e.g. pass {
category: this.currentCategory, type: this.currentType, limit, offset }) so the
initial apps, trending, and moreApps queries are filtered, ensure
load-more/pagination reuses the same filter object, and reset any paging/offset
counters to 0 whenever the category or type filter changes so the UI shows
filtered results from the first page.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (6)
docs/md_v2/assets/page_actions.css (1)

305-327: Restore mobile tap access by disabling overlay pointer events when inactive

On mobile, the media query forces .page-actions-overlay to display: block even when it isn’t active. Because pointer events stay enabled, this transparent overlay sits on top of the page and swallows all taps, breaking navigation. Default the overlay to pointer-events: none and only enable them in the active state.

 .page-actions-overlay {
     display: none;
     position: fixed;
     top: 0;
     left: 0;
     right: 0;
     bottom: 0;
     background: rgba(0, 0, 0, 0.5);
     z-index: 998;
     opacity: 0;
     transition: opacity 0.2s ease;
+    pointer-events: none;
 }
 
 .page-actions-overlay.active {
     display: block;
     opacity: 1;
+    pointer-events: auto;
 }
docs/md_v2/marketplace/marketplace.js (3)

162-175: Sanitize API data before injecting with innerHTML.

The hero card renders hero.name and hero.description directly into innerHTML. If the API returns malicious markup (e.g., <script> tags), this creates an XSS vector. Use textContent for plain text fields or sanitize HTML before rendering.


236-245: Sanitize API data before injecting with innerHTML.

Similar to the hero card, the apps grid injects app.name, app.category, and app.description directly into innerHTML, creating XSS vulnerabilities. Apply sanitization or use safer DOM methods.

Also applies to: 252-260, 267-275, 282-290, 348-357, 363-371


231-292: Apply category and type filters to API calls.

The loadMainContent method ignores this.currentCategory and this.currentType, so filter buttons and type dropdown have no effect. Pass these parameters to every getApps call.

Apply this diff:

-        const apps = await this.api.getApps({ limit: 8 });
+        const params = { limit: 8 };
+        if (this.currentCategory && this.currentCategory !== 'all') {
+            params.category = this.currentCategory;
+        }
+        if (this.currentType) {
+            params.type = this.currentType;
+        }
+        const apps = await this.api.getApps(params);

Also applies to: 279-279, 377-377

docs/md_v2/marketplace/backend/server.py (1)

97-108: Stop interpolating request data into SQL.

The get_apps handler builds WHERE clauses by f-stringing user-supplied values (category, type, featured, sponsored) directly into SQL, enabling trivial SQL injection. Switch to parameterized queries.

Update DatabaseManager.get_all to accept a params list and use ? placeholders:

def get_all(self, table: str, limit: int = 100, offset: int = 0, where: str = None, params: list = None) -> List[Dict]:
    cursor = self.conn.cursor()
    query = f"SELECT * FROM {table}"
    query_params = []
    if where:
        query += f" WHERE {where}"
        if params:
            query_params.extend(params)
    query += f" LIMIT ? OFFSET ?"
    query_params.extend([limit, offset])
    cursor.execute(query, query_params)
    rows = cursor.fetchall()
    return [dict(row) for row in rows]

Then in the handler:

where_clauses = []
params = []
if category:
    where_clauses.append("category = ?")
    params.append(category)
if type:
    where_clauses.append("type = ?")
    params.append(type)
# ... and so on
where = " AND ".join(where_clauses) if where_clauses else None
apps = db.get_all('apps', limit=limit, offset=offset, where=where, params=params)

Also applies to: 120-120, 137-137, 152-152, 176-176

docs/md_v2/marketplace/admin/admin.js (1)

190-213: Handle 204/empty responses before calling .json().

The apiCall method unconditionally parses JSON. DELETE operations (or other 204 responses) will throw "Unexpected end of JSON input", preventing deletions from working.

Apply this fix:

         if (response.status === 401) {
             this.logout();
             throw { status: 401 };
         }

         if (!response.ok) throw new Error(`API Error: ${response.status}`);
-        return response.json();
+
+        if (response.status === 204 || response.headers.get('Content-Length') === '0') {
+            return null;
+        }
+
+        const text = await response.text();
+        return text ? JSON.parse(text) : null;
     }
🧹 Nitpick comments (2)
docs/md_v2/marketplace/marketplace.js (1)

237-237: Avoid inline onclick with stringified data.

Using JSON.stringify().replace(/"/g, '&quot;') for inline handlers is fragile. Consider using event delegation or data attributes to avoid potential XSS and improve maintainability.

Example using data attributes:

-                <div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
+                <div class="app-compact" data-app-id="${app.id}">

Then attach a delegated click handler:

document.getElementById('apps-grid').addEventListener('click', (e) => {
    const card = e.target.closest('[data-app-id]');
    if (card) {
        const app = this.data.apps.find(a => a.id == card.dataset.appId);
        if (app) this.showAppDetail(app);
    }
});

Also applies to: 268-268, 283-283

docs/md_v2/marketplace/admin/admin.js (1)

289-291: Consider event delegation for table actions.

While using numeric IDs in inline onclick handlers is safe here, event delegation would be more maintainable and eliminate any future XSS risk if string data is used.

Example:

// In setupEventListeners or after rendering:
document.getElementById('apps-table').addEventListener('click', (e) => {
    if (e.target.matches('.btn-edit')) {
        const row = e.target.closest('tr');
        const id = parseInt(row.dataset.id, 10);
        this.editItem('apps', id);
    }
    // Similar for duplicate and delete
});

Then update the table row:

<tr data-id="${app.id}">
    ...
    <button class="btn-edit">Edit</button>

Also applies to: 327-329, 361-362, 400-401

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9900f63 and 611d48f.

📒 Files selected for processing (11)
  • docs/blog/release-v0.7.4.md (0 hunks)
  • docs/md_v2/assets/page_actions.css (1 hunks)
  • docs/md_v2/assets/page_actions.js (1 hunks)
  • docs/md_v2/marketplace/admin/admin.css (1 hunks)
  • docs/md_v2/marketplace/admin/admin.js (1 hunks)
  • docs/md_v2/marketplace/admin/index.html (1 hunks)
  • docs/md_v2/marketplace/backend/server.py (1 hunks)
  • docs/md_v2/marketplace/backend/uploads/.gitignore (1 hunks)
  • docs/md_v2/marketplace/marketplace.css (1 hunks)
  • docs/md_v2/marketplace/marketplace.js (1 hunks)
  • mkdocs.yml (4 hunks)
💤 Files with no reviewable changes (1)
  • docs/blog/release-v0.7.4.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • docs/md_v2/marketplace/admin/index.html
  • docs/md_v2/marketplace/admin/admin.css
  • docs/md_v2/assets/page_actions.js
  • mkdocs.yml
🧰 Additional context used
🧬 Code graph analysis (3)
docs/md_v2/marketplace/admin/admin.js (4)
docs/md_v2/marketplace/marketplace.js (4)
  • window (2-9)
  • window (3-3)
  • origin (5-5)
  • resolveAssetUrl (11-18)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/marketplace.js (4)
docs/md_v2/marketplace/admin/admin.js (2)
  • window (24-24)
  • resolveAssetUrl (39-46)
docs/md_v2/marketplace/frontend/marketplace.js (3)
  • CACHE_TTL (3-3)
  • API_BASE (2-2)
  • marketplace (392-392)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/backend/server.py (2)
docs/md_v2/marketplace/backend/database.py (3)
  • DatabaseManager (7-117)
  • get_all (80-89)
  • search (91-113)
docs/md_v2/marketplace/backend/config.py (1)
  • Config (30-59)
🪛 Ruff (0.13.3)
docs/md_v2/marketplace/backend/server.py

222-222: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


231-231: Do not perform function call File in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


307-307: Possible SQL injection vector through string-based query construction

(S608)


310-310: Consider moving this statement to an else block

(TRY300)


311-311: Do not catch blind exception: Exception

(BLE001)


312-312: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


325-325: Possible SQL injection vector through string-based query construction

(S608)


326-326: Consider [*list(app_data.values()), app_id] instead of concatenation

Replace with [*list(app_data.values()), app_id]

(RUF005)


328-328: Consider moving this statement to an else block

(TRY300)


329-329: Do not catch blind exception: Exception

(BLE001)


330-330: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


352-352: Possible SQL injection vector through string-based query construction

(S608)


355-355: Consider moving this statement to an else block

(TRY300)


356-356: Do not catch blind exception: Exception

(BLE001)


357-357: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


369-369: Possible SQL injection vector through string-based query construction

(S608)


370-370: Consider [*list(article_data.values()), article_id] instead of concatenation

Replace with [*list(article_data.values()), article_id]

(RUF005)


372-372: Consider moving this statement to an else block

(TRY300)


373-373: Do not catch blind exception: Exception

(BLE001)


374-374: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


395-395: Possible SQL injection vector through string-based query construction

(S608)


398-398: Consider moving this statement to an else block

(TRY300)


399-399: Do not catch blind exception: Exception

(BLE001)


400-400: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


412-412: Possible SQL injection vector through string-based query construction

(S608)


413-413: Consider [*list(category_data.values()), cat_id] instead of concatenation

Replace with [*list(category_data.values()), cat_id]

(RUF005)


415-415: Consider moving this statement to an else block

(TRY300)


416-416: Do not catch blind exception: Exception

(BLE001)


417-417: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


427-427: Consider moving this statement to an else block

(TRY300)


428-428: Do not catch blind exception: Exception

(BLE001)


429-429: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


439-439: Possible SQL injection vector through string-based query construction

(S608)


442-442: Consider moving this statement to an else block

(TRY300)


443-443: Do not catch blind exception: Exception

(BLE001)


444-444: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


452-452: Possible SQL injection vector through string-based query construction

(S608)


453-453: Consider [*list(sponsor_data.values()), sponsor_id] instead of concatenation

Replace with [*list(sponsor_data.values()), sponsor_id]

(RUF005)


455-455: Consider moving this statement to an else block

(TRY300)


456-456: Do not catch blind exception: Exception

(BLE001)


457-457: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


467-467: Consider moving this statement to an else block

(TRY300)


468-468: Do not catch blind exception: Exception

(BLE001)


469-469: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (7)
docs/md_v2/marketplace/backend/server.py (3)

230-252: LGTM: Secure file upload implementation.

The upload endpoint properly validates folder whitelist, content type, and file size. The secure filename generation using timestamp and token_hex prevents path traversal and collision attacks.
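A sketch of the timestamp-plus-token naming pattern referenced here; the helper name is hypothetical and the real endpoint may differ in details:

import secrets
import time
from pathlib import Path

def secure_filename(original_name: str) -> str:
    # Keep only the extension from the user-supplied name; everything else is
    # server-generated, which avoids path traversal and name collisions.
    suffix = Path(original_name).suffix.lower()
    return f"{int(time.time())}_{secrets.token_hex(8)}{suffix}"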


332-338: LGTM: Parameterized DELETE query.

The delete endpoint uses parameterized queries correctly, preventing SQL injection.

Also applies to: 376-382, 420-429, 460-469


63-83: LGTM: Robust type coercion utility.

The to_int helper safely handles various input types (bool, int, float, str) with proper fallbacks. Good defensive programming.
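Such a helper typically looks like the sketch below; this is an illustration of the described behavior, not the exact server.py code:

def to_int(value, default: int = 0) -> int:
    # bool is a subclass of int in Python, so handle it explicitly first.
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, (int, float)):
        return int(value)
    if isinstance(value, str):
        try:
            return int(float(value.strip()))
        except ValueError:
            return default
    return default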

docs/md_v2/marketplace/admin/admin.js (4)

2-37: LGTM: Flexible API origin configuration.

The API configuration logic supports URL parameter overrides with localStorage persistence, providing flexibility for development and testing scenarios. The fallback logic is robust.


648-693: LGTM: Well-structured save flow with image upload.

The saveItem method properly handles sponsor logo upload before saving, collects form data, and refreshes the UI after successful operations. Error handling is appropriate.


744-799: LGTM: Comprehensive file upload UI handling.

The logo upload handlers properly manage state transitions, show previews using FileReader, and handle both existing and new file scenarios. Good UX implementation.


813-824: LGTM: Referential integrity check before deletion.

The deleteCategory method prevents deletion of categories that have associated apps, maintaining data integrity. This is good defensive programming.

Comment on lines +307 to +312
cursor.execute(f"INSERT INTO apps ({columns}) VALUES ({placeholders})",
list(app_data.values()))
db.conn.commit()
return {"id": cursor.lastrowid, "message": "App created"}
except Exception as e:
raise HTTPException(status_code=400, detail=str(e))

⚠️ Potential issue | 🟠 Major

Fix SQL construction and exception handling.

While line 307 uses ? placeholders for values, constructing the column list from dict keys is still risky if those keys could be influenced by user input. Additionally, catching bare Exception and raising without chaining loses context.

Apply this diff to improve exception handling:

         cursor.execute(f"INSERT INTO apps ({columns}) VALUES ({placeholders})",
                       list(app_data.values()))
         db.conn.commit()
         return {"id": cursor.lastrowid, "message": "App created"}
     except Exception as e:
-        raise HTTPException(status_code=400, detail=str(e))
+        raise HTTPException(status_code=400, detail=str(e)) from e

Also applies to: 329-330, 356-357, 373-374, 399-400, 416-417, 443-444, 456-457, 468-469

🧰 Tools
🪛 Ruff (0.13.3)

307-307: Possible SQL injection vector through string-based query construction

(S608)


310-310: Consider moving this statement to an else block

(TRY300)


311-311: Do not catch blind exception: Exception

(BLE001)


312-312: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Comment on lines +325 to +330
cursor.execute(f"UPDATE apps SET {set_clause} WHERE id = ?",
list(app_data.values()) + [app_id])
db.conn.commit()
return {"message": "App updated"}
except Exception as e:
raise HTTPException(status_code=400, detail=str(e))

⚠️ Potential issue | 🟠 Major

Use parameterized UPDATE and improve list construction.

Line 325 builds the SET clause from dict keys, which is risky. Additionally, line 326 concatenates lists instead of using unpacking.

Apply these fixes:

         set_clause = ', '.join([f"{k} = ?" for k in app_data.keys()])
         cursor = db.conn.cursor()
         cursor.execute(f"UPDATE apps SET {set_clause} WHERE id = ?",
-                      list(app_data.values()) + [app_id])
+                      [*list(app_data.values()), app_id])
         db.conn.commit()
         return {"message": "App updated"}
     except Exception as e:
-        raise HTTPException(status_code=400, detail=str(e))
+        raise HTTPException(status_code=400, detail=str(e)) from e

Also applies to: 369-374, 412-417, 452-457

🧰 Tools
🪛 Ruff (0.13.3)

325-325: Possible SQL injection vector through string-based query construction

(S608)


326-326: Consider [*list(app_data.values()), app_id] instead of concatenation

Replace with [*list(app_data.values()), app_id]

(RUF005)


328-328: Consider moving this statement to an else block

(TRY300)


329-329: Do not catch blind exception: Exception

(BLE001)


330-330: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
In docs/md_v2/marketplace/backend/server.py around lines 325-330, the UPDATE
builds the SET clause unsafely from dict keys and constructs the parameter list
by concatenating lists; change to build a parameterized SET clause like ",
".join(f"{k} = ?" for k in app_data.keys()) and construct the parameters using
unpacking or append so the order matches the keys (e.g. params =
[*app_data.values(), app_id] or params = list(app_data.values());
params.append(app_id)), then call cursor.execute("UPDATE apps SET {set_clause}
WHERE id = ?", params) and commit; apply the same fix pattern to the other
occurrences at lines 369-374, 412-417, and 452-457.
