
Conversation

@ntohidi (Collaborator) commented on Sep 29, 2025

🚀 Crawl4AI v0.7.5: The Docker Hooks & Security Update

This PR introduces Crawl4AI v0.7.5 with major new features focused on extensibility and security.

🎯 What's New

🔧 Docker Hooks System

  • Complete pipeline customization with user-provided Python functions
  • Eight hook points: on_browser_created, on_page_context_created, before_goto, after_goto, on_user_agent_updated, on_execution_started, before_retrieve_html, before_return_html
  • Safe execution with AST validation, timeout protection, and error isolation
  • Real working examples with authentication, performance optimization, and content processing
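
A minimal request sketch for attaching a hook through the Docker API, assuming the payload carries a hooks mapping of hook-point names to HookConfig-style objects (code plus optional timeout, per the commit notes further down); the endpoint port and exact field names are illustrative, not the confirmed schema.

import requests

API_BASE_URL = "http://localhost:11234"  # adjust to your deployment

# User-provided hook source: an async function run before navigation
BEFORE_GOTO_HOOK = """
async def hook(page, context, url, **kwargs):
    # Add a custom header before the page loads
    await page.set_extra_http_headers({"X-Demo-Header": "crawl4ai-hooks"})
    return page
"""

payload = {
    "urls": ["https://httpbin.org/headers"],
    # Assumed shape: hook point -> {code, timeout}; check the HookConfig schema
    "hooks": {
        "before_goto": {"code": BEFORE_GOTO_HOOK, "timeout": 30},
    },
}

resp = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json().get("hooks_info"))  # execution stats/log, if present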

🤖 Enhanced LLM Integration

  • Custom provider support (OpenAI, Anthropic, Gemini, local models)
  • Temperature and base_url configuration for fine-tuned control
  • Multi-provider environment variable support
  • Docker API integration with enhanced LLM parameters
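
A hedged sketch of a per-request override against the Docker /md endpoint, following the provider/temperature/base_url flow shown in the review walkthrough below; the request body keys are assumptions rather than the confirmed schema.

import requests

API_BASE_URL = "http://localhost:11234"  # illustrative deployment address

payload = {
    "url": "https://example.com",
    "provider": "openai/gpt-4o-mini",            # litellm-style provider id (illustrative)
    "temperature": 0.2,                           # 0.0-2.0 per the release notes
    "base_url": "https://my-proxy.internal/v1",   # optional proxy or alternative endpoint
}

resp = requests.post(f"{API_BASE_URL}/md", json=payload, timeout=60)
print(resp.status_code, resp.json())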

🔒 HTTPS Preservation

  • New preserve_https_for_internal_links=True flag
  • Maintains secure protocols throughout crawling
  • Supports modern web security requirements
  • Prevents authentication cookie loss and security warnings
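
A short usage sketch from Python, assuming the flag is exposed on CrawlerRunConfig as the config changes in this PR suggest:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Keep same-domain links on https:// even when the server answers with http URLs
    config = CrawlerRunConfig(preserve_https_for_internal_links=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.links)  # internal links retain the https:// scheme

asyncio.run(main())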

🐍 Python 3.10+ Support

  • Dropped Python 3.9 support to enable modern language features; documentation updated accordingly

🛠️ Bug Fixes & Improvements

Major Fixes

Community-Reported Issues

  • Multiple GitHub issues and Discord feedback addressed
  • Enhanced proxy configuration and error handling
  • Improved dependency management and compatibility

🔄 Breaking Changes

  • Python 3.10+ Required: Upgrade from Python 3.9
  • Proxy Parameter Deprecated: Use new proxy_config structure
  • New Dependency: Added cssselect for better CSS handling
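
A migration sketch for the proxy change, mirroring the proxy_config shape suggested in the documentation fix later in this review; treat the exact keys as illustrative.

from crawl4ai import BrowserConfig

# Before (deprecated): a plain proxy string
# browser_cfg = BrowserConfig(proxy="http://user:pass@proxy:8080")

# After: structured proxy_config
browser_cfg = BrowserConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass",
    },
)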

📁 Files Changed

New Files

  • docs/releases_review/demo_v0.7.5.py - Working demo showcasing all new features
  • docs/blog/release-v0.7.5.md - Complete release notes

Updated Files

  • README.md - Added v0.7.5 highlights and updated version references
  • docs/md_v2/blog/index.md - Updated blog index with latest release
  • crawl4ai/__version__.py - Version bump to 0.7.5

🧪 Testing

  • ✅ Working demo created with real examples
  • ✅ All new features tested with live URLs (httpbin.org, quotes.toscrape.com)
  • ✅ Docker hooks system validated with actual API calls
  • ✅ HTTPS preservation tested with real sites
  • ✅ LLM integration verified with multiple providers

📚 Documentation

  • Complete release notes with real examples
  • Working demo file that users can run
  • Updated README with version highlights
  • Blog index updated for visibility

Summary by CodeRabbit

  • New Features

    • Docker Hooks System (8 hook points) and per-request hooks support; Marketplace and website-to-API example projects; enhanced LLM controls (per-request provider, temperature, base_url).
  • Improvements

    • Option to preserve HTTPS for internal links; stealth browsing refinements; streaming/unified crawl flows; proxy_config-first handling; timezone-aware timestamps.
  • Bug Fixes

    • URL normalization, serialization, and API error-handling fixes.
  • Documentation

    • v0.7.5 release notes, tutorials, examples, demos, and playground updates.
  • Deprecations / Breaking Changes

    • Python >=3.10 required; proxy string deprecated (use proxy_config); added cssselect dependency.

emmanuel-ferdman and others added 30 commits May 13, 2025 00:04
Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377
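
As a usage sketch, hook points can be discovered through the /hooks/info endpoint mentioned above; the response shape is not documented here, so only the raw JSON is printed.

import requests

API_BASE_URL = "http://localhost:11234"  # illustrative deployment address

resp = requests.get(f"{API_BASE_URL}/hooks/info", timeout=30)
resp.raise_for_status()
print(resp.json())  # available hook points and expected signatures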
…ncation. ref #1253

  Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error
Update URL seeding examples to use proper async context managers
Fix examples in README.md
fix(docker-api): migrate to modern datetime library API
Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291
fix(docker): Fix LLM API key handling for multi-provider support
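
In practice this means only the provider-specific environment variable needs to be exported before starting the Docker server; a minimal sketch using the variable names from the commit message (values are placeholders):

import os

# litellm resolves the right key per provider automatically; no api_key_env mapping needed
os.environ["OPENAI_API_KEY"] = "sk-..."          # for OpenAI providers
os.environ["GEMINI_API_TOKEN"] = "your-key"      # for Gemini providers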
…ted examples (#1330)

- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED
This commit adds a complete web scraping API example that demonstrates how to get structured data from any website and use it like an API, built on the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library, which was removed by PR #1368 (merged via 437395e).

Integration tested against the 0.7.4 Docker container. Reintroducing the cssselect package eliminated the errors seen in logs, and excluded_selector functionality was restored.

Refs: #1405
… deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly
…ration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.
feat(docker): Add temperature and base_url parameters for LLM configuration
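
A hedged illustration of the 4-tier priority using the environment variable names listed above; the request-level override shape follows the /md flow in the walkthrough and is an assumption.

import os
import requests

# Tier 3: global env default
os.environ["LLM_TEMPERATURE"] = "0.7"
# Tier 2: provider-specific env beats the global default
os.environ["OPENAI_TEMPERATURE"] = "0.3"

# Tier 1: explicit request parameters win over any environment setting
payload = {
    "url": "https://example.com",
    "provider": "openai/gpt-4o-mini",           # illustrative provider id
    "temperature": 0.1,                          # overrides both env values above
    "base_url": "https://my-proxy.internal/v1",  # optional custom endpoint
}
print(requests.post("http://localhost:11234/md", json=payload, timeout=60).status_code)
# Tier 4: library defaults apply when none of the above are set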
…ptive-strategies-docs

Update Quickstart and Adaptive Strategies documentation
- Return comprehensive error messages along with status codes for API internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.
fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy
…and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.
…ate .gitignore to include test_scripts directory.
…ring crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
ntohidi and others added 9 commits September 16, 2025 15:45
fix(deep-crawl): BestFirst priority inversion
feat(StealthAdapter): fix stealth features for Playwright integration
#1505 fix(api): update config handling to only set base config if not provided by user
- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation
@ntohidi requested a review from unclecode on September 29, 2025 at 10:11
coderabbitai bot (Contributor) commented Sep 29, 2025

Walkthrough

Bumps release to v0.7.5 and introduces: Docker user hooks system, per-request LLM temperature/base_url propagation, HTTPS preservation for internal links, Playwright stealth adapter, proxy deprecation/autoconversion, serialization/logging fixes, many docs/examples/tests, and Python 3.10+ requirement.

Changes

  • Version & packaging (crawl4ai/__version__.py, pyproject.toml, setup.py, requirements.txt, README.md, .gitignore): Bump version to 0.7.5; require Python >=3.10; add cssselect; update docs/README release headings; expand .gitignore (add *.db, .env*, test_scripts/, scripts/).
  • HTTPS preservation (crawl4ai/utils.py, crawl4ai/async_webcrawler.py, crawl4ai/content_scraping_strategy.py, crawl4ai/async_configs.py, CHANGELOG.md, docs/md_v2/core/deep-crawling.md, tests/test_preserve_https_for_internal_links.py): Add preserve_https_for_internal_links flag across configs; propagate original_scheme; extend normalize_url* APIs to optionally preserve HTTPS for internal links; docs and tests added.
  • Playwright stealth (crawl4ai/browser_adapter.py, crawl4ai/browser_manager.py): Add StealthAdapter, apply stealth per page, simplify startup/teardown, and deprecate legacy stealth context handling.
  • Proxy deprecation & parsing (crawl4ai/async_configs.py, crawl4ai/browser_manager.py, docs/md_v2/advanced/proxy-security.md, tests/proxy/test_proxy_deprecation.py, tests/async/test_0.4.2_browser_manager.py, tests/memory/test_docker_config_gen.py): Deprecate string proxy, prefer proxy_config; add warnings and automatic conversion; enhance proxy parsing; update docs and tests.
  • Docker Hooks system (deploy/docker/hook_manager.py, deploy/docker/server.py, deploy/docker/api.py, deploy/docker/static/playground/index.html, docs/examples/docker_hooks_examples.py, docs/releases_review/demo_v0.7.5.py, tests/docker/test_hooks_client.py, tests/docker/test_hooks_comprehensive.py): New hook manager module and integration: validate/compile/execute user hooks with timeouts/isolation; server and API accept hooks_config, return hooks_info; UI/examples/tests added.
  • LLM parameter propagation, Docker (deploy/docker/api.py, deploy/docker/job.py, deploy/docker/schemas.py, deploy/docker/utils.py, deploy/docker/.llm.env.example, deploy/docker/config.yml, deploy/docker/README.md, docs/md_v2/core/docker-deployment.md, tests/docker/test_llm_params.py): Add per-request/provider temperature and base_url overrides; propagate through API/job flows; update schema models, env examples, utils resolution and tests.
  • Adaptive / embedding config (crawl4ai/adaptive_crawler.py, crawl4ai/async_configs.py, docs/md_v2/core/adaptive-crawling.md, docs/examples/adaptive_crawling/llm_config_example.py, tests/adaptive/test_llm_embedding.py): Accept LLMConfig instances or dicts for embedding config; add compatibility helpers; update examples and tests.
  • Deep-crawling scoring & logging (crawl4ai/deep_crawling/bff_strategy.py, crawl4ai/deep_crawling/bfs_strategy.py, tests/general/test_bff_scoring.py): Ensure real logging.Logger instances, invert score sign for priority queue semantics, adjust metadata scoring and tests.
  • Filters & serialization (crawl4ai/deep_crawling/filters.py, tests/docker/test_filter_deep_crawl.py): Expose constructor inputs (patterns, use_glob, reverse) on URLPatternFilter for serialization; add E2E tests.
  • Serialization & model output fixes (crawl4ai/async_configs.py, crawl4ai/models.py, crawl4ai/async_crawler_strategy.back.py): Avoid serializing private slots, include markdown in CrawlResult serialization, coerce deprecated props, and fix verbose flag reference.
  • Docker server: timezones & streaming (deploy/docker/api.py, deploy/docker/server.py, deploy/docker/static/playground/index.html, tests/docker/test_server_requests.py): Use timezone-aware timestamps, centralize streaming flow, include hooks_info in responses, adjust streaming headers and UI toggles, add tests.
  • New examples & website-to-api (docs/examples/website-to-api/*, docs/examples/adaptive_crawling/*, docs/releases_review/demo_v0.7.5.py): Add full Web Scraper API example (FastAPI + UI + library + tests), adaptive LLM examples, and release demo scripts.
  • Docs, release notes & website (README.md, CHANGELOG.md, docs/**, mkdocs.yml): Add v0.7.5 release notes and blog, document hooks, LLM config, HTTPS flag, proxy deprecation; add marketplace docs, assets, and nav updates.
  • Tests & CI coverage (tests/**): Many new/updated tests covering hooks, LLM params, proxy deprecation, deep crawl scoring, serialization, streaming, adaptive embedding demos.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant AsyncWebCrawler
  participant ContentStrategy as ContentScrapingStrategy
  participant Utils as utils.normalize_url

  Client->>AsyncWebCrawler: arun(url, **kwargs)
  AsyncWebCrawler->>AsyncWebCrawler: original_scheme = urlparse(url).scheme
  AsyncWebCrawler->>ContentStrategy: aprocess_html(..., original_scheme, preserve_https)
  ContentStrategy->>Utils: normalize_url(href, base_url, preserve_https, original_scheme)
  Utils-->>ContentStrategy: normalized URL (HTTPS preserved for internal links)
  ContentStrategy-->>AsyncWebCrawler: processed content
  AsyncWebCrawler-->>Client: results
sequenceDiagram
  participant Client
  participant Server
  participant HookMgr as UserHookManager
  participant Crawler
  participant Strategy

  Client->>Server: POST /crawl {urls, hooks.code, ...}
  Server->>HookMgr: validate + compile hooks
  HookMgr-->>Server: compiled hooks, errors
  Server->>Crawler: create + attach hooks (wrapped)
  Crawler->>Strategy: run crawl (hooks invoked around lifecycle)
  Strategy-->>Crawler: results + hook logs
  Server-->>Client: response {results, hooks_info, status}
sequenceDiagram
  participant Manager as BrowserManager
  participant Playwright
  participant Page
  participant Stealth as StealthAdapter

  Manager->>Playwright: start()
  Manager->>Playwright: browser.new_page()
  Playwright-->>Manager: Page
  Manager->>Stealth: apply_stealth(Page)
  Stealth-->>Manager: ok / warn
  Manager-->>Caller: Page ready (stealth if enabled)
sequenceDiagram
  participant Client
  participant Server
  participant API as deploy/docker/api.py
  participant Utils as deploy/docker/utils.py
  participant LLM

  Client->>Server: POST /md {provider, temperature, base_url}
  Server->>API: handle_markdown_request(...)
  API->>Utils: get_llm_api_key/provider/temp/base_url
  Utils-->>API: resolved params (env or payload)
  API->>LLM: request with provider + overrides
  LLM-->>API: content
  API-->>Client: Markdown JSON

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • 2025 MAY Bug Fixes #1175 — Modifies normalize_url in crawl4ai/utils.py; likely adjacent to the HTTPS-preservation changes here.

Suggested reviewers

  • unclecode

Poem

A rabbit taps its keys with cheer,
New hooks hop in, the signals clear.
HTTPS keeps its hat of blue,
Stealthy pages slip right through.
LLMs warm their tea—v0.7.5, hooray! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Description Check (⚠️ Warning): The description provides a thorough account of new features, bug fixes, breaking changes, files changed, testing, and documentation updates, but it does not follow the repository’s required template headings and omits the mandatory checklist section. Resolution: Please update the description to include the exact template headings—“## Summary”, “## List of files changed and why”, “## How Has This Been Tested?”, and “## Checklist:”—and complete the checklist to confirm adherence to style guidelines, documentation updates, and testing requirements.

✅ Passed checks (2 passed)

  • Title Check (✅ Passed): The pull request title “Release/v0.7.5” directly indicates the primary intent of the changeset, which is bumping the project to version 0.7.5. It succinctly conveys the main update without extraneous details.
  • Docstring Coverage (✅ Passed): Docstring coverage is 83.72%, which is sufficient. The required threshold is 80.00%.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 34

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
deploy/docker/c4ai-doc-context.md (1)

8723-8735: Keep the example aligned with the current arun API

This snippet passes extraction_strategy=strategy directly to crawler.arun() while also creating a CrawlerRunConfig() that doesn’t carry the strategy. In v0.7.x the recommended (and documented earlier in this same file) pattern is to put the strategy inside the run config; the top-level arun() no longer accepts extraction_strategy and will raise a TypeError. Please move the strategy into CrawlerRunConfig to keep the example executable and consistent with the guidance above. Suggested fix:

-    config = CrawlerRunConfig()
+    config = CrawlerRunConfig(extraction_strategy=strategy)
...
-        result = await crawler.arun(
-            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
-            extraction_strategy=strategy,
-            config=config
-        )
+        result = await crawler.arun(
+            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
+            config=config
+        )
deploy/docker/utils.py (1)

22-41: Harden load_config() against missing llm key.

Avoid KeyError when config.yml lacks an llm section.

-    with open(config_path, "r") as config_file:
-        config = yaml.safe_load(config_file)
+    with open(config_path, "r") as config_file:
+        config = yaml.safe_load(config_file) or {}
+    config.setdefault("llm", {})
crawl4ai/async_configs.py (1)

550-586: Round‑trip gaps in BrowserConfig.from_kwargs.

viewport and sleep_on_close are emitted in to_dict() but not accepted in from_kwargs(), breaking round‑trip via dump/load/clone.

         return BrowserConfig(
@@
-            viewport_width=kwargs.get("viewport_width", 1080),
-            viewport_height=kwargs.get("viewport_height", 600),
+            viewport_width=kwargs.get("viewport_width", 1080),
+            viewport_height=kwargs.get("viewport_height", 600),
+            viewport=kwargs.get("viewport"),
@@
-            extra_args=kwargs.get("extra_args", []),
+            extra_args=kwargs.get("extra_args", []),
+            sleep_on_close=kwargs.get("sleep_on_close", False),

Also applies to: 588-624

deploy/docker/api.py (1)

456-466: Force non‑stream mode for this endpoint.

If caller sets stream=True in crawler_config, arun_many will return an async generator and await partial_func() will fail. Explicitly disable streaming here.

         browser_config = BrowserConfig.load(browser_config)
-        crawler_config = CrawlerRunConfig.load(crawler_config)
+        crawler_config = CrawlerRunConfig.load(crawler_config)
+        # Non-streaming endpoint must return materialized results
+        if getattr(crawler_config, "stream", False):
+            crawler_config.stream = False
docs/md_v2/core/docker-deployment.md (1)

61-71: Fix release version references to 0.7.5.

This section still instructs users to pull the 0.7.3 image and states that latest points to 0.7.3. For the 0.7.5 release this is incorrect and will cause users to run the stale build. Please update the version numbers and accompanying note to 0.7.5 before publishing these docs.

crawl4ai/utils.py (2)

1791-1812: Last-attempt raise introduces behavior change; dead branch below.

In RateLimitError handling you now re-raise on the last attempt (Line 1793), but the later else branch (Lines 1804–1811) is unreachable. This both changes the prior contract (callers likely expect an error payload) and leaves dead code. Prefer one clear policy: either return a structured error or re-raise consistently.

Proposed fix: keep non-throwing contract and remove the dead branch.

@@
-        except RateLimitError as e:
-            print("Rate limit error:", str(e))
-            if attempt == max_attempts - 1:
-                # Last attempt failed, raise the error.
-                raise
-
-            # Check if we have exhausted our max attempts
-            if attempt < max_attempts - 1:
-                # Calculate the delay and wait
-                delay = base_delay * (2**attempt)  # Exponential backoff formula
-                print(f"Waiting for {delay} seconds before retrying...")
-                time.sleep(delay)
-            else:
-                # Return an error response after exhausting all retries
-                return [
-                    {
-                        "index": 0,
-                        "tags": ["error"],
-                        "content": ["Rate limit error. Please try again later."],
-                    }
-                ]
+        except RateLimitError as e:
+            print("Rate limit error:", str(e))
+            if attempt == max_attempts - 1:
+                # Exhausted: return structured error (preserve prior behavior)
+                return [{
+                    "index": 0,
+                    "tags": ["error"],
+                    "content": ["Rate limit error. Please try again later."]
+                }]
+            # Exponential backoff
+            delay = base_delay * (2**attempt)
+            print(f"Waiting for {delay} seconds before retrying...")
+            time.sleep(delay)

2126-2142: Duplicate normalize_url definition causes confusion.

There are two normalize_url functions; Python will keep the latter one (Lines 2146+). The earlier simple version (Lines 2126–2142) is dead code and misleading.

-def normalize_url(href, base_url):
-    """Normalize URLs to ensure consistent format"""
-    from urllib.parse import urljoin, urlparse
-
-    # Parse base URL to get components
-    parsed_base = urlparse(base_url)
-    if not parsed_base.scheme or not parsed_base.netloc:
-        raise ValueError(f"Invalid base URL format: {base_url}")
-    
-    if  parsed_base.scheme.lower() not in ["http", "https"]:
-        # Handle special protocols
-        raise ValueError(f"Invalid base URL format: {base_url}")
-    cleaned_href = href.strip()
-
-    # Use urljoin to handle all cases
-    return urljoin(base_url, cleaned_href)
+# (Removed duplicate normalize_url; single extended version lives below)
crawl4ai/adaptive_crawler.py (1)

1476-1480: Links mutation assumes dict; breaks when Links object is returned.

Here you subscript result.links as a dict unconditionally. Elsewhere you correctly branch by type. This will raise at runtime if links is a model object.

-            # Filter our all links do not have head_date
-            if hasattr(result, 'links') and result.links:
-                result.links['internal'] = [link for link in result.links['internal'] if link.get('head_data')]
-                # For now let's ignore external links without head_data
-                # result.links['external'] = [link for link in result.links['external'] if link.get('head_data')]
+            # Filter out links without head_data
+            if hasattr(result, 'links') and result.links:
+                if isinstance(result.links, dict):
+                    internal = [l for l in result.links.get('internal', []) if l.get('head_data')]
+                    result.links['internal'] = internal
+                    # Optionally filter external similarly
+                    # result.links['external'] = [l for l in result.links.get('external', []) if l.get('head_data')]
+                else:
+                    # Links object with .internal/.external lists of Link models
+                    result.links.internal = [l for l in result.links.internal if getattr(l, 'head_data', None)]
+                    # Optionally: result.links.external = [l for l in result.links.external if getattr(l, 'head_data', None)]
🧹 Nitpick comments (61)
docs/examples/website-to-api/requirements.txt (1)

1-5: Pin example dependencies to tested versions.

Leaving these requirements unconstrained makes the demo brittle—any future major release of FastAPI, Uvicorn, Pydantic, or LiteLLM (or even crawl4ai itself) can introduce breaking changes and silently break the example. Please lock each dependency to the exact versions you validated for v0.7.5 (or at least constrain to compatible ranges) so users can reproduce the documented behavior.

CHANGELOG.md (1)

8-17: Align changelog entry with the 0.7.5 release

This branch is publishing v0.7.5, but the new note lives under a fresh “Unreleased” section, while another “Unreleased” still exists further down. Please fold this into a ## [0.7.5] - 2025-09-29 section (or move the flag under the upcoming release heading) so the changelog isn’t left with duplicate Unreleased buckets.

tests/docker/test_filter_deep_crawl.py (5)

11-11: Parameterize BASE_URL (avoid hard-coded port).

Read from env with a sane default so CI/dev runs don't depend on port 11234 being used.

-BASE_URL = "http://localhost:11234/"  # Adjust port as needed
+import os
+BASE_URL = os.getenv("C4A_BASE_URL", "http://localhost:8000/")

21-74: Use try/except/else for clearer success path.

Move the success print/return into an else: block to satisfy TRY300 and clarify flow.

-    try:
+    try:
         async with Crawl4aiDockerClient(
             base_url=BASE_URL,
             verbose=True,
         ) as client:
             ...
-        print("\n✅ Docker client test completed successfully!")
-        return True
+    except httpx.HTTPError as e:
+        print(f"❌ Docker client test failed (HTTP): {e}")
+        import traceback; traceback.print_exc()
+        return False
+    except Exception as e:
+        print(f"❌ Docker client test failed: {e}")
+        import traceback; traceback.print_exc()
+        return False
+    else:
+        print("\n✅ Docker client test completed successfully!")
+        return True

75-79: Avoid blind except Exception; catch httpx.HTTPError first.

Narrowing exceptions improves debuggability and addresses BLE001.

-    except Exception as e:
+    except httpx.HTTPError as e:
+        print(f"❌ REST API test failed (HTTP): {e}")
+        import traceback; traceback.print_exc()
+        return False
+    except Exception as e:
         print(f"❌ REST API test failed: {e}")
         import traceback
         traceback.print_exc()
         return False

Also applies to: 151-155


45-51: Prefer assertions over prints if this is meant for CI.

If this file lives under tests/, replace prints with asserts (pytest + pytest-asyncio). If it is a demo, consider moving it to docs/ or scripts/ to avoid CI discovery confusion.

Would you like a pytest-asyncio version with proper assertions and markers?

Also applies to: 124-131


52-70: Clarify result typing (list vs object vs stream).

Instead of hasattr checks, branch on known types: list, AsyncGenerator, or CrawlResult. This avoids surprising paths.

Also applies to: 132-146

deploy/docker/utils.py (5)

74-90: Silence ARG001 and document intent.

provider is unused by design; rename to _provider and note that litellm resolves env vars.

-def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> Optional[str]:
+def get_llm_api_key(config: Dict, _provider: Optional[str] = None) -> Optional[str]:
     """Get the appropriate API key based on the LLM provider.
@@
-    # Return None - litellm will automatically find the right environment variable
+    # Return None - litellm will automatically find the right environment variable
     return None

92-109: Validation always returns True — consider a debug log.

Returning (True, "") defers validation to litellm; add a debug log when no direct key is present to aid ops.


57-62: Use typing.Any in datetime_handler signature.

Minor typing nit.

-from typing import Dict, Optional
+from typing import Any, Dict, Optional
@@
-def datetime_handler(obj: any) -> Optional[str]:
+def datetime_handler(obj: Any) -> Optional[str]:

111-146: Consider config-based fallback for temperature/base_url.

Optional: if env vars are absent, read config.get("llm", {}).get("temperature"/"base_url") before returning None.

Do you want me to wire this in and update deploy/docker/api.py callers accordingly?

Also applies to: 148-172


174-181: Catch specific DNS errors (avoid blind except).

Narrow exception scope and avoid unused variable.

-    except Exception as e:
-        return False
+    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer, dns.resolver.NoNameservers, dns.exception.DNSException):
+        return False
crawl4ai/async_configs.py (5)

261-294: Proxy string parsing: add IPv6 safety and tests.

The split-on-":" logic will misparse IPv6 literals. Consider using urllib.parse for URL forms and a regex for colon-forms, or require scheme for IPv6. Add tests for:

  • http(s)://user:pass@host:port
  • socks5://host:port
  • ip:port and ip:port:user:pass
  • IPv6 [::1]:8080 with/without scheme

Typo: fa_user_agenr_generator → fa_user_agent_generator.

Minor readability fix.

-        fa_user_agenr_generator = ValidUAGenerator()
+        fa_user_agent_generator = ValidUAGenerator()
         if self.user_agent_mode == "random":
-            self.user_agent = fa_user_agenr_generator.generate(
+            self.user_agent = fa_user_agent_generator.generate(
                 **(self.user_agent_generator_config or {})
             )

1454-1462: Don’t raise on deprecated attrs; warn instead.

Raising in __setattr__ breaks backward compatibility whenever callers pass these kwargs. Emit a deprecation warning and keep setting (or translate to cache_mode) for one release.

-        if name in self._UNWANTED_PROPS and value is not all_params[name].default:
-            raise AttributeError(f"Setting '{name}' is deprecated. {self._UNWANTED_PROPS[name]}")
+        if name in self._UNWANTED_PROPS and value is not all_params[name].default:
+            warnings.warn(f"'{name}' is deprecated. {self._UNWANTED_PROPS[name]}", UserWarning, stacklevel=2)

Also applies to: 1069-1074


1502-1503: Use PAGE_TIMEOUT constant in from_kwargs.

Keep defaults consistent with __init__.

-            page_timeout=kwargs.get("page_timeout", 60000),
+            page_timeout=kwargs.get("page_timeout", PAGE_TIMEOUT),

223-347: Consolidate ProxyConfig definitions
Remove the duplicate ProxyConfig class in crawl4ai/proxy_strategy.py and have that module re-export ProxyConfig from crawl4ai/async_configs.py (e.g. via from .async_configs import ProxyConfig) with a deprecation warning stub, so there’s a single source-of-truth and no drift.

deploy/docker/hook_manager.py (6)

19-31: Annotate mutable class attributes as ClassVar to satisfy linters and intent.

Prevents them from being treated as instance attributes.

-from typing import Dict, Callable, Optional, Tuple, List, Any
+from typing import Dict, Callable, Optional, Tuple, List, Any, ClassVar
@@
-    HOOK_SIGNATURES = {
+    HOOK_SIGNATURES: ClassVar[Dict[str, List[str]]] = {
@@
-    DEFAULT_TIMEOUT = 30
+    DEFAULT_TIMEOUT: ClassVar[int] = 30

68-71: Nit: remove unnecessary f-string.

No placeholders present.

-                return False, f"Hook function must be async (use 'async def' instead of 'def')"
+                return False, "Hook function must be async (use 'async def' instead of 'def')"

168-176: Prefer logger.exception to capture stack traces for compilation failures.

Improves diagnostics.

-            logger.error(f"Hook compilation failed for {hook_point}: {str(e)}")
+            logger.exception("Hook compilation failed for %s: %s", hook_point, e)

322-327: Prefer logger.exception in unexpected wrapper errors.

Captures traceback for postmortem.

-                logger.error(f"Unexpected error in hook wrapper for {hook_point}: {e}")
+                logger.exception("Unexpected error in hook wrapper for %s: %s", hook_point, e)

494-499: Prefer logger.exception on attach failures; include exception context.

-        except Exception as e:
-            logger.error(f"Failed to attach hook to {hook_point}: {e}")
+        except Exception as e:
+            logger.exception("Failed to attach hook to %s: %s", hook_point, e)
             validation_errors.append({
                 'hook_point': hook_point,
                 'error': f'Failed to attach hook: {str(e)}'
             })

118-141: Security posture note: import allows escaping builtin restrictions.

Even with curated builtins, users can import builtins and regain open, etc. If this is intended (trusted hooks inside Docker), document clearly. If not, add an AST gate on Import/ImportFrom names (allowlist) or a custom import stub.

Would you like a follow-up patch to allowlist e.g. {"asyncio","json","re"} and block builtins, os, subprocess at AST-validation time?

Also applies to: 150-166

deploy/docker/api.py (4)

640-649: Normalize URL schemes in streaming path too.

Parity with non‑streaming handler avoids accidental scheme-less inputs breaking streaming.

-        # Attach hooks if provided
+        # Normalize URLs (add https:// when missing)
+        urls = [('https://' + u) if not u.startswith(('http://','https://')) and not u.startswith(("raw:", "raw://")) else u for u in urls]
+
+        # Attach hooks if provided

545-546: Use logger.exception for serialization/hook data errors.

Captures stack traces; easier ops triage.

-                logger.error(f"Error processing result: {e}")
+                logger.exception("Error processing result: %s", e)
@@
-                    logger.error(f"Hook data not JSON serializable: {e}")
+                    logger.exception("Hook data not JSON serializable: %s", e)
@@
-                logger.error(f"Serialization error: {e}")
+                logger.exception("Serialization error: %s", e)

Also applies to: 577-578, 426-427


352-359: Standardize timestamps to UTC ISO‑8601 with timezone.

Two paths write different timestamp flavors (local naive vs UTC-naive). Prefer UTC with offset or 'Z' for consistency.

-    from datetime import datetime
-    task_id = f"llm_{int(datetime.now().timestamp())}_{id(background_tasks)}"
+    from uuid import uuid4
+    task_id = f"llm_{uuid4().hex[:8]}"
@@
-        "created_at": datetime.now().isoformat(),
+        "created_at": datetime.now(timezone.utc).isoformat().replace("+00:00","Z"),
@@
-        "created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
+        "created_at": datetime.now(timezone.utc).isoformat().replace("+00:00","Z"),

If external clients parse created_at, confirm they accept the trailing 'Z'. I can adjust to keep the offset form if preferred.

Also applies to: 693-695


476-483: Import path robustness for in‑package modules.

from hook_manager import ... relies on working CWD/sys.path. Consider relative import (from .hook_manager import ...) for package safety, mirroring how deploy/docker is laid out.

Do we install deploy/docker as a package (so relative imports are valid), or run it as a script with PYTHONPATH hacks? I can adjust to match your runtime.

Also applies to: 642-649

crawl4ai/async_webcrawler.py (1)

357-357: Optional: move import to module scope.

Minor micro‑perf/readability.

-                    from urllib.parse import urlparse
+from urllib.parse import urlparse
docs/md_v2/api/parameters.md (1)

8-15: Keep the example aligned with the proxy deprecation

The table now marks proxy as deprecated, but the code snippet immediately above still demonstrates proxy= usage. Please switch the example to proxy_config so readers don’t copy a deprecated pattern.

 browser_cfg = BrowserConfig(
     browser_type="chromium",
     headless=True,
     viewport_width=1280,
     viewport_height=720,
-    proxy="http://user:pass@proxy:8080",
+    proxy_config={
+        "server": "http://proxy:8080",
+        "username": "user",
+        "password": "pass",
+    },
     user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
 )
docs/examples/website-to-api/static/styles.css (4)

56-66: Add visible focus states for keyboard users.

Interactive elements lack explicit focus-visible styling, which is an accessibility blocker for keyboard navigation. Add clear focus indicators to links and buttons.

+.nav-link:focus-visible {
+  outline: 2px solid #09b5a5;
+  outline-offset: 2px;
+}
+
+.extract-btn:focus-visible,
+.copy-btn:focus-visible,
+.save-btn:focus-visible,
+.btn-danger:focus-visible {
+  outline: 2px solid #09b5a5;
+  outline-offset: 2px;
+}

Also applies to: 210-229, 338-355, 581-599, 649-666


259-264: Avoid removing outlines on focus; replace with accessible outline.

Using outline: none removes a critical focus cue. Provide a replacement outline for accessibility.

-select:focus,
-.input-group select:focus,
-.option-group select:focus {
-    outline: none !important;
-    border-color: #09b5a5 !important;
-}
+select:focus,
+.input-group select:focus,
+.option-group select:focus {
+    outline: 2px solid #09b5a5 !important;
+    outline-offset: 2px !important;
+    border-color: #09b5a5 !important;
+}

299-311: Improve long-text wrapping in code blocks to preserve readability.

word-break: break-all splits tokens mid-word; prefer overflow-wrap:anywhere and word-break:normal.

 .api-request-box pre,
 .json-response-box pre {
   font-family: 'Courier New', monospace;
   font-size: 0.85rem;
   line-height: 1.5;
   color: #FFFFFF;
   background: #1A1A1A;
   padding: 1rem;
   border-radius: 4px;
   overflow-x: auto;
-  white-space: pre-wrap;
-  word-break: break-all;
+  white-space: pre-wrap;
+  word-break: normal;
+  overflow-wrap: anywhere;
 }
@@
 .request-curl pre {
   color: #CCCCCC;
   font-size: 0.8rem;
   line-height: 1.4;
   overflow-x: auto;
-  white-space: pre-wrap;
-  word-break: break-all;
+  white-space: pre-wrap;
+  word-break: normal;
+  overflow-wrap: anywhere;
   background: #111111;
   padding: 0.75rem;
   border-radius: 4px;
   border: 1px solid #333;
 }

Also applies to: 536-547


412-425: Respect reduced-motion preferences.

Offer a non-animated spinner for users with motion sensitivity.

 @keyframes spin {
     0% { transform: rotate(0deg); }
     100% { transform: rotate(360deg); }
 }
+
+@media (prefers-reduced-motion: reduce) {
+  .spinner {
+    animation: none;
+    border-top-color: #09b5a5; /* static indicator */
+  }
+}
docs/examples/website-to-api/api_server.py (4)

165-171: Align response serialization with Pydantic v2.

Serialize ScrapeResponse using model_dump().

-            response=response_data.dict()
+            response=response_data.model_dump() if hasattr(response_data, "model_dump") else response_data.dict()

Also applies to: 223-229


140-174: Slim down blanket exception handling and preserve cause.

Catching Exception broadly is noisy; where you keep it, chain the cause and move success return to an else: block for clarity.

-    try:
+    try:
         # Save the API request
         headers = {"Content-Type": "application/json"}
         body = {
             "url": str(request.url),
             "query": request.query,
             "model_name": request.model_name
         }
         
         result = await scraper_agent.scrape_data(
             url=str(request.url),
             query=request.query,
             model_name=request.model_name
         )
         
         response_data = ScrapeResponse(
             success=True,
             url=result["url"],
             query=result["query"],
             extracted_data=result["extracted_data"],
             schema_used=result["schema_used"],
             timestamp=result["timestamp"]
         )
-        
-        # Save the request with response
-        save_api_request(
-            endpoint="/scrape",
-            method="POST",
-            headers=headers,
-            body=body,
-            response=response_data.dict()
-        )
-        
-        return response_data
-    
-    except Exception as e:
+    except Exception as e:
         # Save the failed request
         headers = {"Content-Type": "application/json"}
         body = {
             "url": str(request.url),
             "query": request.query,
             "model_name": request.model_name
         }
         
         save_api_request(
             endpoint="/scrape",
             method="POST",
             headers=headers,
             body=body,
             response={"error": str(e)}
         )
-        
-        raise HTTPException(status_code=500, detail=f"Scraping failed: {str(e)}")
+        raise HTTPException(status_code=500, detail=f"Scraping failed: {e}") from e
+    else:
+        save_api_request(
+            endpoint="/scrape",
+            method="POST",
+            headers=headers,
+            body=body,
+            response=response_data.model_dump() if hasattr(response_data, "model_dump") else response_data.dict()
+        )
+        return response_data

Also applies to: 175-193


92-109: Avoid blocking the event loop with file I/O in request paths.

get_saved_requests() does directory scans and file reads on the event loop. Consider offloading to a thread or using aiofiles to keep the API responsive under load.


13-26: Consider enabling CORS for the example app.

If the static UI is served from a different origin, add CORS to simplify local demos.

 from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
@@
 app = FastAPI(
@@
 )
+
+# CORS (demo-friendly; tighten in prod)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
docs/examples/adaptive_crawling/llm_config_example.py (1)

12-14: AsyncWebCrawler(verbose=...) is likely ignored.

Constructor typically expects config via BrowserConfig; the verbose kwarg may be dropped. Either pass a config or omit it.

-    async with AsyncWebCrawler(verbose=False) as crawler:
+    async with AsyncWebCrawler() as crawler:
@@
-    async with AsyncWebCrawler(verbose=True) as crawler:
+    async with AsyncWebCrawler() as crawler:

If you want explicit verbosity:

from crawl4ai import BrowserConfig
async with AsyncWebCrawler(config=BrowserConfig(verbose=False)) as crawler:
    ...

Also applies to: 96-99

tests/docker/test_hooks_comprehensive.py (6)

1-1: Remove shebang or make file executable.

Tests don’t need a shebang; drop it for cleanliness.

-#!/usr/bin/env python3

7-13: Add a global timeout for HTTP calls.

Avoid hanging tests; use a single TIMEOUT constant.

 import requests
 import json
 import time
+import os
 from typing import Dict, Any
 
-API_BASE_URL = "http://localhost:11234"
+API_BASE_URL = "http://localhost:11234"
+TIMEOUT = int(os.getenv("HOOK_TEST_TIMEOUT", "30"))

168-171: Pass an explicit timeout to requests.post.

Prevents indefinite hangs and aligns with best practices.

-response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)

Also applies to: 281-283, 375-379, 465-469


189-195: Remove pointless f-strings.

These f-strings have no placeholders.

-                print(f"\n📈 Execution Statistics:")
+                print("\n📈 Execution Statistics:")
@@
-                print(f"\n📝 Execution Log:")
+                print("\n📝 Execution Log:")
@@
-            print(f"\n📄 Crawl Results:")
+            print("\n📄 Crawl Results:")

Also applies to: 197-205


213-219: Avoid bare except; catch JSON decode errors precisely.

Catching everything hides real issues.

-        except:
-            print(f"Error text: {response.text[:500]}")
+        except (ValueError, json.JSONDecodeError):
+            print(f"Error text: {response.text[:500]}")

501-505: Narrow the exception in the test runner.

Catching Exception is fine for a top-level test harness, but prefer logging the type and message.

-        except Exception as e:
-            print(f"❌ {name} failed: {e}")
+        except Exception as e:
+            print(f"❌ {name} failed: {type(e).__name__}: {e}")
             import traceback
             traceback.print_exc()
crawl4ai/utils.py (1)

2187-2196: HTTPS preservation: compare hostnames (ignore default ports) and deduplicate logic.

Current check uses parsed_full.netloc == parsed_base.netloc. This fails when one side includes default ports (e.g., example.com vs example.com:443) and repeats across three functions.

  • Compare hostname fields to ignore ports.
  • Factor the preserve-HTTPS snippet into a helper to avoid divergence.
-        if (parsed_full.scheme == 'http' and 
-            parsed_full.netloc == parsed_base.netloc and
+        if (parsed_full.scheme == 'http' and 
+            parsed_full.hostname == parsed_base.hostname and
             not href.strip().startswith('//')):
             full_url = full_url.replace('http://', 'https://', 1)

Apply the same change in normalize_url_for_deep_crawl and efficient_normalize_url_for_deep_crawl.

Optional: extract helper (outside these ranges):

def _preserve_https_if_internal(full_url, href, base_url, preserve_https, original_scheme):
    if preserve_https and original_scheme == 'https':
        pf, pb = urlparse(full_url), urlparse(base_url)
        if pf.scheme == 'http' and pf.hostname == pb.hostname and not href.strip().startswith('//'):
            return full_url.replace('http://', 'https://', 1)
    return full_url

Then call it in all three places.

Also applies to: 2260-2268, 2317-2325

crawl4ai/browser_adapter.py (2)

173-185: Use callable() and iscoroutinefunction; avoid blanket except.

  • Replace hasattr(self._stealth_function, '__call__') with callable(...).
  • Use inspect.iscoroutinefunction to decide await.
  • Avoid silent except Exception: pass to preserve diagnosability (log at least).
+import inspect
@@
-        if self._stealth_available and self._stealth_function:
-            try:
-                if hasattr(self._stealth_function, '__call__'):
-                    if 'async' in getattr(self._stealth_function, '__name__', ''):
-                        await self._stealth_function(page)
-                    else:
-                        self._stealth_function(page)
-            except Exception as e:
-                # Fail silently or log error depending on requirements
-                pass
+        if self._stealth_available and callable(self._stealth_function):
+            try:
+                if inspect.iscoroutinefunction(self._stealth_function):
+                    await self._stealth_function(page)
+                else:
+                    self._stealth_function(page)
+            except Exception as e:
+                # Log at debug level or collect in a diagnostics sink
+                # print(f"stealth apply failed: {e}")
+                return

261-264: Unused parameter naming to silence linters.

retrieve_console_messages(self, page) ignores page. Consider _page to reflect intentional unused parameter.

-    async def retrieve_console_messages(self, page: Page) -> List[Dict]:
+    async def retrieve_console_messages(self, _page: Page) -> List[Dict]:
         """Not needed for Playwright - messages are captured via events"""
         return []
docs/examples/docker_hooks_examples.py (3)

169-173: Add request timeouts to prevent hangs.

All requests.post(...) calls lack a timeout. Add a module-level TIMEOUT and pass timeout=TIMEOUT.

+TIMEOUT = 30
@@
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)
@@
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)
@@
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)
@@
-    response = requests.post(f"{API_BASE_URL}/crawl", json=payload)
+    response = requests.post(f"{API_BASE_URL}/crawl", json=payload, timeout=TIMEOUT)

Also applies to: 281-283, 376-379, 466-470


190-206: Remove f-strings without placeholders.

These are plain strings; drop the f prefix.

-                print(f"\n📈 Execution Statistics:")
+                print("\n📈 Execution Statistics:")
-                print(f"\n📝 Execution Log:")
+                print("\n📝 Execution Log:")
-            print(f"\n📄 Crawl Results:")
+            print("\n📄 Crawl Results:")

214-220: Narrow exception type when parsing JSON error body.

Use ValueError (or requests.JSONDecodeError) instead of bare except.

-        except:
+        except ValueError:
             print(f"Error text: {response.text[:500]}")
crawl4ai/adaptive_crawler.py (2)

620-624: Type annotation should include None explicitly.

Defaulting llm_config=None with Union[...] triggers type-checker warnings. Use Optional[Union[LLMConfig, Dict]].

-    def __init__(self, embedding_model: str = None, llm_config: Union[LLMConfig, Dict] = None):
+    def __init__(self, embedding_model: str = None, llm_config: Optional[Union[LLMConfig, Dict]] = None):

988-993: Store learning_score metric for downstream consumers.

get_quality_confidence and stats expect learning_score, but calculate_confidence no longer sets it. Persist it in state.metrics.

         score = float((best >= tau).mean()) if tau is not None else float(best.mean())
 
         # Store quick metrics
         state.metrics['coverage_score'] = score
+        state.metrics['learning_score'] = score
         state.metrics['avg_best_similarity'] = float(best.mean())
         state.metrics['median_best_similarity'] = float(np.median(best))
deploy/docker/server.py (2)

575-579: All-results-failed branch can IndexError on empty list.

Guard against empty results before indexing.

-    if all(not result["success"] for result in results["results"]):
-        raise HTTPException(500, f"Crawl request failed: {results['results'][0]['error_message']}")
+    if not results["results"] or all(not r.get("success") for r in results["results"]):
+        first_err = next((r.get("error_message") for r in results.get("results", []) if r.get("error_message")), "Crawl failed")
+        raise HTTPException(500, f"Crawl request failed: {first_err}")

618-626: Header value should be plain string, not JSON-quoted.

hooks_info['status']['status'] is already a string; avoid json.dumps to prevent adding quotes in header value.

-        headers["X-Hooks-Status"] = json.dumps(hooks_info['status']['status'])
+        headers["X-Hooks-Status"] = str(hooks_info['status']['status'])
docs/releases_review/demo_v0.7.5.py (4)

63-93: Hook config: brace-glob won’t match; add marker hook and align summary

  • Playwright’s glob matcher doesn’t support the {png,jpg,...} brace pattern; routes won’t be registered as intended.
  • You check for “Crawl4AI v0.7.5 Docker Hook” in HTML and print summaries for hooks you didn’t configure.

Replace the image-blocking route and add a before_return_html hook to inject a marker. Align the summary with configured hooks.

         "on_page_context_created": """
 async def hook(page, context, **kwargs):
     print("Hook: Setting up page context")
-    # Block images to speed up crawling
-    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
+    # Block images to speed up crawling (register per-extension)
+    for ext in ["png", "jpg", "jpeg", "gif", "webp"]:
+        await context.route(f"**/*.{ext}", lambda route: route.abort())
     print("Hook: Images blocked")
     return page
 """,
@@
         "before_goto": """
 async def hook(page, context, url, **kwargs):
     print(f"Hook: About to navigate to {url}")
     # Add custom headers
     await page.set_extra_http_headers({
         'X-Test-Header': 'crawl4ai-hooks-test'
     })
     return page
 """,
+        "before_return_html": """
+async def hook(page, context, **kwargs):
+    print("Hook: before_return_html - injecting marker")
+    await page.evaluate(\"\"\"document.body.insertAdjacentHTML('beforeend', '<!-- Crawl4AI v0.7.5 Docker Hook -->')\"\"\" )
+    return page
+""",
     }
@@
-            print("\nHook Execution Summary:")
-            print("🔗 before_goto: URL modified with tracking parameter")
-            print("✅ after_goto: Page navigation completed")
-            print("📝 before_return_html: Content processed and marked")
+            print("\nHook Execution Summary:")
+            print("🔗 before_goto: Custom header set")
+            print("🧹 on_page_context_created: Images blocked")
+            print("📝 before_return_html: Marker injected")

Also applies to: 141-151


47-49: Avoid bare except; catch specific exceptions

Catching everything hides useful diagnostics. Use narrower exceptions for requests and JSON parsing.

-        except:
+        except requests.RequestException:
             return False
@@
-            except:
+            except ValueError:
                 print(f"Raw response: {response.text[:500]}")

Also applies to: 155-158


160-163: Use direct f-string interpolation and keep trace context

Prefer f"{e}" (or {e!s}) to str(e). Optionally print a short traceback when debugging.

-    except requests.exceptions.Timeout:
+    except requests.exceptions.Timeout:
         print("⏰ Request timed out after 60 seconds")
-    except Exception as e:
-        print(f"❌ Error: {str(e)}")
+    except Exception as e:
+        print(f"❌ Error: {e}")
@@
-        except Exception as e:
-            print(f"❌ Demo {i} error: {str(e)}")
+        except Exception as e:
+            print(f"❌ Demo {i} error: {e}")
             print("Continuing to next demo...")
@@
-    except Exception as e:
-        print(f"\n❌ Demo error: {str(e)}")
+    except Exception as e:
+        print(f"\n❌ Demo error: {e}")
         print("Make sure you have the required dependencies installed.")

Also applies to: 276-279, 307-309


194-196: Remove f-strings without placeholders

These strings don’t interpolate values; drop the “f” prefix.

-            print(f"  - Note: Actual LLM call may fail without valid API key")
+            print("  - Note: Actual LLM call may fail without valid API key")
@@
-        import requests
-        print(f"  - Requests library: ✓")
+        import requests
+        print("  - Requests library: ✓")
     except ImportError:
-        print(f"  - Requests library: ❌")
+        print("  - Requests library: ❌")
@@
-            print(f"\n⏹️  Demo interrupted by user")
+            print("\n⏹️  Demo interrupted by user")

Also applies to: 253-256, 275-276

docs/examples/website-to-api/web_scraper_lib.py (3)

53-57: Avoid MD5 for cache keys; switch to SHA-256

MD5 is flagged as insecure even for non-crypto uses. Use SHA-256 for low collision risk at similar cost.

-    def _generate_schema_key(self, url: str, query: str) -> str:
+    def _generate_schema_key(self, url: str, query: str) -> str:
         """Generate a unique key for schema caching based on URL and query."""
-        content = f"{url}:{query}"
-        return hashlib.md5(content.encode()).hexdigest()
+        content = f"{url}:{query}".encode("utf-8")
+        return hashlib.sha256(content).hexdigest()

135-172: Crawl result access and file IO robustness

  • Prefer explicit encodings when caching schema.
  • If CrawlResultContainer does not proxy attributes, accessing result.fit_html could fail. Consider accessing the underlying result safely.
-            html = result.fit_html
+            html = getattr(result, "fit_html", None) or getattr(getattr(result, "result", None), "fit_html", "")
@@
-        with open(schema_path, "w") as f:
+        with open(schema_path, "w", encoding="utf-8") as f:
             json.dump(schema, f, indent=2)
@@
-            result = await crawler.arun(url=url, config=run_config)
+            result = await crawler.arun(url=url, config=run_config)
             # Parse extracted_content if it's a JSON string
             extracted_data = result.extracted_content

Optionally, guard extracted_content access similarly to fit_html if Container may not proxy.
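
For example, a similarly defensive lookup could look like this (a sketch only; it mirrors the fit_html fallback above and assumes the container exposes the wrapped result as .result):

    # Fall back to the wrapped result object if the container does not proxy the attribute
    extracted_data = getattr(result, "extracted_content", None) \
        or getattr(getattr(result, "result", None), "extracted_content", None)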

Also applies to: 199-203, 328-335


354-370: Nit: use encoding and handle non-existent dir cleanly

Use explicit encoding and ignore missing dir errors.

-        import shutil
-        if os.path.exists(self.schemas_dir):
-            shutil.rmtree(self.schemas_dir)
+        import shutil
+        if os.path.exists(self.schemas_dir):
+            shutil.rmtree(self.schemas_dir)
         os.makedirs(self.schemas_dir, exist_ok=True)
         print("Schema cache cleared")

Optionally wrap rmtree in try/except OSError to report permission issues.
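
A minimal sketch of that guard (assuming the same self.schemas_dir attribute and that os is already imported in this module):

    import shutil

    try:
        shutil.rmtree(self.schemas_dir)
    except FileNotFoundError:
        pass  # nothing to clear yet
    except OSError as exc:
        print(f"Could not clear schema cache: {exc}")
    os.makedirs(self.schemas_dir, exist_ok=True)
    print("Schema cache cleared")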

Comment on lines +728 to 730
provider = llm_config_dict.get('provider', 'openai/gpt-4o-mini') if llm_config_dict else 'openai/gpt-4o-mini'
api_token = llm_config_dict.get('api_token') if llm_config_dict else None


⚠️ Potential issue | 🔴 Critical

Hardcoded query variations (fried rice) are a blocker.

map_query_semantic_space returns a static, irrelevant set of queries. This breaks adaptive crawling for any real query.

Immediate fix: call the LLM when configured; fall back to simple perturbations otherwise.

-        # response = perform_completion_with_backoff(
-        #     provider=provider,
-        #     prompt_with_variables=prompt,
-        #     api_token=api_token,
-        #     json_response=True
-        # )
-        # variations = json.loads(response.choices[0].message.content)
-        # # Mock data with more variations for split
-        variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
+        variations = None
+        if llm_config_dict:
+            try:
+                response = perform_completion_with_backoff(
+                    provider=llm_config_dict.get('provider', 'openai/gpt-4o-mini'),
+                    prompt_with_variables=prompt,
+                    api_token=llm_config_dict.get('api_token'),
+                    json_response=True
+                )
+                variations = json.loads(response.choices[0].message.content)
+            except Exception:
+                variations = None
+        if not variations or 'queries' not in variations:
+            # Lightweight deterministic fallback
+            base = query.strip().rstrip("?")
+            variations = {'queries': [
+                base,
+                f"{base} overview",
+                f"{base} tutorial",
+                f"{base} examples",
+                f"{base} best practices",
+                f"{base} troubleshooting",
+                f"{base} advanced",
+                f"{base} quick start",
+                f"{base} guide",
+                f"{base} faq",
+                f"{base} tips"
+            ]}
@@
-        other_queries = variations['queries'].copy()
+        other_queries = [q for q in variations['queries'] if q != query]

Also applies to: 741-744, 765-781

🧰 Tools
🪛 Ruff (0.13.1)

728-728: Local variable provider is assigned to but never used

Remove assignment to unused variable provider

(F841)


729-729: Local variable api_token is assigned to but never used

Remove assignment to unused variable api_token

(F841)

Comment on lines +471 to 486
warnings.warn("The 'proxy' parameter is deprecated and will be removed in a future release. Use 'proxy_config' instead.", UserWarning)
self.proxy = proxy
self.proxy_config = proxy_config
if isinstance(self.proxy_config, dict):
self.proxy_config = ProxyConfig.from_dict(self.proxy_config)
if isinstance(self.proxy_config, str):
self.proxy_config = ProxyConfig.from_string(self.proxy_config)


if self.proxy and self.proxy_config:
warnings.warn("Both 'proxy' and 'proxy_config' are provided. 'proxy_config' will take precedence.", UserWarning)
self.proxy = None
elif self.proxy:
# Convert proxy string to ProxyConfig if proxy_config is not provided
self.proxy_config = ProxyConfig.from_string(self.proxy)
self.proxy = None


🛠️ Refactor suggestion | 🟠 Major

Deprecation warnings: add stacklevel=2.

So the warning points to the caller.

-            warnings.warn("The 'proxy' parameter is deprecated and will be removed in a future release. Use 'proxy_config' instead.", UserWarning)
+            warnings.warn(
+                "The 'proxy' parameter is deprecated and will be removed in a future release. Use 'proxy_config' instead.",
+                UserWarning,
+                stacklevel=2,
+            )
@@
-            warnings.warn("Both 'proxy' and 'proxy_config' are provided. 'proxy_config' will take precedence.", UserWarning)
+            warnings.warn(
+                "Both 'proxy' and 'proxy_config' are provided. 'proxy_config' will take precedence.",
+                UserWarning,
+                stacklevel=2,
+            )
🧰 Tools
🪛 Ruff (0.13.1)

471-471: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)


480-480: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

🤖 Prompt for AI Agents
In crawl4ai/async_configs.py around lines 471 to 486, the warnings.warn calls
should point to the caller; update each warnings.warn in this block to include
stacklevel=2 (e.g., warnings.warn(..., UserWarning, stacklevel=2)) so the
warning location is the caller rather than inside this module; ensure you add
stacklevel=2 to the deprecation warning and to the "Both 'proxy' and
'proxy_config'" warning (and any other warnings.warn in this vicinity).

Comment on lines +1069 to +1074
    _UNWANTED_PROPS = {
        'disable_cache' : 'Instead, use cache_mode=CacheMode.DISABLED',
        'bypass_cache' : 'Instead, use cache_mode=CacheMode.BYPASS',
        'no_cache_read' : 'Instead, use cache_mode=CacheMode.WRITE_ONLY',
        'no_cache_write' : 'Instead, use cache_mode=CacheMode.READ_ONLY',
    }

🛠️ Refactor suggestion | 🟠 Major

Annotate _UNWANTED_PROPS as ClassVar (RUF012).

Prevents it from being treated as a mutable instance attribute.

-from typing import Any, Dict, Optional
+from typing import Any, Dict, Optional, ClassVar
@@
-    _UNWANTED_PROPS = {
+    _UNWANTED_PROPS: ClassVar[dict[str, str]] = {
         'disable_cache' : 'Instead, use cache_mode=CacheMode.DISABLED',
         'bypass_cache' : 'Instead, use cache_mode=CacheMode.BYPASS',
         'no_cache_read' : 'Instead, use cache_mode=CacheMode.WRITE_ONLY',
         'no_cache_write' : 'Instead, use cache_mode=CacheMode.READ_ONLY',
     }
🧰 Tools
🪛 Ruff (0.13.1)

1069-1074: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents
In crawl4ai/async_configs.py around lines 1069 to 1074, the _UNWANTED_PROPS dict
is currently an un-annotated class-level mutable which can be treated as an
instance attribute; annotate it as a ClassVar to signal it is a class-level
constant. Import ClassVar from typing (if not already imported) and change the
declaration to annotate _UNWANTED_PROPS as ClassVar[dict[str, str]] (or
ClassVar[Mapping[str, str]]), leaving the literal contents unchanged so it
remains a class-level attribute not a mutable instance field.

Comment on lines +43 to +47
        if isinstance(logger, logging.Logger):
            self.logger = logger
        else:
            # Create a new logger if logger is None, dict, or any other non-Logger type
            self.logger = logging.getLogger(__name__)

⚠️ Potential issue | 🟠 Major

Keep support for LoggerAdapter-style loggers

Before this change we could pass logging.LoggerAdapter (or any duck-typed logger with .info/.debug etc.) and it worked. Now anything that isn’t a concrete logging.Logger silently falls back to logging.getLogger(__name__), dropping adapter context and breaking existing integrations (e.g., structlog wrappers, adapters adding request IDs). Please broaden the check (or use duck typing) instead of hard-rejecting adapters.

-        if isinstance(logger, logging.Logger):
-            self.logger = logger
-        else:
-            # Create a new logger if logger is None, dict, or any other non-Logger type
-            self.logger = logging.getLogger(__name__)
+        if logger is None:
+            self.logger = logging.getLogger(__name__)
+        elif isinstance(logger, (logging.Logger, logging.LoggerAdapter)):
+            self.logger = logger
+        elif hasattr(logger, "info") and hasattr(logger, "debug"):
+            # allow duck-typed loggers (structlog, custom wrappers)
+            self.logger = logger
+        else:
+            self.logger = logging.getLogger(__name__)
🤖 Prompt for AI Agents
In crawl4ai/deep_crawling/bfs_strategy.py around lines 43 to 47, the current
code only accepts concrete logging.Logger instances and replaces anything else
(including logging.LoggerAdapter or duck-typed loggers) with
logging.getLogger(__name__); change the check to accept LoggerAdapter or any
object exposing the standard logging methods instead of hard-rejecting them: if
logger is None create a new logger, otherwise if the passed object has the
typical logging callables (e.g., hasattr(logger, "info") and hasattr(logger,
"debug") and hasattr(logger, "warning")) assign it directly to self.logger; fall
back to logging.getLogger(__name__) only when logger is None or does not expose
the logging methods.

Comment on lines +257 to +264
# Remove any property descriptors that might have been included
# These deprecated properties should not be in the serialized output
for key in ['fit_html', 'fit_markdown', 'markdown_v2']:
if key in result and isinstance(result[key], property):
# del result[key]
# Nasrin: I decided to convert it to string instead of removing it.
result[key] = str(result[key])


⚠️ Potential issue | 🟠 Major

Revert to dropping deprecated property placeholders from serialized payloads.

By converting the property descriptors to strings we now emit values like "<property object at 0x...>" in the API payload. Clients that previously relied on the fields being absent (or actual string content) will now receive meaningless placeholders, breaking backward compatibility and defeating the intent of filtering these deprecated keys. Please keep deleting these entries instead of stringifying them.

Apply this diff to restore the previous behavior:

-        for key in ['fit_html', 'fit_markdown', 'markdown_v2']:
-            if key in result and isinstance(result[key], property):
-                # del result[key]
-                # Nasrin: I decided to convert it to string instead of removing it.
-                result[key] = str(result[key])
+        for key in ['fit_html', 'fit_markdown', 'markdown_v2']:
+            if key in result and isinstance(result[key], property):
+                del result[key]
🤖 Prompt for AI Agents
In crawl4ai/models.py around lines 257 to 264, revert the change that
stringifies deprecated property descriptors; instead remove those keys from the
serialized result so clients don’t receive placeholder strings. For each key in
['fit_html','fit_markdown','markdown_v2'] check if the key exists and if its
value is a property (or property-like placeholder) then delete it from result
(e.g., use del result[key] or result.pop(key, None)) rather than converting to
str; keep the rest of the serialization unchanged.

Comment on lines +45 to +49
    console.print(Panel.fit(
        f"[cyan]POST {endpoint}[/cyan]\n{syntax}",
        title=f"[bold blue]{title}[/]",
        border_style="blue"
    ))

⚠️ Potential issue | 🟡 Minor

Restore Rich syntax rendering for request panels

Embedding the Syntax renderable in the f-string coerces it to its repr, so the panel prints something like <rich.syntax.Syntax object …> instead of the highlighted payload. Let Rich handle the renderable directly.

-    console.print(Panel.fit(
-        f"[cyan]POST {endpoint}[/cyan]\n{syntax}",
-        title=f"[bold blue]{title}[/]",
-        border_style="blue"
-    ))
+    console.print(Panel.fit(
+        syntax,
+        title=f"[bold blue]{title}[/]",
+        border_style="blue",
+        subtitle=f"[cyan]POST {endpoint}[/cyan]",
+        subtitle_align="left",
+    ))
🤖 Prompt for AI Agents
In tests/docker/test_llm_params.py around lines 45-49 the Syntax object is being
embedded in an f-string which coerces it to its repr; instead pass renderables
directly to Panel.fit so Rich can render the Syntax. Replace the f-string usage
with a renderable container (e.g., rich.console.Group or a list/tuple) that
includes the POST header string and the Syntax object, and pass that container
as the first argument to Panel.fit while keeping the title and border_style
unchanged.

Comment on lines +167 to +175
        async with async_client.stream("POST", "/crawl", json=payload) as response:
            assert response.status_code == 200
            assert response.headers["content-type"] == "application/x-ndjson"
            assert response.headers.get("x-stream-status") == "active"

            results = await process_streaming_response(response)

            assert len(results) == 1
            result = results[0]

⚠️ Potential issue | 🟠 Major

Loosen the streaming content-type assertion
httpx will happily surface content-type values like application/x-ndjson; charset=utf-8. With the current exact match, this test will fail even though the server is doing the right thing. Please relax the check (e.g., use .startswith("application/x-ndjson") or parse the media type) so we only fail on genuinely incorrect responses.

-            assert response.headers["content-type"] == "application/x-ndjson"
+            content_type = response.headers.get("content-type", "")
+            assert content_type.startswith("application/x-ndjson")
🤖 Prompt for AI Agents
In tests/docker/test_server_requests.py around lines 167 to 175, the test
asserts an exact content-type match "application/x-ndjson" which fails when the
server returns parameters like a charset; change the assertion to accept media
type variants by checking that
response.headers["content-type"].lower().startswith("application/x-ndjson") (or
parse the media type and compare only the type/subtype) so the test only fails
for genuinely wrong content-types.

Comment on lines +10 to +63
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def test_best_first_strategy():
    """Test BestFirstCrawlingStrategy with keyword scoring"""

    print("=" * 70)
    print("Testing BestFirstCrawlingStrategy with Real URL")
    print("=" * 70)
    print("\nThis test will:")
    print("1. Crawl Python.org documentation")
    print("2. Score pages based on keywords: 'tutorial', 'guide', 'reference'")
    print("3. Show that higher-scoring pages are crawled first")
    print("-" * 70)

    # Create a keyword scorer that prioritizes tutorial/guide pages
    scorer = KeywordRelevanceScorer(
        keywords=["tutorial", "guide", "reference", "documentation"],
        weight=1.0,
        case_sensitive=False
    )

    # Create the strategy with scoring
    strategy = BestFirstCrawlingStrategy(
        max_depth=2,              # Crawl 2 levels deep
        max_pages=10,             # Limit to 10 pages total
        url_scorer=scorer,        # Use keyword scoring
        include_external=False    # Only internal links
    )

    # Configure browser and crawler
    browser_config = BrowserConfig(
        headless=True,    # Run in background
        verbose=False     # Reduce output noise
    )

    crawler_config = CrawlerRunConfig(
        deep_crawl_strategy=strategy,
        verbose=False
    )

    print("\nStarting crawl of https://docs.python.org/3/")
    print("Looking for pages with keywords: tutorial, guide, reference, documentation")
    print("-" * 70)

    crawled_urls = []

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Crawl and collect results
        results = await crawler.arun(
            url="https://docs.python.org/3/",
            config=crawler_config
        )

⚠️ Potential issue | 🟠 Major

Avoid external network calls in the test suite

This test hits https://docs.python.org/3/, so it will flake whenever the network, DNS, or the remote site is slow or offline. Please swap in a local fixture (mock server, recorded response, or static file) or mark the test skipped unless an opt-in flag enables network access.
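
For instance, a minimal opt-in gate could look like this (a sketch only; the ENABLE_NETWORK_TESTS flag name is just an illustration, and the rest of the test body stays unchanged):

    import os
    import pytest

    # Only run network-dependent tests when the caller explicitly opts in.
    requires_network = pytest.mark.skipif(
        not os.getenv("ENABLE_NETWORK_TESTS"),
        reason="set ENABLE_NETWORK_TESTS=1 to run tests that reach external sites",
    )

    @requires_network
    async def test_best_first_strategy():
        ...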

🤖 Prompt for AI Agents
In tests/general/test_bff_scoring.py around lines 10 to 63, the test performs an
external network call to https://docs.python.org/3/ which makes the suite flaky;
replace the external request with a deterministic local fixture or opt-in skip:
either (a) add a pytest fixture that starts a local HTTP test server serving a
small snapshot HTML set and point the crawler to that local URL, or (b)
monkeypatch/fixture the AsyncWebCrawler.arun method to return a canned result
set (or use a recorded response), or (c) mark the test with
pytest.mark.skipif(not os.getenv("ENABLE_NETWORK_TESTS")) so it only runs when
an opt-in environment variable is set. Ensure the test no longer relies on
external DNS/network and update setup/teardown fixtures accordingly.

Comment on lines +94 to +110
    # Check if higher scores appear early in the crawl
    scores = [item['score'] for item in crawled_urls[1:]]  # Skip initial URL
    high_score_indices = [i for i, s in enumerate(scores) if s > 0.3]

    if high_score_indices and high_score_indices[0] < len(scores) / 2:
        print("✅ SUCCESS: Higher-scoring pages (with keywords) were crawled early!")
        print("   This confirms the priority queue fix is working.")
    else:
        print("⚠️  Check the crawl order above - higher scores should appear early")

    # Show score distribution
    print(f"\nScore Statistics:")
    print(f"  - Total pages crawled: {len(crawled_urls)}")
    print(f"  - Average score: {sum(item['score'] for item in crawled_urls) / len(crawled_urls):.2f}")
    print(f"  - Max score: {max(item['score'] for item in crawled_urls):.2f}")
    print(f"  - Pages with keywords: {sum(1 for item in crawled_urls if item['score'] > 0.3)}")


⚠️ Potential issue | 🟠 Major

Add assertions so the test actually enforces behavior

Right now the “analysis” block only prints guidance; the test never fails if the priority queue regresses. Convert those checks into real assertions (or explicit pytest.fail) so a scoring regression fails the build instead of silently passing.

-    if high_score_indices and high_score_indices[0] < len(scores) / 2:
-        print("✅ SUCCESS: Higher-scoring pages (with keywords) were crawled early!")
-        print("   This confirms the priority queue fix is working.")
-    else:
-        print("⚠️  Check the crawl order above - higher scores should appear early")
+    assert scores, "No crawl results recorded; check earlier setup to ensure the strategy ran"
+    assert high_score_indices and high_score_indices[0] < len(scores) / 2, (
+        "Higher-scoring pages should be crawled early; inspect priority queue weighting"
+    )
🧰 Tools
🪛 Ruff (0.13.1)

105-105: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In tests/general/test_bff_scoring.py around lines 94 to 110, the test currently
only prints pass/fail guidance for crawl ordering and score stats, so
regressions won’t fail CI; replace the prints with real test assertions (or
pytest.fail) that enforce behavior: assert that there is at least one
high-scoring page (e.g. any score > 0.3) and that the first high-scoring page
index (relative to scores after skipping initial URL) is < len(scores) / 2; also
assert non-zero total pages before computing averages to avoid ZeroDivisionError
and assert expected ranges for average/max scores if desired, failing explicitly
when conditions aren’t met.

Comment on lines +9 to +36
warnings.simplefilter("always", DeprecationWarning)

proxy_str = "23.95.150.145:6114:username:password"
with warnings.catch_warnings(record=True) as caught:
cfg = BrowserConfig(proxy=proxy_str, headless=True)

dep_warnings = [w for w in caught if issubclass(w.category, DeprecationWarning)]
assert dep_warnings, "Expected DeprecationWarning when using BrowserConfig(proxy=...)"

assert cfg.proxy is None, "cfg.proxy should be None after auto-conversion"
assert isinstance(cfg.proxy_config, ProxyConfig), "cfg.proxy_config should be ProxyConfig instance"
assert cfg.proxy_config.username == "username"
assert cfg.proxy_config.password == "password"
assert cfg.proxy_config.server.startswith("http://")
assert cfg.proxy_config.server.endswith(":6114")


def test_browser_config_with_proxy_config_emits_no_deprecation():
warnings.simplefilter("always", DeprecationWarning)

with warnings.catch_warnings(record=True) as caught:
cfg = BrowserConfig(
headless=True,
proxy_config={
"server": "http://127.0.0.1:8080",
"username": "u",
"password": "p",
},

⚠️ Potential issue | 🟠 Major

Avoid leaking global warning filters
Calling warnings.simplefilter("always", DeprecationWarning) outside the catch_warnings context mutates the global filter list for the entire test run. That means every subsequent test (even in other modules) will start emitting DeprecationWarnings, which is exactly the kind of test cross-talk we try to prevent. Please move the filter inside the catch_warnings context (and do the same in the second test) so the override is scoped to this block.

-    warnings.simplefilter("always", DeprecationWarning)
-
-    proxy_str = "23.95.150.145:6114:username:password"
-    with warnings.catch_warnings(record=True) as caught:
+    proxy_str = "23.95.150.145:6114:username:password"
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always", DeprecationWarning)
         cfg = BrowserConfig(proxy=proxy_str, headless=True)
@@
-    warnings.simplefilter("always", DeprecationWarning)
-
-    with warnings.catch_warnings(record=True) as caught:
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always", DeprecationWarning)
         cfg = BrowserConfig(
🧰 Tools
🪛 Ruff (0.13.1)

21-21: Possible hardcoded password assigned to: "password"

(S105)

🤖 Prompt for AI Agents
In tests/proxy/test_proxy_deprecation.py around lines 9 to 36, the call to
warnings.simplefilter("always", DeprecationWarning) is applied globally before
entering warnings.catch_warnings, which mutates global warning filters; move the
simplefilter call inside each with warnings.catch_warnings(record=True) as
caught: block (for both tests) so the filter is scoped to the context manager,
ensuring you set the filter immediately after entering the catch_warnings block
and before creating the BrowserConfig, and remove the external/global
simplefilter call.

ntohidi and others added 10 commits September 30, 2025 11:54
- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design
- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security
…PI endpoints

- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index
Fix: run_urls() returns None, crashing arun_many()
Marketplace and brand book changes
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🧹 Nitpick comments (6)
deploy/docker/api.py (2)

263-265: Consider parameter naming consistency.

The parameter api_base_url is used here, while most other functions in this file use base_url for the same purpose (lines 124, 191). Consider renaming to base_url for consistency across the API surface, unless there's a specific reason for the different name.

Apply this diff to align the parameter name:

 async def handle_llm_request(
     redis: aioredis.Redis,
     background_tasks: BackgroundTasks,
     request: Request,
     input_path: str,
     query: Optional[str] = None,
     schema: Optional[str] = None,
     cache: str = "0",
     config: Optional[dict] = None,
     provider: Optional[str] = None,
     temperature: Optional[float] = None,
-    api_base_url: Optional[str] = None
+    base_url: Optional[str] = None
 ) -> JSONResponse:

And update the call at line 298:

         return await create_new_task(
             redis,
             background_tasks,
             input_path,
             query,
             schema,
             cache,
             base_url,
             config,
             provider,
             temperature,
-            api_base_url
+            base_url
         )

And update the signature at line 345:

 async def create_new_task(
     redis: aioredis.Redis,
     background_tasks: BackgroundTasks,
     input_path: str,
     query: str,
     schema: Optional[str],
     cache: str,
     base_url: str,
     config: dict,
     provider: Optional[str] = None,
     temperature: Optional[float] = None,
-    api_base_url: Optional[str] = None
+    base_url: Optional[str] = None
 ) -> JSONResponse:

And update the call at line 372:

     background_tasks.add_task(
         process_llm_extraction,
         redis,
         config,
         task_id,
         decoded_url,
         query,
         schema,
         cache,
         provider,
         temperature,
-        api_base_url
+        base_url
     )

692-692: Consider simplifying the timestamp creation.

The pattern datetime.now(timezone.utc).replace(tzinfo=None).isoformat() creates a timezone-aware datetime but then immediately removes the timezone information. This has the same effect as datetime.utcnow().isoformat() but is more verbose.

If the goal is to store a UTC timestamp without timezone information (possibly for Redis compatibility), consider using the simpler pattern for clarity:

-        "created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
+        "created_at": datetime.utcnow().isoformat(),

Alternatively, if you want to preserve timezone information in the stored timestamp, remove the .replace(tzinfo=None) call:

-        "created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
+        "created_at": datetime.now(timezone.utc).isoformat(),
docs/md_v2/marketplace/README.md (1)

25-25: Wrap bare URL to satisfy markdownlint.

markdownlint (MD034) flags the bare http://localhost:8100. Enclose it in angle brackets (<http://…>) so the docs pipeline stays green.

docs/md_v2/marketplace/marketplace.css (1)

1-957: Avoid maintaining two divergent copies of the same stylesheet.

This file is byte-for-byte identical to docs/md_v2/marketplace/frontend/marketplace.css. Two full copies will drift almost immediately and double every future CSS change. Please move the shared rules into a single asset (e.g., keep one file and @import it from the other, or factor common pieces into assets/styles.css) so both contexts stay in sync with one source of truth.

docs/md_v2/marketplace/backend/config.py (1)

48-59: Make ALLOWED_ORIGINS immutable.

ALLOWED_ORIGINS is a mutable list at class scope, so any accidental mutation is shared process-wide and Ruff flags it (RUF012). Switch to an immutable tuple (or annotate with ClassVar) to satisfy the linter and avoid shared-state surprises.

-    ALLOWED_ORIGINS = [
+    ALLOWED_ORIGINS = (
         "http://localhost:8000",
         "http://localhost:8080",
         "http://localhost:8100",
         "http://127.0.0.1:8000",
         "http://127.0.0.1:8080",
         "http://127.0.0.1:8100",
         "https://crawl4ai.com",
         "https://www.crawl4ai.com",
         "https://docs.crawl4ai.com",
         "https://market.crawl4ai.com"
-    ]
+    )

Based on static analysis hints.

docs/md_v2/marketplace/frontend/app-detail.js (1)

147-161: Update proxy example to the new proxy_config API.

v0.7.5 deprecates the proxy= kwarg in favor of proxy_config, but the Proxy Services snippet still shows the old signature. Please refresh the example so the docs reinforce the new structure.

-async with AsyncWebCrawler(proxy=proxy_config) as crawler:
+async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 361499d and 9900f63.

⛔ Files ignored due to path filters (1)
  • docs/md_v2/assets/images/logo.png is excluded by !**/*.png
📒 Files selected for processing (29)
  • .gitignore (2 hunks)
  • crawl4ai/async_dispatcher.py (1 hunks)
  • deploy/docker/api.py (19 hunks)
  • docs/md_v2/assets/page_actions.css (1 hunks)
  • docs/md_v2/assets/page_actions.js (1 hunks)
  • docs/md_v2/branding/index.md (1 hunks)
  • docs/md_v2/marketplace/README.md (1 hunks)
  • docs/md_v2/marketplace/admin/admin.css (1 hunks)
  • docs/md_v2/marketplace/admin/admin.js (1 hunks)
  • docs/md_v2/marketplace/admin/index.html (1 hunks)
  • docs/md_v2/marketplace/app-detail.css (1 hunks)
  • docs/md_v2/marketplace/app-detail.js (1 hunks)
  • docs/md_v2/marketplace/backend/.env.example (1 hunks)
  • docs/md_v2/marketplace/backend/config.py (1 hunks)
  • docs/md_v2/marketplace/backend/database.py (1 hunks)
  • docs/md_v2/marketplace/backend/dummy_data.py (1 hunks)
  • docs/md_v2/marketplace/backend/requirements.txt (1 hunks)
  • docs/md_v2/marketplace/backend/schema.yaml (1 hunks)
  • docs/md_v2/marketplace/backend/server.py (1 hunks)
  • docs/md_v2/marketplace/frontend/app-detail.css (1 hunks)
  • docs/md_v2/marketplace/frontend/app-detail.html (1 hunks)
  • docs/md_v2/marketplace/frontend/app-detail.js (1 hunks)
  • docs/md_v2/marketplace/frontend/index.html (1 hunks)
  • docs/md_v2/marketplace/frontend/marketplace.css (1 hunks)
  • docs/md_v2/marketplace/frontend/marketplace.js (1 hunks)
  • docs/md_v2/marketplace/index.html (1 hunks)
  • docs/md_v2/marketplace/marketplace.css (1 hunks)
  • docs/md_v2/marketplace/marketplace.js (1 hunks)
  • mkdocs.yml (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/md_v2/marketplace/backend/schema.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitignore
🧰 Additional context used
🧬 Code graph analysis (9)
docs/md_v2/marketplace/backend/dummy_data.py (1)
docs/md_v2/marketplace/backend/database.py (1)
  • DatabaseManager (7-117)
docs/md_v2/marketplace/frontend/app-detail.js (4)
docs/md_v2/marketplace/admin/admin.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/marketplace.js (4)
docs/md_v2/marketplace/admin/admin.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (3)
  • API_BASE (2-2)
  • CACHE_TTL (3-3)
  • marketplace (392-392)
docs/md_v2/marketplace/app-detail.js (4)
docs/md_v2/marketplace/admin/admin.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/admin/admin.js (1)
docs/md_v2/marketplace/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (3)
docs/md_v2/marketplace/admin/admin.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
deploy/docker/api.py (4)
deploy/docker/utils.py (4)
  • validate_llm_provider (92-108)
  • get_llm_temperature (111-145)
  • get_llm_base_url (148-171)
  • get_llm_api_key (74-89)
crawl4ai/async_configs.py (1)
  • LLMConfig (1703-1785)
deploy/docker/hook_manager.py (3)
  • attach_user_hooks_to_crawler (455-512)
  • UserHookManager (15-270)
  • get_summary (256-270)
crawl4ai/models.py (1)
  • model_dump (240-268)
docs/md_v2/marketplace/backend/server.py (2)
docs/md_v2/marketplace/backend/database.py (3)
  • DatabaseManager (7-117)
  • get_all (80-89)
  • search (91-113)
docs/md_v2/marketplace/backend/config.py (1)
  • Config (30-59)
docs/md_v2/marketplace/backend/database.py (1)
docs/md_v2/marketplace/backend/server.py (1)
  • search (146-164)
🪛 dotenv-linter (3.3.0)
docs/md_v2/marketplace/backend/.env.example

[warning] 14-14: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)

🪛 markdownlint-cli2 (0.18.1)
docs/md_v2/branding/index.md

755-755: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1046-1046: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1057-1057: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1068-1068: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1218-1218: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1230-1230: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


1248-1248: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)

docs/md_v2/marketplace/README.md

25-25: Bare URL used

(MD034, no-bare-urls)

🪛 Ruff (0.13.2)
docs/md_v2/marketplace/backend/config.py

48-59: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

docs/md_v2/marketplace/backend/dummy_data.py

17-17: Possible SQL injection vector through string-based query construction

(S608)


136-136: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


136-136: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


137-137: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


137-137: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


149-149: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


230-230: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


247-247: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


248-248: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

deploy/docker/api.py

544-544: Do not catch blind exception: Exception

(BLE001)


545-545: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


576-576: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


584-584: Consider moving this statement to an else block

(TRY300)


659-659: Consider moving this statement to an else block

(TRY300)

docs/md_v2/marketplace/backend/server.py

179-179: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


239-239: Possible SQL injection vector through string-based query construction

(S608)


242-242: Consider moving this statement to an else block

(TRY300)


243-243: Do not catch blind exception: Exception

(BLE001)


244-244: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


257-257: Possible SQL injection vector through string-based query construction

(S608)


258-258: Consider [*list(app_data.values()), app_id] instead of concatenation

Replace with [*list(app_data.values()), app_id]

(RUF005)


260-260: Consider moving this statement to an else block

(TRY300)


261-261: Do not catch blind exception: Exception

(BLE001)


262-262: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


284-284: Possible SQL injection vector through string-based query construction

(S608)


287-287: Consider moving this statement to an else block

(TRY300)


288-288: Do not catch blind exception: Exception

(BLE001)


289-289: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


301-301: Possible SQL injection vector through string-based query construction

(S608)


302-302: Consider [*list(article_data.values()), article_id] instead of concatenation

Replace with [*list(article_data.values()), article_id]

(RUF005)


304-304: Consider moving this statement to an else block

(TRY300)


305-305: Do not catch blind exception: Exception

(BLE001)


306-306: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


324-324: Possible SQL injection vector through string-based query construction

(S608)


327-327: Consider moving this statement to an else block

(TRY300)


328-328: Do not catch blind exception: Exception

(BLE001)


329-329: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


337-337: Possible SQL injection vector through string-based query construction

(S608)


338-338: Consider [*list(category_data.values()), cat_id] instead of concatenation

Replace with [*list(category_data.values()), cat_id]

(RUF005)


340-340: Consider moving this statement to an else block

(TRY300)


341-341: Do not catch blind exception: Exception

(BLE001)


342-342: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


352-352: Possible SQL injection vector through string-based query construction

(S608)


355-355: Consider moving this statement to an else block

(TRY300)


356-356: Do not catch blind exception: Exception

(BLE001)


357-357: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


365-365: Possible SQL injection vector through string-based query construction

(S608)


366-366: Consider [*list(sponsor_data.values()), sponsor_id] instead of concatenation

Replace with [*list(sponsor_data.values()), sponsor_id]

(RUF005)


368-368: Consider moving this statement to an else block

(TRY300)


369-369: Do not catch blind exception: Exception

(BLE001)


370-370: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

docs/md_v2/marketplace/backend/database.py

31-31: f-string without any placeholders

Remove extraneous f prefix

(F541)


80-80: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


82-82: Possible SQL injection vector through string-based query construction

(S608)


91-91: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


108-108: Possible SQL injection vector through string-based query construction

(S608)

crawl4ai/async_dispatcher.py

468-468: return inside finally blocks cause exceptions to be silenced

(B012)

🔇 Additional comments (18)
deploy/docker/api.py (18)

7-7: LGTM!

Adding timezone support for task creation timestamps is a good practice for consistent time handling across deployments.


45-47: LGTM!

The new utility functions properly centralize LLM configuration logic with provider-specific environment variable resolution.


101-103: LGTM!

The migration to centralized helper functions is correct, and the approach of returning None to let litellm handle defaults is appropriate.


122-124: LGTM!

Adding per-request LLM configuration parameters aligns with the PR objectives for enhanced LLM integration with multi-provider support.


136-142: LGTM!

The configuration priority (request parameters → provider-specific env vars → global defaults) is correctly implemented and aligns with the multi-provider environment variable support described in PR objectives.


189-191: LGTM!

Consistent parameter additions for LLM configuration propagation across the API surface.


216-218: LGTM!

LLM configuration follows the established pattern consistently.


416-418: LGTM!

Proper serialization guard for fit_html. The defensive check prevents serialization errors while maintaining backward compatibility by setting non-serializable values to None.


446-447: LGTM!

Adding hooks support to the crawl request handler aligns with the Docker Hooks System feature described in PR objectives.


473-485: LGTM!

Hooks attachment logic is well-structured with proper error isolation and configurable timeouts. The local imports keep the dependency optional.


488-493: LGTM!

The improved base config merging logic now correctly handles both None and empty string cases, ensuring user-provided values always take precedence.


503-505: LGTM!

Normalizing results to always be a list improves API contract consistency and simplifies downstream processing.
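In practice this normalization is a one-liner, sketched here with a hypothetical helper name:

def normalize_results(results):
    # Wrap a single result so downstream code can always iterate over a list.
    return results if isinstance(results, list) else [results]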


520-550: LGTM!

The enhanced result processing significantly improves robustness by handling multiple result types and serialization edge cases. The broad exception catching (flagged by static analysis at line 544) is appropriate here for defensive error handling to maintain API stability.


535-537: LGTM!

Consistent serialization handling for fit_html across streaming and non-streaming code paths.


560-583: LGTM!

The hooks information integration is well-designed with defensive serialization testing and graceful error handling. The static analysis suggestion (TRY300 at line 584) about moving the return to an else block is a minor style preference that doesn't affect correctness.


639-651: LGTM!

Hooks attachment for streaming follows the same pattern as non-streaming requests, ensuring consistency across both code paths.


659-659: LGTM!

The return statement correctly includes hooks_info to match the updated function signature. The static analysis suggestion (TRY300) is a minor style preference.


614-615: All callers handle the new 3-tuple return signature correctly.

            memory_monitor.cancel()
            if self.monitor:
                self.monitor.stop()
            return results

⚠️ Potential issue | 🟡 Minor

Return in finally block can silence exceptions during cleanup.

Moving the return statement into the finally block guarantees a single exit point but introduces a risk: if an exception occurs during cleanup (e.g., memory_monitor.cancel() at line 465 or self.monitor.stop() at line 467), it will be silenced by the return statement. Additionally, the current exception handling at line 459 catches but does not re-raise exceptions, meaning errors in the main processing loop are already swallowed.

While this pattern may be intentional for graceful degradation (collecting partial results even on errors), it can make debugging difficult if cleanup failures go unnoticed.
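To make the hazard concrete, a tiny standalone example (unrelated to the Crawl4AI code) shows how a return inside finally swallows an in-flight exception:

def risky():
    try:
        raise RuntimeError("boom")
    finally:
        return "ok"   # the RuntimeError is discarded right here (Ruff B012)

print(risky())  # prints "ok"; the exception never propagates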

Consider one of these approaches:

  1. Preferred: Move the return outside the finally block and explicitly handle the return in both success and exception paths:
         except Exception as e:
             if self.monitor:
                 self.monitor.update_memory_status(f"QUEUE_ERROR: {str(e)}")                
         
         finally:
             # Clean up
             memory_monitor.cancel()
             if self.monitor:
                 self.monitor.stop()
-            return results
+        
+        return results
  2. Alternative: If you intend to suppress cleanup exceptions, document this behavior with a comment explaining why exceptions are swallowed and that partial results are intentionally returned.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return results
except Exception as e:
if self.monitor:
self.monitor.update_memory_status(f"QUEUE_ERROR: {str(e)}")
finally:
# Clean up
memory_monitor.cancel()
if self.monitor:
self.monitor.stop()
return results
🧰 Tools
🪛 Ruff (0.13.2)

468-468: return inside finally blocks cause exceptions to be silenced

(B012)

🤖 Prompt for AI Agents
In crawl4ai/async_dispatcher.py around line 468, the function returns from
inside the finally block which can silently swallow exceptions raised during
cleanup (memory_monitor.cancel() at ~465 or self.monitor.stop() at ~467) and
exceptions from the main processing loop are already being caught without
re-raising; move the return statement out of the finally block so the function
explicitly returns in the normal-success path and in the except path (or
re-raise after logging) instead of from finally, and ensure cleanup exceptions
are either logged and re-raised or explicitly documented if intentionally
suppressed (add a comment explaining suppression behavior if you choose to
swallow cleanup errors).

Comment on lines +317 to +339
.page-actions-overlay {
display: none;
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background: rgba(0, 0, 0, 0.5);
z-index: 998;
opacity: 0;
transition: opacity 0.2s ease;
}

.page-actions-overlay.active {
display: block;
opacity: 1;
}

@media (max-width: 768px) {
.page-actions-overlay {
display: block;
}
}

⚠️ Potential issue | 🟠 Major

Prevent the overlay from blocking all taps on mobile

On mobile, the media query forces .page-actions-overlay to display: block even when it’s not active. Because pointer events stay enabled, the invisible overlay sits above the page and swallows every tap, so users can’t interact with underlying content. Please disable pointer events while the overlay is inactive and re-enable them only for the active state.

 .page-actions-overlay {
     display: none;
     position: fixed;
     top: 0;
     left: 0;
     right: 0;
     bottom: 0;
     background: rgba(0, 0, 0, 0.5);
     z-index: 998;
     opacity: 0;
     transition: opacity 0.2s ease;
+    pointer-events: none;
 }
 
 .page-actions-overlay.active {
     display: block;
     opacity: 1;
+    pointer-events: auto;
 }
🤖 Prompt for AI Agents
In docs/md_v2/assets/page_actions.css around lines 317 to 339, the overlay is
forced to display:block on small screens but still blocks taps when inactive;
update the CSS so the default/inactive .page-actions-overlay has pointer-events:
none (so it won’t swallow touches) and .page-actions-overlay.active has
pointer-events: auto to restore interaction, making sure the media query does
not override the inactive pointer-events setting.

Comment on lines 146 to 163
async apiCall(endpoint, options = {}) {
const response = await fetch(`${API_BASE}${endpoint}`, {
...options,
headers: {
'Authorization': `Bearer ${this.token}`,
'Content-Type': 'application/json',
...options.headers
}
});

if (response.status === 401) {
this.logout();
throw { status: 401 };
}

if (!response.ok) throw new Error(`API Error: ${response.status}`);
return response.json();
}

⚠️ Potential issue | 🟠 Major

Handle 204/empty responses before calling .json().

apiCall unconditionally parses JSON; any admin DELETE (or other 204 responses) will throw Unexpected end of JSON input, preventing deletions. Guard against empty bodies before calling response.json().

-        if (!response.ok) throw new Error(`API Error: ${response.status}`);
-        return response.json();
+        if (!response.ok) throw new Error(`API Error: ${response.status}`);
+
+        if (response.status === 204) {
+            return null;
+        }
+
+        const text = await response.text();
+        return text ? JSON.parse(text) : null;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async apiCall(endpoint, options = {}) {
const response = await fetch(`${API_BASE}${endpoint}`, {
...options,
headers: {
'Authorization': `Bearer ${this.token}`,
'Content-Type': 'application/json',
...options.headers
}
});
if (response.status === 401) {
this.logout();
throw { status: 401 };
}
if (!response.ok) throw new Error(`API Error: ${response.status}`);
return response.json();
}
async apiCall(endpoint, options = {}) {
const response = await fetch(`${API_BASE}${endpoint}`, {
...options,
headers: {
'Authorization': `Bearer ${this.token}`,
'Content-Type': 'application/json',
...options.headers
}
});
if (response.status === 401) {
this.logout();
throw { status: 401 };
}
if (!response.ok) throw new Error(`API Error: ${response.status}`);
if (response.status === 204) {
return null;
}
const text = await response.text();
return text ? JSON.parse(text) : null;
}
🤖 Prompt for AI Agents
In docs/md_v2/marketplace/admin/admin.js around lines 146 to 163, apiCall
currently always calls response.json(), which fails on 204/empty responses;
update it to detect empty responses before parsing: after checking response.ok,
if response.status === 204 or response.headers.get('Content-Length') === '0' (or
the content-type header is missing/non-JSON) return null (or an appropriate
empty value) instead of calling response.json(); otherwise call and return
response.json(); keep existing 401 handling and error throw behavior.

Comment on lines +16 to +26
async init() {
if (!this.appSlug) {
window.location.href = 'index.html';
return;
}

await this.loadAppDetails();
this.setupEventListeners();
await this.loadRelatedApps();
}


⚠️ Potential issue | 🔴 Critical

Stop init flow when app data fails to load.

If loadAppDetails() cannot resolve an app, it redirects but leaves this.appData as null. Execution continues and loadRelatedApps() immediately dereferences this.appData.category, throwing TypeError: Cannot read properties of null. Add an early return after loading (and/or guard inside loadRelatedApps) so the rest of the pipeline only runs when this.appData is populated.

     async init() {
         if (!this.appSlug) {
             window.location.href = 'index.html';
             return;
         }

-        await this.loadAppDetails();
+        await this.loadAppDetails();
+        if (!this.appData) {
+            return;
+        }
         this.setupEventListeners();
         await this.loadRelatedApps();
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async init() {
if (!this.appSlug) {
window.location.href = 'index.html';
return;
}
await this.loadAppDetails();
this.setupEventListeners();
await this.loadRelatedApps();
}
async init() {
if (!this.appSlug) {
window.location.href = 'index.html';
return;
}
await this.loadAppDetails();
if (!this.appData) {
return;
}
this.setupEventListeners();
await this.loadRelatedApps();
}
🤖 Prompt for AI Agents
In docs/md_v2/marketplace/app-detail.js around lines 16 to 26, the init flow
continues after loadAppDetails even when this.appData is null, causing
loadRelatedApps to dereference this.appData.category and throw; after awaiting
this.loadAppDetails() add a guard that returns early if this.appData is falsy
(or alternatively update loadRelatedApps to first check for this.appData and
return/no-op if missing) so setupEventListeners and loadRelatedApps only run
when this.appData is populated.

Comment on lines +8 to +26
def __init__(self, db_path=None, schema_path='schema.yaml'):
self.schema = self._load_schema(schema_path)
# Use provided path or fallback to schema default
self.db_path = db_path or self.schema['database']['name']
self.conn = None
self._init_database()

def _load_schema(self, path: str) -> Dict:
with open(path, 'r') as f:
return yaml.safe_load(f)

def _init_database(self):
"""Auto-create/migrate database from schema"""
self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
self.conn.row_factory = sqlite3.Row

for table_name, table_def in self.schema['tables'].items():
self._create_or_update_table(table_name, table_def['columns'])


⚠️ Potential issue | 🔴 Critical

Make schema and database paths location-agnostic

DatabaseManager() uses whatever the current working directory happens to be. Running python docs/md_v2/marketplace/backend/dummy_data.py from the project root raises FileNotFoundError: 'schema.yaml', so the brand-new seeder/demo can’t even start. Resolve both paths relative to this module so callers (including FastAPI and the CLI seeder) don’t have to cd into the backend folder first.

     def __init__(self, db_path=None, schema_path='schema.yaml'):
-        self.schema = self._load_schema(schema_path)
-        # Use provided path or fallback to schema default
-        self.db_path = db_path or self.schema['database']['name']
+        schema_path = Path(schema_path)
+        if not schema_path.is_absolute():
+            schema_path = Path(__file__).resolve().parent / schema_path
+        self.schema = self._load_schema(schema_path)
+
+        default_db = db_path or self.schema['database']['name']
+        db_path = Path(default_db)
+        if not db_path.is_absolute():
+            db_path = schema_path.parent / db_path
+        db_path.parent.mkdir(parents=True, exist_ok=True)
+        self.db_path = str(db_path)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def __init__(self, db_path=None, schema_path='schema.yaml'):
        self.schema = self._load_schema(schema_path)
        # Use provided path or fallback to schema default
        self.db_path = db_path or self.schema['database']['name']
        self.conn = None
        self._init_database()

    def _load_schema(self, path: str) -> Dict:
        with open(path, 'r') as f:
            return yaml.safe_load(f)

    def _init_database(self):
        """Auto-create/migrate database from schema"""
        self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
        self.conn.row_factory = sqlite3.Row

        for table_name, table_def in self.schema['tables'].items():
            self._create_or_update_table(table_name, table_def['columns'])

    def __init__(self, db_path=None, schema_path='schema.yaml'):
        # Resolve schema_path relative to this module if not absolute
        schema_path = Path(schema_path)
        if not schema_path.is_absolute():
            schema_path = Path(__file__).resolve().parent / schema_path
        self.schema = self._load_schema(schema_path)
        # Determine db_path (override or from schema) and resolve it
        default_db = db_path or self.schema['database']['name']
        db_path = Path(default_db)
        if not db_path.is_absolute():
            db_path = schema_path.parent / db_path
        # Ensure parent dirs exist
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self.db_path = str(db_path)
        self.conn = None
        self._init_database()
🤖 Prompt for AI Agents
In docs/md_v2/marketplace/backend/database.py around lines 8 to 26, the class
currently resolves schema and database files relative to the current working
directory which causes FileNotFoundError when run from project root; change
resolution to be module-location-agnostic by resolving schema_path against the
module directory (e.g. Path(__file__).parent / schema_path) before opening it,
and if db_path is not provided, resolve the schema's database name relative to
the same module directory; ensure the code uses the resolved Path (or its
string) when opening the schema and when connecting with sqlite3 so absolute
paths from callers remain untouched and relative names become relative to the
backend module.

Comment on lines 56 to 143
where_clauses = []
if category:
where_clauses.append(f"category = '{category}'")
if type:
where_clauses.append(f"type = '{type}'")
if featured is not None:
where_clauses.append(f"featured = {1 if featured else 0}")
if sponsored is not None:
where_clauses.append(f"sponsored = {1 if sponsored else 0}")

where = " AND ".join(where_clauses) if where_clauses else None
apps = db.get_all('apps', limit=limit, offset=offset, where=where)

# Parse JSON fields
for app in apps:
if app.get('screenshots'):
app['screenshots'] = json.loads(app['screenshots'])

return json_response(apps)

@app.get("/api/apps/{slug}")
async def get_app(slug: str):
"""Get single app by slug"""
apps = db.get_all('apps', where=f"slug = '{slug}'", limit=1)
if not apps:
raise HTTPException(status_code=404, detail="App not found")

app = apps[0]
if app.get('screenshots'):
app['screenshots'] = json.loads(app['screenshots'])

return json_response(app)

@app.get("/api/articles")
async def get_articles(
category: Optional[str] = None,
limit: int = Query(default=20, le=10000),
offset: int = Query(default=0)
):
"""Get articles with optional category filter"""
where = f"category = '{category}'" if category else None
articles = db.get_all('articles', limit=limit, offset=offset, where=where)

# Parse JSON fields
for article in articles:
if article.get('related_apps'):
article['related_apps'] = json.loads(article['related_apps'])
if article.get('tags'):
article['tags'] = json.loads(article['tags'])

return json_response(articles)

@app.get("/api/articles/{slug}")
async def get_article(slug: str):
"""Get single article by slug"""
articles = db.get_all('articles', where=f"slug = '{slug}'", limit=1)
if not articles:
raise HTTPException(status_code=404, detail="Article not found")

article = articles[0]
if article.get('related_apps'):
article['related_apps'] = json.loads(article['related_apps'])
if article.get('tags'):
article['tags'] = json.loads(article['tags'])

return json_response(article)

@app.get("/api/categories")
async def get_categories():
"""Get all categories ordered by index"""
categories = db.get_all('categories', limit=50)
categories.sort(key=lambda x: x.get('order_index', 0))
return json_response(categories, cache_time=7200)

@app.get("/api/sponsors")
async def get_sponsors(active: Optional[bool] = True):
"""Get sponsors, default active only"""
where = f"active = {1 if active else 0}" if active is not None else None
sponsors = db.get_all('sponsors', where=where, limit=20)

# Filter by date if active
if active:
now = datetime.now().isoformat()
sponsors = [s for s in sponsors
if (not s.get('start_date') or s['start_date'] <= now) and
(not s.get('end_date') or s['end_date'] >= now)]

return json_response(sponsors)

⚠️ Potential issue | 🔴 Critical

Stop interpolating request data into SQL

get_apps, get_app, get_articles, get_article, and get_sponsors build the WHERE clause by f-stringing user-supplied values directly into SQL (category, type, slug, etc.). That enables trivial SQL injection on every public endpoint. Please switch to parameterized queries (e.g., add a params argument to DatabaseManager.get_all and bind values with ? placeholders) before releasing.

🤖 Prompt for AI Agents
In docs/md_v2/marketplace/backend/server.py around lines 56-143 the handlers
build SQL WHERE clauses by interpolating request values (category, type, slug,
featured, sponsored, active) with f-strings which allows SQL injection; change
all callers to use parameterized queries by updating DatabaseManager.get_all to
accept a params list and use ? placeholders in the WHERE clause instead of
string interpolation, then in each endpoint build the WHERE string with
placeholders (e.g. "category = ?" etc.) and pass the corresponding values in
order (convert booleans to 1/0 or use proper typed params), ensure slug/other
values are passed as params rather than concatenated, and adjust any
JSON/boolean handling to remain the same while ensuring no user data is directly
injected into SQL.

Comment on lines +214 to +275
async loadMainContent() {
// Load apps column
const apps = await this.api.getApps({ limit: 8 });
if (apps && apps.length) {
const appsGrid = document.getElementById('apps-grid');
appsGrid.innerHTML = apps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>★ ${app.rating}/5</span>
</div>
<div class="app-compact-title">${app.name}</div>
<div class="app-compact-desc">${app.description}</div>
</div>
`).join('');
}

// Load articles column
const articles = await this.api.getArticles({ limit: 6 });
if (articles && articles.length) {
const articlesList = document.getElementById('articles-list');
articlesList.innerHTML = articles.map(article => `
<div class="article-compact" onclick="marketplace.showArticle('${article.id}')">
<div class="article-meta">
<span>${article.category}</span> · <span>${new Date(article.published_at).toLocaleDateString()}</span>
</div>
<div class="article-title">${article.title}</div>
<div class="article-author">by ${article.author}</div>
</div>
`).join('');
}

// Load trending
if (apps && apps.length) {
const trending = apps.slice(0, 5);
const trendingList = document.getElementById('trending-list');
trendingList.innerHTML = trending.map((app, i) => `
<div class="trending-item" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="trending-rank">${i + 1}</div>
<div class="trending-info">
<div class="trending-name">${app.name}</div>
<div class="trending-stats">${app.downloads} downloads</div>
</div>
</div>
`).join('');
}

// Load more apps grid
const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
if (moreApps && moreApps.length) {
const moreGrid = document.getElementById('more-apps-grid');
moreGrid.innerHTML = moreApps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>${app.type}</span>
</div>
<div class="app-compact-title">${app.name}</div>
</div>
`).join('');
}
}

⚠️ Potential issue | 🟠 Major

Wire category/type filters into the data fetches.

filterByCategory and the type dropdown update this.currentCategory/this.currentType, but loadMainContent (and loadMoreApps) still call /apps with only limit/offset. The UI shows active filters, yet the results never change. Please pass the selected filters into every apps fetch so the listings honor user choices.

-        const apps = await this.api.getApps({ limit: 8 });
+        const baseParams = {
+            limit: 8,
+            ...(this.currentCategory && this.currentCategory !== 'all' ? { category: this.currentCategory } : {}),
+            ...(this.currentType ? { type: this.currentType } : {}),
+        };
+        const apps = await this.api.getApps(baseParams);
...
-        const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
+        const moreApps = await this.api.getApps({
+            offset: 8,
+            limit: 12,
+            ...(this.currentCategory && this.currentCategory !== 'all' ? { category: this.currentCategory } : {}),
+            ...(this.currentType ? { type: this.currentType } : {}),
+        });

Be sure to apply the same filtered params in loadMoreApps() as well.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async loadMainContent() {
// Load apps column
const apps = await this.api.getApps({ limit: 8 });
if (apps && apps.length) {
const appsGrid = document.getElementById('apps-grid');
appsGrid.innerHTML = apps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>★ ${app.rating}/5</span>
</div>
<div class="app-compact-title">${app.name}</div>
<div class="app-compact-desc">${app.description}</div>
</div>
`).join('');
}
// Load articles column
const articles = await this.api.getArticles({ limit: 6 });
if (articles && articles.length) {
const articlesList = document.getElementById('articles-list');
articlesList.innerHTML = articles.map(article => `
<div class="article-compact" onclick="marketplace.showArticle('${article.id}')">
<div class="article-meta">
<span>${article.category}</span> · <span>${new Date(article.published_at).toLocaleDateString()}</span>
</div>
<div class="article-title">${article.title}</div>
<div class="article-author">by ${article.author}</div>
</div>
`).join('');
}
// Load trending
if (apps && apps.length) {
const trending = apps.slice(0, 5);
const trendingList = document.getElementById('trending-list');
trendingList.innerHTML = trending.map((app, i) => `
<div class="trending-item" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="trending-rank">${i + 1}</div>
<div class="trending-info">
<div class="trending-name">${app.name}</div>
<div class="trending-stats">${app.downloads} downloads</div>
</div>
</div>
`).join('');
}
// Load more apps grid
const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
if (moreApps && moreApps.length) {
const moreGrid = document.getElementById('more-apps-grid');
moreGrid.innerHTML = moreApps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>${app.type}</span>
</div>
<div class="app-compact-title">${app.name}</div>
</div>
`).join('');
}
}
async loadMainContent() {
// Load apps column
const baseParams = {
limit: 8,
...(this.currentCategory && this.currentCategory !== 'all' ? { category: this.currentCategory } : {}),
...(this.currentType ? { type: this.currentType } : {}),
};
const apps = await this.api.getApps(baseParams);
if (apps && apps.length) {
const appsGrid = document.getElementById('apps-grid');
appsGrid.innerHTML = apps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>★ ${app.rating}/5</span>
</div>
<div class="app-compact-title">${app.name}</div>
<div class="app-compact-desc">${app.description}</div>
</div>
`).join('');
}
// Load articles column
const articles = await this.api.getArticles({ limit: 6 });
if (articles && articles.length) {
const articlesList = document.getElementById('articles-list');
articlesList.innerHTML = articles.map(article => `
<div class="article-compact" onclick="marketplace.showArticle('${article.id}')">
<div class="article-meta">
<span>${article.category}</span> · <span>${new Date(article.published_at).toLocaleDateString()}</span>
</div>
<div class="article-title">${article.title}</div>
<div class="article-author">by ${article.author}</div>
</div>
`).join('');
}
// Load trending
if (apps && apps.length) {
const trending = apps.slice(0, 5);
const trendingList = document.getElementById('trending-list');
trendingList.innerHTML = trending.map((app, i) => `
<div class="trending-item" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="trending-rank">${i + 1}</div>
<div class="trending-info">
<div class="trending-name">${app.name}</div>
<div class="trending-stats">${app.downloads} downloads</div>
</div>
</div>
`).join('');
}
// Load more apps grid
const moreApps = await this.api.getApps({
offset: 8,
limit: 12,
...(this.currentCategory && this.currentCategory !== 'all' ? { category: this.currentCategory } : {}),
...(this.currentType ? { type: this.currentType } : {}),
});
if (moreApps && moreApps.length) {
const moreGrid = document.getElementById('more-apps-grid');
moreGrid.innerHTML = moreApps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>${app.type}</span>
</div>
<div class="app-compact-title">${app.name}</div>
</div>
`).join('');
}
}
🤖 Prompt for AI Agents
In docs/md_v2/marketplace/frontend/marketplace.js around lines 214 to 275,
loadMainContent is calling this.api.getApps without passing the current filters
so UI-selected category/type are ignored; update every this.api.getApps call in
this file (including the later loadMoreApps) to include filter params built from
this.currentCategory and this.currentType (e.g. add category and type fields to
the options object only when they are set) so that the initial apps, trending
slice, and the more-apps fetch all respect the active filters.

Comment on lines 145 to 211
const imageUrl = hero.image || '';
heroCard.innerHTML = `
<div class="hero-image" ${imageUrl ? `style="background-image: url('${imageUrl}')"` : ''}>
${!imageUrl ? `[${hero.category || 'APP'}]` : ''}
</div>
<div class="hero-content">
<span class="hero-badge">${hero.type || 'PAID'}</span>
<h2 class="hero-title">${hero.name}</h2>
<p class="hero-description">${hero.description}</p>
<div class="hero-meta">
<span>★ ${hero.rating || 0}/5</span>
<span>${hero.downloads || 0} downloads</span>
</div>
</div>
`;
heroCard.onclick = () => this.showAppDetail(hero);
}

// Secondary featured cards
const secondary = document.getElementById('featured-secondary');
secondary.innerHTML = '';
if (featured.length > 1) {
featured.slice(1, 4).forEach(app => {
const card = document.createElement('div');
card.className = 'secondary-card';
const imageUrl = app.image || '';
card.innerHTML = `
<div class="secondary-image" ${imageUrl ? `style="background-image: url('${imageUrl}')"` : ''}>
${!imageUrl ? `[${app.category || 'APP'}]` : ''}
</div>
<div class="secondary-content">
<h3 class="secondary-title">${app.name}</h3>
<p class="secondary-desc">${(app.description || '').substring(0, 100)}...</p>
<div class="secondary-meta">
<span>${app.type || 'Open Source'}</span> · <span>★ ${app.rating || 0}/5</span>
</div>
</div>
`;
card.onclick = () => this.showAppDetail(app);
secondary.appendChild(card);
});
}
}

async loadSponsors() {
const sponsors = await this.api.getSponsors();
if (!sponsors || !sponsors.length) {
// Show placeholder if no sponsors
const container = document.getElementById('sponsored-content');
container.innerHTML = `
<div class="sponsor-card">
<h4>Become a Sponsor</h4>
<p>Reach thousands of developers using Crawl4AI</p>
<a href="mailto:[email protected]">Contact Us →</a>
</div>
`;
return;
}

const container = document.getElementById('sponsored-content');
container.innerHTML = sponsors.slice(0, 5).map(sponsor => `
<div class="sponsor-card">
<h4>${sponsor.company_name}</h4>
<p>${sponsor.tier} Sponsor - Premium Solutions</p>
<a href="${sponsor.landing_url}" target="_blank">Learn More →</a>
</div>
`).join('');

⚠️ Potential issue | 🔴 Critical

Sanitize API data before injecting with innerHTML

Multiple sections write API-provided strings (e.g., hero.description, app.description, article titles) straight into innerHTML. If any marketplace entry contains <script> or similar markup, this becomes an XSS vector. Please render these fields with textContent/DOM builders or run them through a trusted sanitizer before inserting into the DOM.

Also applies to: 214-259, 319-355

Comment on lines +214 to +275
async loadMainContent() {
// Load apps column
const apps = await this.api.getApps({ limit: 8 });
if (apps && apps.length) {
const appsGrid = document.getElementById('apps-grid');
appsGrid.innerHTML = apps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>★ ${app.rating}/5</span>
</div>
<div class="app-compact-title">${app.name}</div>
<div class="app-compact-desc">${app.description}</div>
</div>
`).join('');
}

// Load articles column
const articles = await this.api.getArticles({ limit: 6 });
if (articles && articles.length) {
const articlesList = document.getElementById('articles-list');
articlesList.innerHTML = articles.map(article => `
<div class="article-compact" onclick="marketplace.showArticle('${article.id}')">
<div class="article-meta">
<span>${article.category}</span> · <span>${new Date(article.published_at).toLocaleDateString()}</span>
</div>
<div class="article-title">${article.title}</div>
<div class="article-author">by ${article.author}</div>
</div>
`).join('');
}

// Load trending
if (apps && apps.length) {
const trending = apps.slice(0, 5);
const trendingList = document.getElementById('trending-list');
trendingList.innerHTML = trending.map((app, i) => `
<div class="trending-item" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="trending-rank">${i + 1}</div>
<div class="trending-info">
<div class="trending-name">${app.name}</div>
<div class="trending-stats">${app.downloads} downloads</div>
</div>
</div>
`).join('');
}

// Load more apps grid
const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
if (moreApps && moreApps.length) {
const moreGrid = document.getElementById('more-apps-grid');
moreGrid.innerHTML = moreApps.map(app => `
<div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
<div class="app-compact-header">
<span>${app.category}</span>
<span>${app.type}</span>
</div>
<div class="app-compact-title">${app.name}</div>
</div>
`).join('');
}
}

⚠️ Potential issue | 🟠 Major

Category/type filters never applied

loadMainContent and loadMoreApps ignore this.currentCategory/this.currentType, so clicking any filter or changing the type dropdown does nothing—the UI always shows the unfiltered feed. Please pass the active filters into every getApps call (initial grid, “more apps”, and load-more pagination) and reset paging counters when the filter changes.

-        const apps = await this.api.getApps({ limit: 8 });
+        const appParams = { limit: 8 };
+        if (this.currentCategory && this.currentCategory !== 'all') {
+            appParams.category = this.currentCategory;
+        }
+        if (this.currentType) {
+            appParams.type = this.currentType;
+        }
+        const apps = await this.api.getApps(appParams);

-        const moreApps = await this.api.getApps({ offset: 8, limit: 12 });
+        const moreAppsParams = { offset: 8, limit: 12 };
+        if (this.currentCategory && this.currentCategory !== 'all') {
+            moreAppsParams.category = this.currentCategory;
+        }
+        if (this.currentType) {
+            moreAppsParams.type = this.currentType;
+        }
+        const moreApps = await this.api.getApps(moreAppsParams);

-        const moreApps = await this.api.getApps({ offset: this.loadedApps, limit: 12 });
+        const params = { offset: this.loadedApps, limit: 12 };
+        if (this.currentCategory && this.currentCategory !== 'all') {
+            params.category = this.currentCategory;
+        }
+        if (this.currentType) {
+            params.type = this.currentType;
+        }
+        const moreApps = await this.api.getApps(params);

Also applies to: 358-376

🤖 Prompt for AI Agents
In docs/md_v2/marketplace/marketplace.js around lines 214-275 (and also apply
same fix at 358-376), loadMainContent and load-more logic call this.api.getApps
without respect to this.currentCategory and this.currentType and never reset
paging; update every getApps call to include the active filters (e.g. pass {
category: this.currentCategory, type: this.currentType, limit, offset }) so the
initial apps, trending, and moreApps queries are filtered, ensure
load-more/pagination reuses the same filter object, and reset any paging/offset
counters to 0 whenever the category or type filter changes so the UI shows
filtered results from the first page.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (6)
docs/md_v2/assets/page_actions.css (1)

305-327: Restore mobile tap access by disabling overlay pointer events when inactive

On mobile, the media query forces .page-actions-overlay to display: block even when it isn’t active. Because pointer events stay enabled, this transparent overlay sits on top of the page and swallows all taps, breaking navigation. Default the overlay to pointer-events: none and only enable them in the active state.

 .page-actions-overlay {
     display: none;
     position: fixed;
     top: 0;
     left: 0;
     right: 0;
     bottom: 0;
     background: rgba(0, 0, 0, 0.5);
     z-index: 998;
     opacity: 0;
     transition: opacity 0.2s ease;
+    pointer-events: none;
 }
 
 .page-actions-overlay.active {
     display: block;
     opacity: 1;
+    pointer-events: auto;
 }
docs/md_v2/marketplace/marketplace.js (3)

162-175: Sanitize API data before injecting with innerHTML.

The hero card renders hero.name and hero.description directly into innerHTML. If the API returns malicious markup (e.g., <script> tags), this creates an XSS vector. Use textContent for plain text fields or sanitize HTML before rendering.


236-245: Sanitize API data before injecting with innerHTML.

Similar to the hero card, the apps grid injects app.name, app.category, and app.description directly into innerHTML, creating XSS vulnerabilities. Apply sanitization or use safer DOM methods.

Also applies to: 252-260, 267-275, 282-290, 348-357, 363-371


231-292: Apply category and type filters to API calls.

The loadMainContent method ignores this.currentCategory and this.currentType, so filter buttons and type dropdown have no effect. Pass these parameters to every getApps call.

Apply this diff:

-        const apps = await this.api.getApps({ limit: 8 });
+        const params = { limit: 8 };
+        if (this.currentCategory && this.currentCategory !== 'all') {
+            params.category = this.currentCategory;
+        }
+        if (this.currentType) {
+            params.type = this.currentType;
+        }
+        const apps = await this.api.getApps(params);

Also applies to: 279-279, 377-377

docs/md_v2/marketplace/backend/server.py (1)

97-108: Stop interpolating request data into SQL.

The get_apps handler builds WHERE clauses by f-stringing user-supplied values (category, type, featured, sponsored) directly into SQL, enabling trivial SQL injection. Switch to parameterized queries.

Update DatabaseManager.get_all to accept a params list and use ? placeholders:

def get_all(self, table: str, limit: int = 100, offset: int = 0, where: str = None, params: list = None) -> List[Dict]:
    cursor = self.conn.cursor()
    query = f"SELECT * FROM {table}"
    query_params = []
    if where:
        query += f" WHERE {where}"
        if params:
            query_params.extend(params)
    query += f" LIMIT ? OFFSET ?"
    query_params.extend([limit, offset])
    cursor.execute(query, query_params)
    rows = cursor.fetchall()
    return [dict(row) for row in rows]

Then in the handler:

where_clauses = []
params = []
if category:
    where_clauses.append("category = ?")
    params.append(category)
if type:
    where_clauses.append("type = ?")
    params.append(type)
# ... and so on
where = " AND ".join(where_clauses) if where_clauses else None
apps = db.get_all('apps', limit=limit, offset=offset, where=where, params=params)

Also applies to: 120-120, 137-137, 152-152, 176-176

docs/md_v2/marketplace/admin/admin.js (1)

190-213: Handle 204/empty responses before calling .json().

The apiCall method unconditionally parses JSON. DELETE operations (or other 204 responses) will throw "Unexpected end of JSON input", preventing deletions from working.

Apply this fix:

         if (response.status === 401) {
             this.logout();
             throw { status: 401 };
         }

         if (!response.ok) throw new Error(`API Error: ${response.status}`);
-        return response.json();
+
+        if (response.status === 204 || response.headers.get('Content-Length') === '0') {
+            return null;
+        }
+
+        const text = await response.text();
+        return text ? JSON.parse(text) : null;
     }
🧹 Nitpick comments (2)
docs/md_v2/marketplace/marketplace.js (1)

237-237: Avoid inline onclick with stringified data.

Using JSON.stringify().replace(/"/g, '&quot;') for inline handlers is fragile. Consider using event delegation or data attributes to avoid potential XSS and improve maintainability.

Example using data attributes:

-                <div class="app-compact" onclick="marketplace.showAppDetail(${JSON.stringify(app).replace(/"/g, '&quot;')})">
+                <div class="app-compact" data-app-id="${app.id}">

Then attach a delegated click handler:

document.getElementById('apps-grid').addEventListener('click', (e) => {
    const card = e.target.closest('[data-app-id]');
    if (card) {
        const app = this.data.apps.find(a => a.id == card.dataset.appId);
        if (app) this.showAppDetail(app);
    }
});

Also applies to: 268-268, 283-283

docs/md_v2/marketplace/admin/admin.js (1)

289-291: Consider event delegation for table actions.

While using numeric IDs in inline onclick handlers is safe here, event delegation would be more maintainable and eliminate any future XSS risk if string data is used.

Example:

// In setupEventListeners or after rendering:
document.getElementById('apps-table').addEventListener('click', (e) => {
    if (e.target.matches('.btn-edit')) {
        const row = e.target.closest('tr');
        const id = parseInt(row.dataset.id, 10);
        this.editItem('apps', id);
    }
    // Similar for duplicate and delete
});

Then update the table row:

<tr data-id="${app.id}">
    ...
    <button class="btn-edit">Edit</button>

Also applies to: 327-329, 361-362, 400-401

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9900f63 and 611d48f.

📒 Files selected for processing (11)
  • docs/blog/release-v0.7.4.md (0 hunks)
  • docs/md_v2/assets/page_actions.css (1 hunks)
  • docs/md_v2/assets/page_actions.js (1 hunks)
  • docs/md_v2/marketplace/admin/admin.css (1 hunks)
  • docs/md_v2/marketplace/admin/admin.js (1 hunks)
  • docs/md_v2/marketplace/admin/index.html (1 hunks)
  • docs/md_v2/marketplace/backend/server.py (1 hunks)
  • docs/md_v2/marketplace/backend/uploads/.gitignore (1 hunks)
  • docs/md_v2/marketplace/marketplace.css (1 hunks)
  • docs/md_v2/marketplace/marketplace.js (1 hunks)
  • mkdocs.yml (4 hunks)
💤 Files with no reviewable changes (1)
  • docs/blog/release-v0.7.4.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • docs/md_v2/marketplace/admin/index.html
  • docs/md_v2/marketplace/admin/admin.css
  • docs/md_v2/assets/page_actions.js
  • mkdocs.yml
🧰 Additional context used
🧬 Code graph analysis (3)
docs/md_v2/marketplace/admin/admin.js (4)
docs/md_v2/marketplace/marketplace.js (4)
  • window (2-9)
  • window (3-3)
  • origin (5-5)
  • resolveAssetUrl (11-18)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/marketplace.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/marketplace.js (4)
docs/md_v2/marketplace/admin/admin.js (2)
  • window (24-24)
  • resolveAssetUrl (39-46)
docs/md_v2/marketplace/frontend/marketplace.js (3)
  • CACHE_TTL (3-3)
  • API_BASE (2-2)
  • marketplace (392-392)
docs/md_v2/marketplace/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/frontend/app-detail.js (1)
  • API_BASE (2-2)
docs/md_v2/marketplace/backend/server.py (2)
docs/md_v2/marketplace/backend/database.py (3)
  • DatabaseManager (7-117)
  • get_all (80-89)
  • search (91-113)
docs/md_v2/marketplace/backend/config.py (1)
  • Config (30-59)
🪛 Ruff (0.13.3)
docs/md_v2/marketplace/backend/server.py

222-222: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


231-231: Do not perform function call File in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


307-307: Possible SQL injection vector through string-based query construction

(S608)


310-310: Consider moving this statement to an else block

(TRY300)


311-311: Do not catch blind exception: Exception

(BLE001)


312-312: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


325-325: Possible SQL injection vector through string-based query construction

(S608)


326-326: Consider [*list(app_data.values()), app_id] instead of concatenation

Replace with [*list(app_data.values()), app_id]

(RUF005)


328-328: Consider moving this statement to an else block

(TRY300)


329-329: Do not catch blind exception: Exception

(BLE001)


330-330: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


352-352: Possible SQL injection vector through string-based query construction

(S608)


355-355: Consider moving this statement to an else block

(TRY300)


356-356: Do not catch blind exception: Exception

(BLE001)


357-357: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


369-369: Possible SQL injection vector through string-based query construction

(S608)


370-370: Consider [*list(article_data.values()), article_id] instead of concatenation

Replace with [*list(article_data.values()), article_id]

(RUF005)


372-372: Consider moving this statement to an else block

(TRY300)


373-373: Do not catch blind exception: Exception

(BLE001)


374-374: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


395-395: Possible SQL injection vector through string-based query construction

(S608)


398-398: Consider moving this statement to an else block

(TRY300)


399-399: Do not catch blind exception: Exception

(BLE001)


400-400: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


412-412: Possible SQL injection vector through string-based query construction

(S608)


413-413: Consider [*list(category_data.values()), cat_id] instead of concatenation

Replace with [*list(category_data.values()), cat_id]

(RUF005)


415-415: Consider moving this statement to an else block

(TRY300)


416-416: Do not catch blind exception: Exception

(BLE001)


417-417: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


427-427: Consider moving this statement to an else block

(TRY300)


428-428: Do not catch blind exception: Exception

(BLE001)


429-429: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


439-439: Possible SQL injection vector through string-based query construction

(S608)


442-442: Consider moving this statement to an else block

(TRY300)


443-443: Do not catch blind exception: Exception

(BLE001)


444-444: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


452-452: Possible SQL injection vector through string-based query construction

(S608)


453-453: Consider [*list(sponsor_data.values()), sponsor_id] instead of concatenation

Replace with [*list(sponsor_data.values()), sponsor_id]

(RUF005)


455-455: Consider moving this statement to an else block

(TRY300)


456-456: Do not catch blind exception: Exception

(BLE001)


457-457: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


467-467: Consider moving this statement to an else block

(TRY300)


468-468: Do not catch blind exception: Exception

(BLE001)


469-469: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (7)
docs/md_v2/marketplace/backend/server.py (3)

230-252: LGTM: Secure file upload implementation.

The upload endpoint properly validates folder whitelist, content type, and file size. The secure filename generation using timestamp and token_hex prevents path traversal and collision attacks.
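A sketch of the timestamp-plus-token naming pattern referenced here; the helper name is hypothetical and the real endpoint may differ in details:

import secrets
import time
from pathlib import Path

def secure_filename(original_name: str) -> str:
    # Keep only the extension from the user-supplied name; everything else is
    # server-generated, which avoids path traversal and name collisions.
    suffix = Path(original_name).suffix.lower()
    return f"{int(time.time())}_{secrets.token_hex(8)}{suffix}"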


332-338: LGTM: Parameterized DELETE query.

The delete endpoint uses parameterized queries correctly, preventing SQL injection.

Also applies to: 376-382, 420-429, 460-469


63-83: LGTM: Robust type coercion utility.

The to_int helper safely handles various input types (bool, int, float, str) with proper fallbacks. Good defensive programming.
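Such a helper typically looks like the sketch below; this is an illustration of the described behavior, not the exact server.py code:

def to_int(value, default: int = 0) -> int:
    # bool is a subclass of int in Python, so handle it explicitly first.
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, (int, float)):
        return int(value)
    if isinstance(value, str):
        try:
            return int(float(value.strip()))
        except ValueError:
            return default
    return default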

docs/md_v2/marketplace/admin/admin.js (4)

2-37: LGTM: Flexible API origin configuration.

The API configuration logic supports URL parameter overrides with localStorage persistence, providing flexibility for development and testing scenarios. The fallback logic is robust.


648-693: LGTM: Well-structured save flow with image upload.

The saveItem method properly handles sponsor logo upload before saving, collects form data, and refreshes the UI after successful operations. Error handling is appropriate.


744-799: LGTM: Comprehensive file upload UI handling.

The logo upload handlers properly manage state transitions, show previews using FileReader, and handle both existing and new file scenarios. Good UX implementation.


813-824: LGTM: Referential integrity check before deletion.

The deleteCategory method prevents deletion of categories that have associated apps, maintaining data integrity. This is good defensive programming.

Comment on lines +307 to +312
cursor.execute(f"INSERT INTO apps ({columns}) VALUES ({placeholders})",
list(app_data.values()))
db.conn.commit()
return {"id": cursor.lastrowid, "message": "App created"}
except Exception as e:
raise HTTPException(status_code=400, detail=str(e))

⚠️ Potential issue | 🟠 Major

Fix SQL construction and exception handling.

While line 307 uses ? placeholders for values, constructing the column list from dict keys is still risky if those keys could be influenced by user input. Additionally, catching bare Exception and raising without chaining loses context.

Apply this diff to improve exception handling:

         cursor.execute(f"INSERT INTO apps ({columns}) VALUES ({placeholders})",
                       list(app_data.values()))
         db.conn.commit()
         return {"id": cursor.lastrowid, "message": "App created"}
     except Exception as e:
-        raise HTTPException(status_code=400, detail=str(e))
+        raise HTTPException(status_code=400, detail=str(e)) from e

Also applies to: 329-330, 356-357, 373-374, 399-400, 416-417, 443-444, 456-457, 468-469

🧰 Tools
🪛 Ruff (0.13.3)

307-307: Possible SQL injection vector through string-based query construction

(S608)


310-310: Consider moving this statement to an else block

(TRY300)


311-311: Do not catch blind exception: Exception

(BLE001)


312-312: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Comment on lines +325 to +330
cursor.execute(f"UPDATE apps SET {set_clause} WHERE id = ?",
list(app_data.values()) + [app_id])
db.conn.commit()
return {"message": "App updated"}
except Exception as e:
raise HTTPException(status_code=400, detail=str(e))

⚠️ Potential issue | 🟠 Major

Use parameterized UPDATE and improve list construction.

Line 325 builds the SET clause from dict keys, which is risky. Additionally, line 326 concatenates lists instead of using unpacking.

Apply these fixes:

         set_clause = ', '.join([f"{k} = ?" for k in app_data.keys()])
         cursor = db.conn.cursor()
         cursor.execute(f"UPDATE apps SET {set_clause} WHERE id = ?",
-                      list(app_data.values()) + [app_id])
+                      [*list(app_data.values()), app_id])
         db.conn.commit()
         return {"message": "App updated"}
     except Exception as e:
-        raise HTTPException(status_code=400, detail=str(e))
+        raise HTTPException(status_code=400, detail=str(e)) from e

Also applies to: 369-374, 412-417, 452-457

🧰 Tools
🪛 Ruff (0.13.3)

325-325: Possible SQL injection vector through string-based query construction

(S608)


326-326: Consider [*list(app_data.values()), app_id] instead of concatenation

Replace with [*list(app_data.values()), app_id]

(RUF005)


328-328: Consider moving this statement to an else block

(TRY300)


329-329: Do not catch blind exception: Exception

(BLE001)


330-330: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
In docs/md_v2/marketplace/backend/server.py around lines 325-330, the UPDATE
builds the SET clause unsafely from dict keys and constructs the parameter list
by concatenating lists; change to build a parameterized SET clause like ",
".join(f"{k} = ?" for k in app_data.keys()) and construct the parameters using
unpacking or append so the order matches the keys (e.g. params =
[*app_data.values(), app_id] or params = list(app_data.values());
params.append(app_id)), then call cursor.execute("UPDATE apps SET {set_clause}
WHERE id = ?", params) and commit; apply the same fix pattern to the other
occurrences at lines 369-374, 412-417, and 452-457.
