Conversation

@YorelN commented Sep 25, 2025

Summary

This PR updates the crawl request handling to correctly apply the PDFCrawlerStrategy and PDFContentScrapingStrategy only when the URLs point to PDF files. Previously, all URLs were treated as web pages by default, which caused PDFs to be misprocessed.

  • Ensures that PDFContentScrapingStrategy is used for PDF URLs when no custom scraping strategy is provided.
  • Leaves non-PDF URLs to be handled by the default LXMLWebScrapingStrategy.
  • Updates both streaming and non-streaming crawl request handlers.

This fixes cases where PDFs were downloaded but not processed correctly (for example, no content was extracted); a minimal sketch of the selection logic follows.
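Based on the description above, a minimal sketch of the selection logic, assuming the helper and strategy names from this PR (the real handlers in deploy/docker/api.py carry additional config plumbing, and constructor arguments may differ):

from crawl4ai import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def looks_like_pdf(url: str) -> bool:
    # Stand-in for the fuller is_pdf_url helper added in deploy/docker/utils.py,
    # which also checks the Content-Type header and the "%PDF" magic bytes.
    return url.lower().split("?", 1)[0].endswith(".pdf")

async def choose_strategies(urls, scraping_strategy=None):
    flags = [await looks_like_pdf(u) for u in urls]
    if all(flags):
        # PDF batch: crawl with the PDF crawler and, unless the caller supplied
        # a custom scraping strategy, scrape with the PDF content scraper.
        crawler_strategy = PDFCrawlerStrategy()
        scraping_strategy = scraping_strategy or PDFContentScrapingStrategy()
    elif any(flags):
        raise ValueError("Mix of PDF and non-PDF URLs is not supported")
    else:
        # Non-PDF URLs keep the default web crawler and LXMLWebScrapingStrategy.
        crawler_strategy = None
    return crawler_strategy, scraping_strategy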

List of files changed and why

  • api.py – Modifies handle_crawl_request and handle_stream_crawl_request to dynamically apply PDFCrawlerStrategy and PDFContentScrapingStrategy for PDF URLs.
  • __init__.py – Exports PDFCrawlerStrategy and PDFContentScrapingStrategy.
  • crawler_pool.py – Updates _sig() and get_crawler() to accept and key on a crawler_strategy.
  • demo_docker_api.py – Adds examples/demos illustrating PDF crawling usage.
  • utils.py – Adds is_pdf_url so the handlers can detect PDF URLs and switch strategies.

How Has This Been Tested?

  • Tested crawling a PDF URL (https://arxiv.org/pdf/2310.06825) with and without a custom scraping strategy.
  • Verified that the PDF content is correctly extracted and returned in the response.
  • Non-PDF URLs continue to be processed with the standard web scraping strategy.
  • Unit tests for serialization and PDF content handling were run locally.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Summary by CodeRabbit

  • New Features

    • Automatic PDF detection and dedicated PDF crawling/content extraction for regular and streaming endpoints; PDF content is included in results (base64-encoded when applicable) and returned in JSON (a client-side decode sketch follows this list).
  • Maintenance

    • Strategy-aware routing for PDF workloads, crawler pool lifecycle improvements with idle pruning and bulk close, and consistent result serialization.
  • Chores

    • Package API expanded to expose PDF crawling/scraping utilities and PDF URL detection helper.
  • Documentation

    • Demo examples showing PDF crawl and streaming usage.
  • Tests

    • End-to-end tests covering PDF extraction, metadata, and inaccessible-PDF handling.
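As a usage illustration, a hedged client-side sketch of consuming that JSON: it assumes the default Docker port (11235), the /crawl payload shape used in this PR's tests, and a base64-encoded pdf field on each result; the exact response shape may differ.

from base64 import b64decode

import httpx

payload = {
    "urls": ["https://arxiv.org/pdf/2310.06825"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "BYPASS"}},
}

resp = httpx.post("http://localhost:11235/crawl", json=payload, timeout=120.0)
resp.raise_for_status()
for result in resp.json().get("results", []):
    # PDF bytes are base64-encoded server-side so they survive JSON serialization.
    if result.get("pdf"):
        pdf_bytes = b64decode(result["pdf"])
        print(result.get("url"), "->", len(pdf_bytes), "bytes of PDF")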

coderabbitai bot commented Sep 25, 2025

Walkthrough

Adds PDF-aware crawling: exposes PDF crawler/scraping strategies; API detects PDF URLs and selects PDF-specific strategies, encodes PDF content for JSON; crawler pool becomes strategy-aware and gains lifecycle helpers; adds async PDF detection utility, demo examples, and end-to-end PDF tests.

Changes

  • Public API exports (crawl4ai/__init__.py): Imports and exposes PDFCrawlerStrategy and PDFContentScrapingStrategy in __all__.
  • API: PDF detection, strategy selection & serialization (deploy/docker/api.py): Detects PDF URLs via is_pdf_url, rejects mixed PDF/non-PDF sets, selects/forwards crawler_strategy (PDF vs non-PDF), defaults scraping_strategy to PDFContentScrapingStrategy when appropriate, normalizes result shapes, base64-encodes PDF bytes, and uses jsonable_encoder; imports updated.
  • Crawler pool: strategy-aware pooling & maintenance (deploy/docker/crawler_pool.py): Adds optional crawler_strategy to _sig and get_crawler, includes the strategy in the pool signature and crawler instantiation, refines pool cleanup, and adds close_all() and janitor() helpers for lifecycle maintenance.
  • PDF detection utility (deploy/docker/utils.py): Adds async def is_pdf_url(url: str) -> bool, which heuristically checks the extension, the HEAD Content-Type, and ranged-GET magic bytes using httpx; returns False on errors (a sketch follows this list).
  • Docs: PDF demos (docs/examples/docker/demo_docker_api.py): Adds a PDF_URL constant and demo functions demo_pdf_crawl and demo_pdf_crawl_stream showing PDF crawl/stream usage with PDFContentScrapingStrategy; the demos are available but not enabled by default.
  • Tests: REST PDF crawl (tests/docker/test_rest_api_pdf_crawl.py): New end-to-end tests for PDF crawling covering successful extraction and non-accessible-PDF behavior, with health checks, fixtures, and result-structure assertions.
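For reference, a sketch of the detection heuristic summarized above for deploy/docker/utils.py; timeouts and error handling in the actual implementation may differ.

import httpx

async def is_pdf_url(url: str) -> bool:
    """Heuristically decide whether a URL points to a PDF."""
    try:
        # 1. Cheap check: file extension (ignoring any query string).
        if url.lower().split("?", 1)[0].endswith(".pdf"):
            return True
        async with httpx.AsyncClient(follow_redirects=True, timeout=10.0) as client:
            # 2. HEAD request: inspect the Content-Type header.
            head = await client.head(url)
            if "application/pdf" in head.headers.get("content-type", "").lower():
                return True
            # 3. Ranged GET: PDF files start with the magic bytes "%PDF".
            # (Servers that ignore Range simply return the full body.)
            ranged = await client.get(url, headers={"Range": "bytes=0-4"})
            return ranged.content.startswith(b"%PDF")
    except Exception:
        # Any network or parsing error: treat the URL as "not a PDF".
        return False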

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant API as API (/crawl)
  participant Utils as utils.is_pdf_url
  participant Pool as crawler_pool.get_crawler
  participant Crawler as AsyncWebCrawler
  participant Encoder as jsonable_encoder

  Client->>API: POST /crawl { urls, scraping_strategy? }
  API->>Utils: is_pdf_url(urls...)
  Utils-->>API: PDFs present? (bool)
  alt PDF detected
    API->>API: set crawler_strategy = PDFCrawlerStrategy
    API->>API: default scraping_strategy = PDFContentScrapingStrategy (if none)
  else Not PDF
    API->>API: use provided/default non-PDF strategy
  end
  API->>Pool: get_crawler(browser_config, crawler_strategy?)
  Pool-->>API: AsyncWebCrawler instance
  API->>Crawler: crawl(urls, scraping_strategy, options)
  Crawler-->>API: results (list/dict, PDFs as bytes)
  API->>API: base64-encode PDF bytes in results
  API->>Encoder: jsonable_encoder(results)
  API-->>Client: JSON response

sequenceDiagram
  autonumber
  participant Client
  participant API as API (/stream/crawl)
  participant Utils as utils.is_pdf_url
  participant Pool as crawler_pool.get_crawler
  participant Crawler as AsyncWebCrawler
  participant Encoder as jsonable_encoder

  Client->>API: POST /stream/crawl { urls, ... }
  API->>Utils: is_pdf_url(urls...)
  Utils-->>API: PDFs present? (bool)
  API->>Pool: get_crawler(browser_config, crawler_strategy?)
  Pool-->>API: AsyncWebCrawler
  API->>Crawler: stream_crawl(...)
  loop per event
    Crawler-->>API: partial result
    API->>API: base64-encode PDF bytes (if any)
    API->>Encoder: jsonable_encoder(partial)
    API-->>Client: event chunk
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through links and pages blue,
sniff for "%PDF" in every view.
Strategies set, pool doors swing wide,
bytes turn base64 for the ride.
A rabbit's cheers for the crawl inside. 🐇📄

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 42.86%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check (❓ Inconclusive): The title “Feat/docker PDF strategy” mentions the addition of a PDF strategy in the Docker context but lacks a clear verb and does not fully convey the change’s intent or scope, making it ambiguous for readers scanning the history. Resolution: revise the title to a concise, descriptive sentence that clearly states what was added or changed, for example “Add PDF crawling strategy to Docker API endpoints.”
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): The description matches the repository’s template: a clear summary of the change, a detailed list of files modified with reasons, a thorough testing section, and a completed checklist verifying style, documentation, and tests.


coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
docs/examples/docker/demo_docker_api.py (1)

21-26: Fix duplicate assignments of BASE_URL and SIMPLE_URL.

Second assignments override the first and can confuse users.

-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")
-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")
+BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")
@@
-SIMPLE_URL = "https://example.com"  # For demo purposes
-SIMPLE_URL = "https://httpbin.org/html"
+SIMPLE_URL = "https://httpbin.org/html"
deploy/docker/api.py (3)

475-501: Single-URL path returns a non-iterable; serialization shape changed unexpectedly.

  • arun returns a single result (not iterable); iterating over it is a bug.
  • Wrapping with to_serializable_dict changes response structure ({"type":"dict","value":...}). Streaming still returns plain dicts — inconsistent API.

Apply this diff to fix both:

-        results = await partial_func()
+        results = await partial_func()
+        results_list = results if isinstance(results, list) else [results]
@@
-        for result in results:
+        for result in results_list:
             result_dict = result.model_dump()
             # If PDF exists, encode it to base64
             if result_dict.get('pdf') is not None:
                 result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
-
-            processed_results.append(to_serializable_dict(result_dict))
+            # Keep response shape consistent with streaming (plain JSON-serializable dict)
+            processed_results.append(jsonable_encoder(result_dict))

559-564: Respect rate limiter enabled flag (parity with non-streaming).

Streaming path always enables the rate limiter; non-streaming honors config flag.

-        dispatcher = MemoryAdaptiveDispatcher(
-            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
-            rate_limiter=RateLimiter(
-                base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
-            )
-        )
+        dispatcher = MemoryAdaptiveDispatcher(
+            memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
+            rate_limiter=RateLimiter(
+                base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
+            ) if config["crawler"]["rate_limiter"]["enabled"] else None
+        )

541-546: Normalize URLs (add scheme) in streaming path.

Non-streaming adds https:// when missing; streaming does not, causing failures.

         browser_config = BrowserConfig.load(browser_config)
@@
         crawler_config = CrawlerRunConfig.load(crawler_config)
         crawler_config.stream = True
+
+        # Normalize URLs to include scheme (match non-streaming behavior)
+        urls = [('https://' + url) if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")) else url for url in urls]
🧹 Nitpick comments (11)
deploy/docker/crawler_pool.py (4)

29-31: Use a non-deprecated hash and avoid security-linter noise.

sha1 is flagged as insecure. While used as a key, prefer sha256 for hygiene.

-    return hashlib.sha1(json_payload.encode()).hexdigest()
+    return hashlib.sha256(json_payload.encode()).hexdigest()

41-47: Normalize message punctuation and avoid duplicate wrapping.

  • Replace EN DASH with hyphen to avoid linter noise.
  • The except MemoryError re-wrap adds redundant prefix.
-            if psutil.virtual_memory().percent >= MEM_LIMIT:
-                raise MemoryError("RAM pressure – new browser denied")
+            if psutil.virtual_memory().percent >= MEM_LIMIT:
+                raise MemoryError("RAM pressure - new browser denied")
@@
-    except MemoryError as e:
-        raise MemoryError(f"RAM pressure – new browser denied: {e}")
+    except MemoryError as e:
+        raise

37-43: Avoid heavy work under the global lock (scalability).

Creating/starting the crawler under LOCK blocks all callers. Use a two-phase approach (check and insert a placeholder under the lock, create/start outside it, then finalize under the lock), or store an asyncio.Task/Future in POOL to dedupe concurrent requests for the same signature; a sketch of the Future approach follows.
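A sketch of the Future-based dedupe idea, assuming module globals that mirror crawler_pool.py (POOL, LAST_USED, LOCK) and a hypothetical make_crawler coroutine standing in for the browser start; this is not the repository's code.

import asyncio
import time

POOL: dict = {}        # sig -> started crawler (mirrors crawler_pool.POOL)
LAST_USED: dict = {}   # sig -> last-use timestamp
PENDING: dict = {}     # sig -> asyncio.Future for in-flight creations
LOCK = asyncio.Lock()

async def get_or_create(sig: str, make_crawler):
    async with LOCK:
        if sig in POOL:
            LAST_USED[sig] = time.time()
            return POOL[sig]
        fut = PENDING.get(sig)
        creator = fut is None
        if creator:
            fut = asyncio.get_running_loop().create_future()
            PENDING[sig] = fut
    if not creator:
        # Another task is already starting this crawler; wait on the shared future.
        return await fut
    try:
        crawler = await make_crawler()  # heavy start happens outside LOCK
        async with LOCK:
            POOL[sig] = crawler
            LAST_USED[sig] = time.time()
        fut.set_result(crawler)
        return crawler
    except Exception as exc:
        fut.set_exception(exc)
        raise
    finally:
        async with LOCK:
            PENDING.pop(sig, None)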


38-38: Remove stray semicolon.

Style/lint nit.

-                LAST_USED[sig] = time.time();  
+                LAST_USED[sig] = time.time()
docs/examples/docker/demo_docker_api.py (3)

1289-1296: Align PDF demo output with API payload shape.

The PDF result typically exposes extracted_content; text may be nested. Use a safe fallback.

-    if data.get("results"):
-        first = data["results"][0]
-        text_snippet = (first.get("text") or "")[:500]
-        print("Extracted text (first 500 chars):")
-        print(text_snippet)
+    if data.get("results"):
+        first = data["results"][0]
+        text = first.get("extracted_content") or first.get("text") or ""
+        if isinstance(text, dict):
+            text = text.get("text") or ""
+        print("Extracted text (first 500 chars):")
+        print((text or "")[:500])

1311-1317: Comment contradicts code.

You're explicitly setting a PDF strategy; remove “Default strategy if not set” in the comment.


30-30: Nit: add .pdf suffix for clarity (optional).

The detector handles content-type; adding .pdf can avoid an extra HEAD/GET sometimes.

tests/docker/test_rest_api_pdf_crawl.py (2)

15-16: Remove duplicate BASE_URL assignment.

Second line overrides the first; keep one.

-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235") # If server is running in Docker, use the host's IP
-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020") # If server is running in dev debug mode
+BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")

99-103: Be robust to string vs dict extracted_content.

API may return extracted_content as a string. Guard before .get.

-        extracted_text = result["extracted_content"].get("text", "")
-        assert isinstance(extracted_text, str)
-        assert len(extracted_text) > 0
+        content = result.get("extracted_content")
+        extracted_text = content.get("text", "") if isinstance(content, dict) else (content or "")
+        assert isinstance(extracted_text, str) and len(extracted_text) > 0
deploy/docker/api.py (2)

447-454: Default PDF scraping strategies differ between non-stream and stream.

Non-stream sets extract_images=False; streaming sets True. If unintentional, align defaults.


492-500: Consider using jsonable_encoder for nested non-JSON types.

If model_dump contains datetimes/enums, jsonable_encoder ensures proper conversion.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e651e04 and c709082.

📒 Files selected for processing (6)
  • crawl4ai/__init__.py (2 hunks)
  • deploy/docker/api.py (7 hunks)
  • deploy/docker/crawler_pool.py (3 hunks)
  • deploy/docker/utils.py (2 hunks)
  • docs/examples/docker/demo_docker_api.py (3 hunks)
  • tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
deploy/docker/api.py (4)
deploy/docker/utils.py (1)
  • is_pdf_url (130-155)
crawl4ai/processors/pdf/__init__.py (2)
  • PDFCrawlerStrategy (10-29)
  • PDFContentScrapingStrategy (31-192)
crawl4ai/async_configs.py (1)
  • to_serializable_dict (44-115)
deploy/docker/crawler_pool.py (1)
  • get_crawler (33-57)
crawl4ai/__init__.py (1)
crawl4ai/processors/pdf/__init__.py (2)
  • PDFCrawlerStrategy (10-29)
  • PDFContentScrapingStrategy (31-192)
deploy/docker/crawler_pool.py (3)
crawl4ai/async_webcrawler.py (1)
  • AsyncWebCrawler (53-852)
crawl4ai/async_configs.py (1)
  • BrowserConfig (329-621)
deploy/docker/utils.py (1)
  • load_config (23-41)
🪛 Ruff (0.13.1)
tests/docker/test_rest_api_pdf_crawl.py

28-28: Consider moving this statement to an else block

(TRY300)

deploy/docker/utils.py

152-152: Do not catch blind exception: Exception

(BLE001)

deploy/docker/crawler_pool.py

30-30: Probable use of insecure hash functions in hashlib: sha1

(S324)


38-38: Statement ends with an unnecessary semicolon

Remove unnecessary semicolon

(E703)


41-41: Abstract raise to an inner function

(TRY301)


41-41: Avoid specifying long messages outside the exception class

(TRY003)


41-41: String contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)

🔇 Additional comments (2)
crawl4ai/__init__.py (2)

14-15: LGTM: Publicly exporting PDF strategies.

Export additions look correct and match usages in API and demos.

Please confirm crawl4ai/processors/pdf/__init__.py is part of the installable package (MANIFEST/pyproject) so these symbols are always available.


134-136: LGTM: __all__ includes new PDF symbols.

Consistent with imports.

coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c709082 and 481e7fe.

📒 Files selected for processing (1)
  • tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
tests/docker/test_rest_api_pdf_crawl.py

28-28: Consider moving this statement to an else block

(TRY300)

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (7)
deploy/docker/utils.py (1)

121-129: Avoid blind exception catch in verify_email_domain.

Catching Exception hides real issues; narrow to dns.resolver.* and log.

-    except Exception as e:
-        return False
+    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN, dns.resolver.NoNameservers, dns.exception.DNSException):
+        return False
docs/examples/docker/demo_docker_api.py (1)

1287-1297: Use rich Console for output to stay consistent with the rest of the demo.

Replace print calls with console.print panels/snippets.

tests/docker/test_rest_api_pdf_crawl.py (1)

101-116: Add a test for mixed URL types (should 400).

API now rejects mixed PDF/non-PDF batches; add coverage.

Proposed test:

@pytest.mark.asyncio
async def test_pdf_mixed_urls_rejected(async_client: httpx.AsyncClient):
    payload = {
        "urls": [PDF_TEST_URL, "https://example.com"],
        "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
        "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "BYPASS"}}
    }
    resp = await async_client.post("/crawl", json=payload)
    assert resp.status_code == 400
    detail = resp.json().get("detail", "")
    assert "Mix of PDF and non-PDF URLs" in detail
deploy/docker/crawler_pool.py (2)

18-31: Use fully-qualified strategy identifier in pool signature.

Class name alone can collide across modules; hash module + qualname.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        cls = crawler_strategy.__class__
+        payload["strategy"] = f"{cls.__module__}.{cls.__qualname__}"

33-50: Exception handling nits: remove unused var, preserve traceback.

Tidy per Ruff BLE001/B904 and avoid semicolons.

-            POOL[sig] = crawler; LAST_USED[sig] = time.time()
+            POOL[sig] = crawler
+            LAST_USED[sig] = time.time()
             return crawler
-    except MemoryError as e:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise
+    except Exception as e:
+        raise RuntimeError(f"Failed to start browser: {e}") from e
deploy/docker/api.py (2)

443-452: Compute is_pdf with all(...) for clarity and correctness.

You already reject mixed sets; using all() communicates intent and avoids confusion.

-        is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
-        is_pdf = any(is_pdf_flags)
+        is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
+        is_pdf = all(is_pdf_flags)
         if any(is_pdf_flags) and not all(is_pdf_flags):

554-566: Apply all(...) in streaming path too to mirror non-streaming logic.

Keeps behavior consistent.

-        is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
-        is_pdf = any(is_pdf_flags)
+        is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
+        is_pdf = all(is_pdf_flags)
         if any(is_pdf_flags) and not all(is_pdf_flags):
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 726f41a and 1636383.

📒 Files selected for processing (5)
  • deploy/docker/api.py (7 hunks)
  • deploy/docker/crawler_pool.py (2 hunks)
  • deploy/docker/utils.py (2 hunks)
  • docs/examples/docker/demo_docker_api.py (3 hunks)
  • tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/docker/api.py (3)
deploy/docker/utils.py (1)
  • is_pdf_url (130-157)
crawl4ai/processors/pdf/__init__.py (2)
  • PDFCrawlerStrategy (10-29)
  • PDFContentScrapingStrategy (31-192)
deploy/docker/crawler_pool.py (1)
  • get_crawler (33-59)
deploy/docker/crawler_pool.py (3)
crawl4ai/async_webcrawler.py (1)
  • AsyncWebCrawler (53-852)
crawl4ai/async_configs.py (1)
  • BrowserConfig (329-621)
deploy/docker/utils.py (1)
  • load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/api.py

446-449: Abstract raise to an inner function

(TRY301)


560-563: Abstract raise to an inner function

(TRY301)

deploy/docker/crawler_pool.py

42-42: Abstract raise to an inner function

(TRY301)


42-42: Avoid specifying long messages outside the exception class

(TRY003)


45-45: Multiple statements on one line (semicolon)

(E702)


47-47: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


49-49: Do not catch blind exception: Exception

(BLE001)


50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


50-50: Avoid specifying long messages outside the exception class

(TRY003)

tests/docker/test_rest_api_pdf_crawl.py

30-30: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (1)
deploy/docker/api.py (1)

498-508: LGTM: JSON-safe result shaping with base64 PDF bytes.

Good normalization and use of jsonable_encoder maintains streaming/non-streaming parity.

Please confirm no nested Pydantic models remain in result_dict that would need model_dump before encoding.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
docs/examples/docker/demo_docker_api.py (6)

28-28: Make PDF_URL configurable via env for reliability.

Allow overriding the demo PDF URL without code edits (useful behind firewalls or when the remote file changes).

Apply:

-PDF_URL = "https://arxiv.org/pdf/2310.06825.pdf" # For PDF demo
+PDF_URL = os.getenv("PDF_URL", "https://arxiv.org/pdf/2310.06825.pdf")  # For PDF demo

1265-1283: Optionally showcase backend auto-detection by omitting scraping_strategy.

Since the API now selects PDF strategies for .pdf URLs when no custom strategy is provided, add a variant (or toggle) that relies on auto‑detection to demonstrate the new default behavior.

Apply:

         "crawler_config": {
             "type": "CrawlerRunConfig",
             "params": {
                 "cache_mode": "BYPASS",
-                "scraping_strategy": {
-                    "type": "PDFContentScrapingStrategy",
-                    "params": {
-                        "extract_images": False,
-                        "save_images_locally": False,
-                        "batch_size": 2
-                    }
-                }
             }
         }

1285-1298: Reuse make_request for consistency and richer error output.

Unifies logging, timing, and error handling with other demos.

Apply:

-    resp = await client.post("/crawl", json=payload)
-    resp.raise_for_status()
-    data = resp.json()
-    print("=== Demo: PDF Crawl ===")
-    print("Success:", data.get("success"))
-    print("Number of results:", len(data.get("results", [])))
-    if data.get("results"):
-        first = data["results"][0]
-        text = first.get("extracted_content") or first.get("text") or ""
-        if isinstance(text, dict):
-            text = text.get("text") or ""
-        print("Extracted text (first 500 chars):")
-        print((text or "")[:500])
+    results = await make_request(client, "/crawl", payload, "Demo PDF: Crawl")
+    if results:
+        first = results[0]
+        text = first.get("extracted_content") or first.get("text") or ""
+        if isinstance(text, dict):
+            text = text.get("text") or ""
+        console.print(Panel((text or "")[:500], title="Extracted text (first 500 chars)",
+                            border_style="cyan", expand=False))

1306-1324: Optionally rely on auto-detection in streaming demo too.

Mirror the non‑streaming suggestion to demonstrate default PDF handling.

Apply:

             "params": {
                 "stream": True,
                 "cache_mode": "BYPASS",
-                "scraping_strategy": {
-                    "type": "PDFContentScrapingStrategy",
-                    "params": {
-                        "extract_images": False,     
-                        "save_images_locally": False,
-                        "batch_size": 2              
-                    }
-                }
             }

1326-1331: Enhance streaming UX: show text chunks when present.

Extend stream_request to print short snippets when a streamed line includes extracted text (common for PDFs).

You can update stream_request (outside this hunk) along these lines:

# inside the async for line in response.aiter_lines() loop, after parsing `data`
if isinstance(data, dict):
    text_chunk = data.get("extracted_content") or data.get("text")
    if isinstance(text_chunk, dict):
        text_chunk = text_chunk.get("text")
    if isinstance(text_chunk, str) and text_chunk.strip():
        snippet = text_chunk[:200].replace("\n", " ")
        console.print(f"    [cyan]Text chunk:[/] {snippet}...")

1366-1369: Gate PDF demos with an env toggle instead of comments.

Keeps them off by default but easy to enable in CI/local runs.

Apply:

-        # await demo_pdf_crawl_stream(client)
-        # await demo_pdf_crawl(client)
+        if os.getenv("RUN_PDF_DEMOS") == "1":
+            await demo_pdf_crawl_stream(client)
+            await demo_pdf_crawl(client)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1636383 and ec2f88a.

📒 Files selected for processing (2)
  • deploy/docker/utils.py (2 hunks)
  • docs/examples/docker/demo_docker_api.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deploy/docker/utils.py
🔇 Additional comments (3)
docs/examples/docker/demo_docker_api.py (3)

1265-1283: browser_config now present — resolves prior validation issue.

This addresses the earlier 422 risk and aligns with other demos.


1306-1324: browser_config present in streaming payload — resolved.

This fixes the previously flagged schema mismatch for streaming.


1265-1283: Ignore addition of crawler_strategy in demo payload
The REST handler doesn’t accept a crawler_strategy field—it's determined server-side from the PDF flag.

Likely an incorrect or invalid review comment.

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (6)
deploy/docker/crawler_pool.py (6)

18-30: Make strategy fingerprint more robust to avoid pool key collisions.

Hashing only the class name may collide across differently configured instances of the same strategy. Prefer using a stable per-strategy signature or config when available, with class name as a fallback.

Apply this diff:

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        # Prefer a stable, user-provided signature/config if present
+        strat_sig = getattr(crawler_strategy, "signature", None)
+        if callable(strat_sig):
+            try:
+                strat_sig = strat_sig()
+            except Exception:
+                strat_sig = None
+        if strat_sig is None:
+            strat_cfg = getattr(crawler_strategy, "to_dict", None)
+            if callable(strat_cfg):
+                with suppress(Exception):
+                    strat_sig = strat_cfg()
+        payload["strategy"] = strat_sig or crawler_strategy.__class__.__name__

47-51: Tighten exception handling and preserve the original cause.

Remove the unused variable, avoid blind catch if possible, and chain the original error.

Apply this diff:

-    except MemoryError as e:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise
+    except Exception as e:
+        # Consider narrowing the exception types if feasible.
+        raise RuntimeError("Failed to start browser") from e

45-45: Split statements; fix Ruff E702.

Avoid multiple statements on one line.

Apply this diff:

-            POOL[sig] = crawler; LAST_USED[sig] = time.time()
+            POOL[sig] = crawler
+            LAST_USED[sig] = time.time()

60-64: Close crawlers outside the lock to avoid blocking get_crawler.

Awaiting .close() while holding LOCK can stall all callers.

Apply this diff:

-async def close_all():
-    async with LOCK:
-        await asyncio.gather(*(c.close() for c in POOL.values()), return_exceptions=True)
-        POOL.clear(); LAST_USED.clear()
+async def close_all():
+    async with LOCK:
+        crawlers = list(POOL.values())
+        POOL.clear()
+        LAST_USED.clear()
+    await asyncio.gather(*(c.close() for c in crawlers), return_exceptions=True)

65-73: Avoid awaiting close() inside janitor’s critical section.

Collect stale entries under LOCK, release it for the close awaits, then clean maps under LOCK. Also fix semicolons.

Apply this diff:

 async def janitor():
     while True:
         await asyncio.sleep(60)
         now = time.time()
-        async with LOCK:
-            for sig, crawler in list(POOL.items()):
-                if now - LAST_USED[sig] > IDLE_TTL:
-                    with suppress(Exception): await crawler.close()
-                    POOL.pop(sig, None); LAST_USED.pop(sig, None)
+        to_close = []
+        async with LOCK:
+            for sig, crawler in list(POOL.items()):
+                if now - LAST_USED.get(sig, 0) > IDLE_TTL:
+                    to_close.append((sig, crawler))
+        for sig, crawler in to_close:
+            with suppress(Exception):
+                await crawler.close()
+        async with LOCK:
+            for sig, _ in to_close:
+                POOL.pop(sig, None)
+                LAST_USED.pop(sig, None)

33-33: Type hint crawler_strategy to the expected protocol/type.

Optional[object] is too broad; prefer the concrete AsyncCrawlerStrategy (or a minimal Protocol exposing signature()/to_dict()) to improve correctness and editor support.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec2f88a and 9681567.

📒 Files selected for processing (1)
  • deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
crawl4ai/async_webcrawler.py (1)
  • AsyncWebCrawler (53-852)
crawl4ai/async_configs.py (1)
  • BrowserConfig (329-621)
deploy/docker/utils.py (1)
  • load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py

42-42: Abstract raise to an inner function

(TRY301)


42-42: Avoid specifying long messages outside the exception class

(TRY003)


45-45: Multiple statements on one line (semicolon)

(E702)


47-47: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


49-49: Do not catch blind exception: Exception

(BLE001)


50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


50-50: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
deploy/docker/crawler_pool.py (1)

52-59: Good fix for prior UnboundLocal/locking concerns.

Initializing sig and guarding POOL/LAST_USED updates under LOCK resolves the earlier crash/race noted in previous review.

from typing import Dict, Optional
from crawl4ai import AsyncWebCrawler, BrowserConfig
from typing import Dict
from utils import load_config
coderabbitai bot commented on the import lines above:

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Verify import path reliability for utils.

Absolute import from utils import load_config may break if deploy/docker isn't on sys.path. Prefer a package-relative import or confirm deploy/docker is a package with __init__.py and the entrypoint runs from this dir.

Run:


🏁 Script executed:

#!/bin/bash
# Check if deploy/docker is a package and whether utils is importable.
set -euo pipefail
if [ -f "deploy/docker/__init__.py" ]; then
  echo "deploy/docker is a package."
else
  echo "deploy/docker is NOT a package; relative imports won't work."
fi
echo "Listing files next to crawler_pool.py:"
ls -la deploy/docker || true
echo "Searching for demo entrypoint:"
fd -a --exact-path demo_docker_api.py || true

Length of output: 1558


Convert utils import to package-relative

  • Add an empty deploy/docker/__init__.py to make it a package
  • In deploy/docker/crawler_pool.py, change
    from utils import load_config
    to
    from .utils import load_config
🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around line 6, the import uses a top-level
module import "from utils import load_config" which breaks package-relative
resolution; add an empty deploy/docker/__init__.py to make the directory a
package and change the import to use a relative import (from .utils import
load_config) so the module is resolved within the package namespace.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
deploy/docker/crawler_pool.py (4)

18-30: Pool key should include strategy module and config, not only class name.

Class name alone can collide; different strategy instances/configs will be coalesced.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        strategy_cls = crawler_strategy.__class__
+        payload["strategy"] = f"{strategy_cls.__module__}.{strategy_cls.__name__}"
+        # Include config when available to avoid reusing incompatible crawlers
+        if hasattr(crawler_strategy, "to_dict"):
+            payload["strategy_config"] = crawler_strategy.to_dict()
+        else:
+            payload["strategy_repr"] = repr(crawler_strategy)

76-79: Avoid double-wrapping exceptions; preserve original traceback.

Outer catch re-wraps errors already handled inside; also flagged by Ruff (BLE001/B904/TRY003).

-    except MemoryError:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise

91-95: Don’t await close() while holding the pool lock.

Release the lock before awaiting to avoid head‑of‑line blocking other callers.

 async def close_all():
-    async with LOCK:
-        await asyncio.gather(*(c.close() for c in POOL.values()), return_exceptions=True)
-        POOL.clear(); LAST_USED.clear()
+    async with LOCK:
+        crawlers = list(POOL.values())
+        POOL.clear(); LAST_USED.clear()
+    await asyncio.gather(*(c.close() for c in crawlers), return_exceptions=True)

96-105: Janitor awaits under LOCK; restructure to close outside the lock.

Prevents blocking get_crawler and other pool ops during close.

 async def janitor():
     while True:
         await asyncio.sleep(60)
         now = time.time()
-        async with LOCK:
-            for sig, crawler in list(POOL.items()):
-                if now - LAST_USED[sig] > IDLE_TTL:
-                    with suppress(Exception): await crawler.close()
-                    POOL.pop(sig, None); LAST_USED.pop(sig, None)
+        async with LOCK:
+            expired = [(sig, POOL[sig]) for sig, ts in list(LAST_USED.items()) if now - ts > IDLE_TTL]
+        for sig, crawler in expired:
+            with suppress(Exception):
+                await crawler.close()
+            async with LOCK:
+                POOL.pop(sig, None); LAST_USED.pop(sig, None)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9681567 and 87f35b8.

📒 Files selected for processing (1)
  • deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (2)
crawl4ai/async_webcrawler.py (1)
  • AsyncWebCrawler (53-852)
crawl4ai/async_configs.py (1)
  • BrowserConfig (329-621)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py

51-51: Abstract raise to an inner function

(TRY301)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


68-68: Avoid specifying long messages outside the exception class

(TRY003)


78-78: Do not catch blind exception: Exception

(BLE001)


79-79: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


79-79: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (2)
deploy/docker/crawler_pool.py (2)

6-6: Use package‑relative import to avoid runtime import errors.

Switch to a relative import so this works when deploy/docker is a package.

-from utils import load_config 
+from .utils import load_config

81-89: Good: pool mutations guarded with LOCK in finally.

Prevents races when updating LAST_USED or cleaning up after failures.

coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87f35b8 and 2b80f98.

📒 Files selected for processing (1)
  • deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
crawl4ai/async_webcrawler.py (1)
  • AsyncWebCrawler (53-852)
crawl4ai/async_configs.py (1)
  • BrowserConfig (329-621)
deploy/docker/utils.py (1)
  • load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py

59-59: Abstract raise to an inner function

(TRY301)


59-59: Avoid specifying long messages outside the exception class

(TRY003)


59-59: String contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)


77-77: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Do not catch blind exception: Exception

(BLE001)


88-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


88-88: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +25 to +27
    if crawler_strategy is not None:
        payload["strategy"] = crawler_strategy.__class__.__name__

coderabbitai bot commented on the lines above:

⚠️ Potential issue | 🟠 Major

Include strategy configuration in the pool signature.

Keying pooled crawlers only by crawler_strategy.__class__.__name__ collapses all instances of the same strategy class into a single slot. If two callers supply the same strategy class with different constructor arguments (e.g., PDFCrawlerStrategy(ocr=True) vs PDFCrawlerStrategy(ocr=False)), the second caller will be handed the already-started crawler that still holds the first instance, so their requested configuration is silently ignored. That’s a functional bug for any strategy with mutable/per-instance configuration. Please extend the signature to encode the strategy’s effective settings (e.g., module-qualified name plus a stable serialization of its config) so distinct instances don’t collide.

🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around lines 25 to 27, the pool signature only
keys strategies by crawler_strategy.__class__.__name__, causing different
instances of the same class with different constructor arguments to collide; fix
this by including a stable, deterministic encoding of the instance configuration
in the payload key: derive the module-qualified class name (e.g.,
strategy.__class__.__module__ + "." + strategy.__class__.__name__) and a
serializable representation of the strategy's effective settings (prefer a
to_dict()/asdict() method on the strategy or fall back to
vars()/inspect.getfullargspec/init args), then JSON-serialize that config with
sort_keys=True (and avoid non-serializable members) and add it to payload (e.g.,
payload["strategy"] = "<module.ClassName>", payload["strategy_config"] =
"<stable-json>") so distinct configured instances do not collide.

Comment on lines 75 to 88
            except Exception as e:
                fut.set_exception(e)
                raise RuntimeError("Failed to start browser") from e
            finally:
                async with LOCK:
                    CREATING.pop(sig, None)

        # Await the shared result (crawler or the same exception)
        return await fut

    except MemoryError:
        raise
    except Exception as e:
        raise RuntimeError(f"Failed to start browser: {e}")
coderabbitai bot commented on the lines above:

⚠️ Potential issue | 🟠 Major

Let cancellations propagate instead of masking them.

Both except Exception handlers convert asyncio.CancelledError into a RuntimeError, so a cancelled request/worker can no longer abort cleanly—it shows up as an internal failure and keeps the pool bookkeeping alive when it should unwind. We need to special-case CancelledError (cancel the shared future if we’re the creator and re-raise immediately) before the generic handlers.

-        if creator:
-            # Only the creator actually creates/starts the crawler
-            try:
+        if creator:
+            # Only the creator actually creates/starts the crawler
+            try:
                 crawler = AsyncWebCrawler(
                     config=cfg, thread_safe=False, crawler_strategy=crawler_strategy
                 )
                 await crawler.start()
                 async with LOCK:
                     POOL[sig] = crawler
                     LAST_USED[sig] = time.time()
                 fut.set_result(crawler)
-            except Exception as e:
+            except asyncio.CancelledError:
+                if not fut.done():
+                    fut.cancel()
+                raise
+            except Exception as e:
                 fut.set_exception(e)
                 raise RuntimeError("Failed to start browser") from e
             finally:
                 async with LOCK:
                     CREATING.pop(sig, None)
 
         # Await the shared result (crawler or the same exception)
         return await fut
 
     except MemoryError:
         raise
+    except asyncio.CancelledError:
+        raise
     except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+        raise RuntimeError("Failed to start browser") from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Original:

            except Exception as e:
                fut.set_exception(e)
                raise RuntimeError("Failed to start browser") from e
            finally:
                async with LOCK:
                    CREATING.pop(sig, None)

        # Await the shared result (crawler or the same exception)
        return await fut

    except MemoryError:
        raise
    except Exception as e:
        raise RuntimeError(f"Failed to start browser: {e}")

Suggested:

        if creator:
            # Only the creator actually creates/starts the crawler
            try:
                crawler = AsyncWebCrawler(
                    config=cfg, thread_safe=False, crawler_strategy=crawler_strategy
                )
                await crawler.start()
                async with LOCK:
                    POOL[sig] = crawler
                    LAST_USED[sig] = time.time()
                fut.set_result(crawler)
            except asyncio.CancelledError:
                if not fut.done():
                    fut.cancel()
                raise
            except Exception as e:
                fut.set_exception(e)
                raise RuntimeError("Failed to start browser") from e
            finally:
                async with LOCK:
                    CREATING.pop(sig, None)

        # Await the shared result (crawler or the same exception)
        return await fut

    except MemoryError:
        raise
    except asyncio.CancelledError:
        raise
    except Exception as e:
        raise RuntimeError("Failed to start browser") from e
🧰 Tools
🪛 Ruff (0.13.1)

77-77: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Do not catch blind exception: Exception

(BLE001)


88-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


88-88: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around lines 75 to 88, the generic except
blocks are converting asyncio.CancelledError into RuntimeError; change this by
adding a specific except asyncio.CancelledError handler before the generic
Exception handlers: if the current task is the creator (i.e., you set fut in
this scope) then cancel the shared future (fut.cancel()) or set its exception
accordingly, ensure CREATING cleanup still runs, and immediately re-raise the
CancelledError so cancellation propagates; then keep the existing MemoryError
and generic Exception handlers after that.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/examples/docker/demo_docker_api.py (1)

1285-1295: Reuse shared helpers for payload logging & output consistency

Every other demo routes through make_request, which already prints the payload, formats the response summary, and centralizes error handling. Dropping to raw client.post here duplicates some of that logic and results in mixed print/console.print output in the demo. Suggest delegating to make_request (and returning early if it yields None) so the PDF flow behaves like the rest.

-    resp = await client.post("/crawl", json=payload)
-    resp.raise_for_status()
-    data = resp.json()
-    print("=== Demo: PDF Crawl ===")
-    print("Success:", data.get("success"))
-    print("Number of results:", len(data.get("results", [])))
-    if data.get("results"):
-        first = data["results"][0]
-        text_snippet = (first.get("text") or "")[:500]
-        print("Extracted text (first 500 chars):")
-        print(text_snippet)
+    results = await make_request(client, "/crawl", payload, "Demo 10: PDF Crawl")
+    if not results:
+        return
+
+    first = results[0]
+    text_snippet = (first.get("text") or "")[:500]
+    console.print("[cyan]Extracted text (first 500 chars):[/]")
+    console.print(text_snippet)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b80f98 and 14994d8.

📒 Files selected for processing (2)
  • deploy/docker/crawler_pool.py (3 hunks)
  • docs/examples/docker/demo_docker_api.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
crawl4ai/async_webcrawler.py (1)
  • AsyncWebCrawler (53-852)
crawl4ai/async_configs.py (1)
  • BrowserConfig (329-621)
deploy/docker/utils.py (1)
  • load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py

40-40: Statement ends with an unnecessary semicolon

Remove unnecessary semicolon

(E703)


43-43: Abstract raise to an inner function

(TRY301)


43-43: Avoid specifying long messages outside the exception class

(TRY003)


43-43: String contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)

🔇 Additional comments (2)
deploy/docker/crawler_pool.py (2)

24-30: Include strategy configuration in the pool signature.

Line 27 only keys on crawler_strategy.__class__.__name__, so two callers using the same strategy class with different parameters will collide and get a crawler that was started with someone else’s settings. That breaks correctness for any strategy with per-instance config (e.g., PDF crawlers with different OCR flags) and will be extremely hard to debug.

Please extend the signature payload to encode both the module-qualified class name and a stable serialization of the strategy’s effective configuration (e.g., via to_dict()/vars() fallback) before hashing.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
-
-    json_payload = json.dumps(payload, sort_keys=True, separators=(",", ":"))
+    if crawler_strategy is not None:
+        strategy_cls = crawler_strategy.__class__
+        payload["strategy"] = f"{strategy_cls.__module__}.{strategy_cls.__qualname__}"
+        if hasattr(crawler_strategy, "to_dict"):
+            payload["strategy_config"] = crawler_strategy.to_dict()
+        elif hasattr(crawler_strategy, "__dict__"):
+            payload["strategy_config"] = {
+                k: v
+                for k, v in vars(crawler_strategy).items()
+                if not callable(v) and not k.startswith("_")
+            }
+        else:
+            payload["strategy_config"] = repr(crawler_strategy)
+
+    json_payload = json.dumps(
+        payload,
+        sort_keys=True,
+        separators=(",", ":"),
+        default=lambda value: repr(value),
+    )

53-59: Guard pool bookkeeping with the lock.

Lines 53-59 mutate POOL/LAST_USED without holding LOCK, so concurrent requests can race the cleanup and either resurrect stale entries or throw KeyError. We already require the lock for all other pool touches—this block needs the same protection.

-    finally:
-        if sig and sig in POOL:
-            LAST_USED[sig] = time.time()
-        else:
-            # If we failed to start the browser, we should remove it from the pool
-            if sig:
-                POOL.pop(sig, None)
-                LAST_USED.pop(sig, None)
+    finally:
+        if sig:
+            async with LOCK:
+                if sig in POOL:
+                    LAST_USED[sig] = time.time()
+                else:
+                    POOL.pop(sig, None)
+                    LAST_USED.pop(sig, None)
