Feat/docker PDF strategy #1518
base: main
Conversation
Walkthrough
Adds PDF-aware crawling: exposes PDF crawler/scraping strategies; the API detects PDF URLs, selects PDF-specific strategies, and encodes PDF content for JSON; the crawler pool becomes strategy-aware and gains lifecycle helpers; adds an async PDF detection utility, demo examples, and end-to-end PDF tests.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Client
participant API as API (/crawl)
participant Utils as utils.is_pdf_url
participant Pool as crawler_pool.get_crawler
participant Crawler as AsyncWebCrawler
participant Encoder as jsonable_encoder
Client->>API: POST /crawl { urls, scraping_strategy? }
API->>Utils: is_pdf_url(urls...)
Utils-->>API: PDFs present? (bool)
alt PDF detected
API->>API: set crawler_strategy = PDFCrawlerStrategy
API->>API: default scraping_strategy = PDFContentScrapingStrategy (if none)
else Not PDF
API->>API: use provided/default non-PDF strategy
end
API->>Pool: get_crawler(browser_config, crawler_strategy?)
Pool-->>API: AsyncWebCrawler instance
API->>Crawler: crawl(urls, scraping_strategy, options)
Crawler-->>API: results (list/dict, PDFs as bytes)
API->>API: base64-encode PDF bytes in results
API->>Encoder: jsonable_encoder(results)
API-->>Client: JSON response
sequenceDiagram
autonumber
participant Client
participant API as API (/stream/crawl)
participant Utils as utils.is_pdf_url
participant Pool as crawler_pool.get_crawler
participant Crawler as AsyncWebCrawler
participant Encoder as jsonable_encoder
Client->>API: POST /stream/crawl { urls, ... }
API->>Utils: is_pdf_url(urls...)
Utils-->>API: PDFs present? (bool)
API->>Pool: get_crawler(browser_config, crawler_strategy?)
Pool-->>API: AsyncWebCrawler
API->>Crawler: stream_crawl(...)
loop per event
Crawler-->>API: partial result
API->>API: base64-encode PDF bytes (if any)
API->>Encoder: jsonable_encoder(partial)
API-->>Client: event chunk
end
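In code, the non-streaming flow above boils down to roughly the following sketch, assembled from names used in this PR (is_pdf_url, get_crawler, PDFCrawlerStrategy, PDFContentScrapingStrategy); the actual handler in deploy/docker/api.py differs in validation, dispatching, and error handling.

import asyncio
from base64 import b64encode

from fastapi.encoders import jsonable_encoder

from crawl4ai import PDFCrawlerStrategy, PDFContentScrapingStrategy
from crawler_pool import get_crawler   # strategy-aware pool from deploy/docker
from utils import is_pdf_url           # async PDF detector from deploy/docker


async def crawl_pdf_aware(urls, browser_config, crawler_config):
    # Detect PDFs up front; the API rejects mixed PDF/non-PDF batches.
    flags = await asyncio.gather(*(is_pdf_url(u) for u in urls))
    if any(flags) and not all(flags):
        raise ValueError("Mix of PDF and non-PDF URLs in one batch")

    crawler_strategy = None
    if flags and all(flags):
        crawler_strategy = PDFCrawlerStrategy()
        # Only default the scraping strategy when the caller did not set one.
        if getattr(crawler_config, "scraping_strategy", None) is None:
            crawler_config.scraping_strategy = PDFContentScrapingStrategy()

    crawler = await get_crawler(browser_config, crawler_strategy)
    results = await crawler.arun_many(urls=urls, config=crawler_config)

    processed = []
    for result in results:
        d = result.model_dump()
        if d.get("pdf") is not None:
            d["pdf"] = b64encode(d["pdf"]).decode("utf-8")  # raw PDF bytes -> JSON-safe
        processed.append(jsonable_encoder(d))
    return {"success": True, "results": processed}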
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
docs/examples/docker/demo_docker_api.py (1)
21-26: Fix duplicate assignments of BASE_URL and SIMPLE_URL. The second assignments override the first and can confuse users.

-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")
-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")
+BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")
@@
-SIMPLE_URL = "https://example.com"  # For demo purposes
-SIMPLE_URL = "https://httpbin.org/html"
+SIMPLE_URL = "https://httpbin.org/html"

deploy/docker/api.py (3)
475-501: Single-URL path returns a non-iterable; serialization shape changed unexpectedly.
- arun returns a single result (not iterable); iterating over it is a bug.
- Wrapping with to_serializable_dict changes the response structure ({"type": "dict", "value": ...}); streaming still returns plain dicts, so the API is inconsistent.
Apply this diff to fix both:

-    results = await partial_func()
+    results = await partial_func()
+    results_list = results if isinstance(results, list) else [results]
@@
-    for result in results:
+    for result in results_list:
         result_dict = result.model_dump()
         # If PDF exists, encode it to base64
         if result_dict.get('pdf') is not None:
             result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
-
-        processed_results.append(to_serializable_dict(result_dict))
+        # Keep response shape consistent with streaming (plain JSON-serializable dict)
+        processed_results.append(jsonable_encoder(result_dict))
559-564: Respect the rate limiter "enabled" flag (parity with non-streaming). The streaming path always enables the rate limiter; the non-streaming path honors the config flag.

-    dispatcher = MemoryAdaptiveDispatcher(
-        memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
-        rate_limiter=RateLimiter(
-            base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
-        )
-    )
+    dispatcher = MemoryAdaptiveDispatcher(
+        memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
+        rate_limiter=RateLimiter(
+            base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
+        ) if config["crawler"]["rate_limiter"]["enabled"] else None
+    )
541-546: Normalize URLs (add scheme) in the streaming path. Non-streaming adds https:// when missing; streaming does not, causing failures.

     browser_config = BrowserConfig.load(browser_config)
@@
     crawler_config = CrawlerRunConfig.load(crawler_config)
     crawler_config.stream = True
+
+    # Normalize URLs to include a scheme (match non-streaming behavior)
+    urls = [('https://' + url) if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")) else url for url in urls]
🧹 Nitpick comments (11)
deploy/docker/crawler_pool.py (4)
29-31: Use a non-deprecated hash and avoid security-linter noise. sha1 is flagged as insecure; even though it is only used as a cache key, prefer sha256 for hygiene.

-    return hashlib.sha1(json_payload.encode()).hexdigest()
+    return hashlib.sha256(json_payload.encode()).hexdigest()
41-47: Normalize message punctuation and avoid duplicate wrapping.
- Replace the EN DASH with a hyphen to avoid linter noise.
- The except MemoryError re-wrap adds a redundant prefix.

-    if psutil.virtual_memory().percent >= MEM_LIMIT:
-        raise MemoryError("RAM pressure – new browser denied")
+    if psutil.virtual_memory().percent >= MEM_LIMIT:
+        raise MemoryError("RAM pressure - new browser denied")
@@
-    except MemoryError as e:
-        raise MemoryError(f"RAM pressure – new browser denied: {e}")
+    except MemoryError as e:
+        raise
37-43: Avoid heavy work under the global lock (scalability). Creating/starting the crawler under LOCK blocks all callers. Use a two-phase approach (check/insert a placeholder under the lock, create/start outside it, finalize under the lock), or store an asyncio.Task/Future in POOL to dedupe concurrent requests for the same signature.
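A rough sketch of the future-based variant (a hypothetical rewrite of get_crawler, reusing this module's POOL/LAST_USED/LOCK names and an assumed _sig helper): the lock only guards dictionary bookkeeping, browser startup happens outside it, and concurrent callers with the same signature share one creation future.

import asyncio
import time
from typing import Dict

from crawl4ai import AsyncWebCrawler

POOL: Dict[str, AsyncWebCrawler] = {}
LAST_USED: Dict[str, float] = {}
CREATING: Dict[str, asyncio.Future] = {}
LOCK = asyncio.Lock()


async def get_crawler_sketch(cfg, crawler_strategy=None) -> AsyncWebCrawler:
    sig = _sig(cfg, crawler_strategy)  # signature helper assumed from this module
    async with LOCK:
        if sig in POOL:
            LAST_USED[sig] = time.time()
            return POOL[sig]
        fut = CREATING.get(sig)
        creator = fut is None
        if creator:
            fut = asyncio.get_running_loop().create_future()
            CREATING[sig] = fut

    if not creator:
        # Another caller is already starting this crawler; share its result.
        return await fut

    try:
        crawler = AsyncWebCrawler(config=cfg, thread_safe=False, crawler_strategy=crawler_strategy)
        await crawler.start()  # heavy work happens outside the lock
        async with LOCK:
            POOL[sig] = crawler
            LAST_USED[sig] = time.time()
        fut.set_result(crawler)
        return crawler
    except Exception as exc:
        fut.set_exception(exc)
        raise
    finally:
        async with LOCK:
            CREATING.pop(sig, None)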
38-38: Remove the stray semicolon. Style/lint nit.

-    LAST_USED[sig] = time.time();
+    LAST_USED[sig] = time.time()

docs/examples/docker/demo_docker_api.py (3)
1289-1296: Align the PDF demo output with the API payload shape. The PDF result typically exposes extracted_content; text may be nested. Use a safe fallback.

-    if data.get("results"):
-        first = data["results"][0]
-        text_snippet = (first.get("text") or "")[:500]
-        print("Extracted text (first 500 chars):")
-        print(text_snippet)
+    if data.get("results"):
+        first = data["results"][0]
+        text = first.get("extracted_content") or first.get("text") or ""
+        if isinstance(text, dict):
+            text = text.get("text") or ""
+        print("Extracted text (first 500 chars):")
+        print((text or "")[:500])
1311-1317: Comment contradicts the code. You're explicitly setting a PDF strategy; remove "Default strategy if not set" from the comment.
30-30: Nit: add a .pdf suffix for clarity (optional). The detector handles content-type; adding .pdf can sometimes avoid an extra HEAD/GET.
tests/docker/test_rest_api_pdf_crawl.py (2)
15-16: Remove the duplicate BASE_URL assignment. The second line overrides the first; keep one.

-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")  # If server is running in Docker, use the host's IP
-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")  # If server is running in dev debug mode
+BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")
99-103: Be robust to string vs dict extracted_content. The API may return extracted_content as a string; guard before calling .get.

-    extracted_text = result["extracted_content"].get("text", "")
-    assert isinstance(extracted_text, str)
-    assert len(extracted_text) > 0
+    content = result.get("extracted_content")
+    extracted_text = content.get("text", "") if isinstance(content, dict) else (content or "")
+    assert isinstance(extracted_text, str) and len(extracted_text) > 0
447-454: Default PDF scraping strategies differ between non-stream and stream. Non-stream sets extract_images=False; streaming sets True. If unintentional, align the defaults.
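If alignment is the intent, one low-effort option is a single default factory used by both handlers; a sketch, assuming PDFContentScrapingStrategy accepts the keyword arguments shown in the demo payloads:

from crawl4ai import PDFContentScrapingStrategy


def default_pdf_scraping_strategy() -> PDFContentScrapingStrategy:
    # One place to define the server-side PDF scraping defaults, shared by
    # handle_crawl_request and handle_stream_crawl_request.
    return PDFContentScrapingStrategy(extract_images=False, save_images_locally=False)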
492-500: Consider using jsonable_encoder for nested non-JSON types. If model_dump contains datetimes/enums, jsonable_encoder ensures proper conversion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- crawl4ai/__init__.py (2 hunks)
- deploy/docker/api.py (7 hunks)
- deploy/docker/crawler_pool.py (3 hunks)
- deploy/docker/utils.py (2 hunks)
- docs/examples/docker/demo_docker_api.py (3 hunks)
- tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
deploy/docker/api.py (4)
- deploy/docker/utils.py (1): is_pdf_url (130-155)
- crawl4ai/processors/pdf/__init__.py (2): PDFCrawlerStrategy (10-29), PDFContentScrapingStrategy (31-192)
- crawl4ai/async_configs.py (1): to_serializable_dict (44-115)
- deploy/docker/crawler_pool.py (1): get_crawler (33-57)
crawl4ai/__init__.py (1)
- crawl4ai/processors/pdf/__init__.py (2): PDFCrawlerStrategy (10-29), PDFContentScrapingStrategy (31-192)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
tests/docker/test_rest_api_pdf_crawl.py
- 28-28: Consider moving this statement to an else block (TRY300)
deploy/docker/utils.py
- 152-152: Do not catch blind exception: Exception (BLE001)
deploy/docker/crawler_pool.py
- 30-30: Probable use of insecure hash function in hashlib: sha1 (S324)
- 38-38: Statement ends with an unnecessary semicolon; remove it (E703)
- 41-41: Abstract raise to an inner function (TRY301)
- 41-41: Avoid specifying long messages outside the exception class (TRY003)
- 41-41: String contains ambiguous – (EN DASH); did you mean - (HYPHEN-MINUS)? (RUF001)
🔇 Additional comments (2)
crawl4ai/__init__.py (2)
14-15: LGTM: publicly exporting PDF strategies. The export additions look correct and match usages in the API and demos.
Please confirm crawl4ai/processors/pdf/__init__.py is part of the installable package (MANIFEST/pyproject) so these symbols are always available.
134-136: LGTM: __all__ includes the new PDF symbols. Consistent with the imports.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
tests/docker/test_rest_api_pdf_crawl.py
- 28-28: Consider moving this statement to an else block (TRY300)
Actionable comments posted: 4
🧹 Nitpick comments (7)
deploy/docker/utils.py (1)
121-129: Avoid a blind exception catch in verify_email_domain. Catching Exception hides real issues; narrow to dns.resolver.* and log.

-    except Exception as e:
-        return False
+    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN, dns.resolver.NoNameservers, dns.exception.DNSException):
+        return False

docs/examples/docker/demo_docker_api.py (1)
1287-1297: Use the rich Console for output to stay consistent with the rest of the demo. Replace the print calls with console.print panels/snippets.
tests/docker/test_rest_api_pdf_crawl.py (1)
101-116: Add a test for mixed URL types (should return 400). The API now rejects mixed PDF/non-PDF batches; add coverage.
Proposed test:

@pytest.mark.asyncio
async def test_pdf_mixed_urls_rejected(async_client: httpx.AsyncClient):
    payload = {
        "urls": [PDF_TEST_URL, "https://example.com"],
        "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
        "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "BYPASS"}}
    }
    resp = await async_client.post("/crawl", json=payload)
    assert resp.status_code == 400
    detail = resp.json().get("detail", "")
    assert "Mix of PDF and non-PDF URLs" in detail

deploy/docker/crawler_pool.py (2)
18-31: Use a fully-qualified strategy identifier in the pool signature. The class name alone can collide across modules; hash module + qualname.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        cls = crawler_strategy.__class__
+        payload["strategy"] = f"{cls.__module__}.{cls.__qualname__}"
33-50: Exception-handling nits: remove the unused variable, preserve the traceback. Tidy per Ruff BLE001/B904 and avoid semicolons.

-        POOL[sig] = crawler; LAST_USED[sig] = time.time()
+        POOL[sig] = crawler
+        LAST_USED[sig] = time.time()
         return crawler
-    except MemoryError as e:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise
+    except Exception as e:
+        raise RuntimeError(f"Failed to start browser: {e}") from e
443-452: Compute is_pdf with all(...) for clarity and correctness. You already reject mixed sets; using all() communicates intent and avoids confusion.

-    is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
-    is_pdf = any(is_pdf_flags)
+    is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
+    is_pdf = all(is_pdf_flags)
     if any(is_pdf_flags) and not all(is_pdf_flags):
554-566: Apply all(...) in the streaming path too, to mirror the non-streaming logic. Keeps behavior consistent.

-    is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
-    is_pdf = any(is_pdf_flags)
+    is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
+    is_pdf = all(is_pdf_flags)
     if any(is_pdf_flags) and not all(is_pdf_flags):
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- deploy/docker/api.py (7 hunks)
- deploy/docker/crawler_pool.py (2 hunks)
- deploy/docker/utils.py (2 hunks)
- docs/examples/docker/demo_docker_api.py (3 hunks)
- tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/docker/api.py (3)
- deploy/docker/utils.py (1): is_pdf_url (130-157)
- crawl4ai/processors/pdf/__init__.py (2): PDFCrawlerStrategy (10-29), PDFContentScrapingStrategy (31-192)
- deploy/docker/crawler_pool.py (1): get_crawler (33-59)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/api.py
- 446-449: Abstract raise to an inner function (TRY301)
- 560-563: Abstract raise to an inner function (TRY301)
deploy/docker/crawler_pool.py
- 42-42: Abstract raise to an inner function (TRY301)
- 42-42: Avoid specifying long messages outside the exception class (TRY003)
- 45-45: Multiple statements on one line (semicolon) (E702)
- 47-47: Local variable e is assigned to but never used; remove the assignment (F841)
- 49-49: Do not catch blind exception: Exception (BLE001)
- 50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 50-50: Avoid specifying long messages outside the exception class (TRY003)
tests/docker/test_rest_api_pdf_crawl.py
- 30-30: Consider moving this statement to an else block (TRY300)
🔇 Additional comments (1)
deploy/docker/api.py (1)
498-508: LGTM: JSON-safe result shaping with base64 PDF bytes. Good normalization; the use of jsonable_encoder maintains streaming/non-streaming parity.
Please confirm no nested Pydantic models remain in result_dict that would need model_dump before encoding.
Actionable comments posted: 0
🧹 Nitpick comments (6)
docs/examples/docker/demo_docker_api.py (6)
28-28: Make PDF_URL configurable via env for reliability. Allow overriding the demo PDF URL without code edits (useful behind firewalls or when the remote file changes).
Apply:

-PDF_URL = "https://arxiv.org/pdf/2310.06825.pdf"  # For PDF demo
+PDF_URL = os.getenv("PDF_URL", "https://arxiv.org/pdf/2310.06825.pdf")  # For PDF demo
1265-1283: Optionally showcase backend auto-detection by omitting scraping_strategy. Since the API now selects PDF strategies for .pdf URLs when no custom strategy is provided, add a variant (or toggle) that relies on auto-detection to demonstrate the new default behavior.
Apply:

     "crawler_config": {
         "type": "CrawlerRunConfig",
         "params": {
             "cache_mode": "BYPASS",
-            "scraping_strategy": {
-                "type": "PDFContentScrapingStrategy",
-                "params": {
-                    "extract_images": False,
-                    "save_images_locally": False,
-                    "batch_size": 2
-                }
-            }
         }
     }
1285-1298: Reuse make_request for consistency and richer error output. Unifies logging, timing, and error handling with the other demos.
Apply:

-    resp = await client.post("/crawl", json=payload)
-    resp.raise_for_status()
-    data = resp.json()
-    print("=== Demo: PDF Crawl ===")
-    print("Success:", data.get("success"))
-    print("Number of results:", len(data.get("results", [])))
-    if data.get("results"):
-        first = data["results"][0]
-        text = first.get("extracted_content") or first.get("text") or ""
-        if isinstance(text, dict):
-            text = text.get("text") or ""
-        print("Extracted text (first 500 chars):")
-        print((text or "")[:500])
+    results = await make_request(client, "/crawl", payload, "Demo PDF: Crawl")
+    if results:
+        first = results[0]
+        text = first.get("extracted_content") or first.get("text") or ""
+        if isinstance(text, dict):
+            text = text.get("text") or ""
+        console.print(Panel((text or "")[:500], title="Extracted text (first 500 chars)",
+                            border_style="cyan", expand=False))
1306-1324: Optionally rely on auto-detection in the streaming demo too. Mirror the non-streaming suggestion to demonstrate default PDF handling.
Apply:

     "params": {
         "stream": True,
         "cache_mode": "BYPASS",
-        "scraping_strategy": {
-            "type": "PDFContentScrapingStrategy",
-            "params": {
-                "extract_images": False,
-                "save_images_locally": False,
-                "batch_size": 2
-            }
-        }
     }
1326-1331: Enhance streaming UX: show text chunks when present. Extend stream_request to print short snippets when a streamed line includes extracted text (common for PDFs).
You can update stream_request (outside this hunk) along these lines:

# inside the `async for line in response.aiter_lines()` loop, after parsing `data`
if isinstance(data, dict):
    text_chunk = data.get("extracted_content") or data.get("text")
    if isinstance(text_chunk, dict):
        text_chunk = text_chunk.get("text")
    if isinstance(text_chunk, str) and text_chunk.strip():
        snippet = text_chunk[:200].replace("\n", " ")
        console.print(f"  [cyan]Text chunk:[/] {snippet}...")
1366-1369: Gate the PDF demos with an env toggle instead of comments. Keeps them off by default but easy to enable in CI/local runs.
Apply:

-    # await demo_pdf_crawl_stream(client)
-    # await demo_pdf_crawl(client)
+    if os.getenv("RUN_PDF_DEMOS") == "1":
+        await demo_pdf_crawl_stream(client)
+        await demo_pdf_crawl(client)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deploy/docker/utils.py (2 hunks)
- docs/examples/docker/demo_docker_api.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/docker/utils.py
🔇 Additional comments (3)
docs/examples/docker/demo_docker_api.py (3)
1265-1283: browser_config now present; resolves the prior validation issue. This addresses the earlier 422 risk and aligns with other demos.
1306-1324: browser_config present in the streaming payload; resolved. This fixes the previously flagged schema mismatch for streaming.
1265-1283: Ignore the addition of crawler_strategy in the demo payload. The REST handler doesn't accept a crawler_strategy field; it is determined server-side from the PDF flag. Likely an incorrect or invalid review comment.
Actionable comments posted: 2
🧹 Nitpick comments (6)
deploy/docker/crawler_pool.py (6)
18-30: Make the strategy fingerprint more robust to avoid pool key collisions. Hashing only the class name may collide across differently configured instances of the same strategy. Prefer a stable per-strategy signature or config when available, with the class name as a fallback.
Apply this diff:

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        # Prefer a stable, user-provided signature/config if present
+        strat_sig = getattr(crawler_strategy, "signature", None)
+        if callable(strat_sig):
+            try:
+                strat_sig = strat_sig()
+            except Exception:
+                strat_sig = None
+        if strat_sig is None:
+            strat_cfg = getattr(crawler_strategy, "to_dict", None)
+            if callable(strat_cfg):
+                with suppress(Exception):
+                    strat_sig = strat_cfg()
+        payload["strategy"] = strat_sig or crawler_strategy.__class__.__name__
47-51: Tighten exception handling and preserve the original cause. Remove the unused variable, avoid the blind catch if possible, and chain the original error.
Apply this diff:

-    except MemoryError as e:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise
+    except Exception as e:
+        # Consider narrowing the exception types if feasible.
+        raise RuntimeError("Failed to start browser") from e
45-45: Split the statements; fix Ruff E702. Avoid multiple statements on one line.
Apply this diff:

-    POOL[sig] = crawler; LAST_USED[sig] = time.time()
+    POOL[sig] = crawler
+    LAST_USED[sig] = time.time()
60-64: Close crawlers outside the lock to avoid blocking get_crawler. Awaiting .close() while holding LOCK can stall all callers.
Apply this diff:

-async def close_all():
-    async with LOCK:
-        await asyncio.gather(*(c.close() for c in POOL.values()), return_exceptions=True)
-        POOL.clear(); LAST_USED.clear()
+async def close_all():
+    async with LOCK:
+        crawlers = list(POOL.values())
+        POOL.clear()
+        LAST_USED.clear()
+    await asyncio.gather(*(c.close() for c in crawlers), return_exceptions=True)
65-73: Avoid awaiting close() inside the janitor's critical section. Collect stale entries under LOCK, release it for the close awaits, then clean the maps under LOCK. Also fix the semicolons.
Apply this diff:

 async def janitor():
     while True:
         await asyncio.sleep(60)
         now = time.time()
-        async with LOCK:
-            for sig, crawler in list(POOL.items()):
-                if now - LAST_USED[sig] > IDLE_TTL:
-                    with suppress(Exception): await crawler.close()
-                    POOL.pop(sig, None); LAST_USED.pop(sig, None)
+        to_close = []
+        async with LOCK:
+            for sig, crawler in list(POOL.items()):
+                if now - LAST_USED.get(sig, 0) > IDLE_TTL:
+                    to_close.append((sig, crawler))
+        for sig, crawler in to_close:
+            with suppress(Exception):
+                await crawler.close()
+        async with LOCK:
+            for sig, _ in to_close:
+                POOL.pop(sig, None)
+                LAST_USED.pop(sig, None)
33-33: Type hint crawler_strategy to the expected protocol/type. Optional[object] is too broad; prefer the concrete AsyncCrawlerStrategy (or a minimal Protocol exposing signature()/to_dict()) to improve correctness and editor support.
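One possible shape for that hint; the Protocol below is a hypothetical name, only useful if importing the concrete AsyncCrawlerStrategy into this module is undesirable:

from typing import Optional, Protocol, runtime_checkable

from crawl4ai import AsyncWebCrawler, BrowserConfig


@runtime_checkable
class PoolableStrategy(Protocol):
    """Minimal surface the pool relies on when fingerprinting a strategy (illustrative)."""

    def to_dict(self) -> dict: ...


async def get_crawler(
    cfg: BrowserConfig,
    crawler_strategy: Optional[PoolableStrategy] = None,
) -> AsyncWebCrawler:
    ...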
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py
- 42-42: Abstract raise to an inner function (TRY301)
- 42-42: Avoid specifying long messages outside the exception class (TRY003)
- 45-45: Multiple statements on one line (semicolon) (E702)
- 47-47: Local variable e is assigned to but never used; remove the assignment (F841)
- 49-49: Do not catch blind exception: Exception (BLE001)
- 50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 50-50: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (1)
deploy/docker/crawler_pool.py (1)
52-59: Good fix for the prior UnboundLocal/locking concerns. Initializing sig and guarding POOL/LAST_USED updates under LOCK resolves the earlier crash/race noted in the previous review.
from typing import Dict, Optional
from crawl4ai import AsyncWebCrawler, BrowserConfig
from typing import Dict
from utils import load_config
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Verify import path reliability for utils.
The absolute import from utils import load_config may break if deploy/docker isn't on sys.path. Prefer a package-relative import, or confirm deploy/docker is a package with __init__.py and that the entrypoint runs from this directory.
Run:
🏁 Script executed:
#!/bin/bash
# Check if deploy/docker is a package and whether utils is importable.
set -euo pipefail
if [ -f "deploy/docker/__init__.py" ]; then
echo "deploy/docker is a package."
else
echo "deploy/docker is NOT a package; relative imports won't work."
fi
echo "Listing files next to crawler_pool.py:"
ls -la deploy/docker || true
echo "Searching for demo entrypoint:"
fd -a --exact-path demo_docker_api.py || true
Length of output: 1558
Convert the utils import to package-relative
- Add an empty deploy/docker/__init__.py to make it a package.
- In deploy/docker/crawler_pool.py, change from utils import load_config to from .utils import load_config.
🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around line 6, the import uses a top-level
module import "from utils import load_config" which breaks package-relative
resolution; add an empty deploy/docker/__init__.py to make the directory a
package and change the import to use a relative import (from .utils import
load_config) so the module is resolved within the package namespace.
Actionable comments posted: 1
🧹 Nitpick comments (4)
deploy/docker/crawler_pool.py (4)
18-30: The pool key should include the strategy module and config, not only the class name. The class name alone can collide; different strategy instances/configs will be coalesced.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        strategy_cls = crawler_strategy.__class__
+        payload["strategy"] = f"{strategy_cls.__module__}.{strategy_cls.__name__}"
+        # Include config when available to avoid reusing incompatible crawlers
+        if hasattr(crawler_strategy, "to_dict"):
+            payload["strategy_config"] = crawler_strategy.to_dict()
+        else:
+            payload["strategy_repr"] = repr(crawler_strategy)
76-79: Avoid double-wrapping exceptions; preserve the original traceback. The outer catch re-wraps errors already handled inside; also flagged by Ruff (BLE001/B904/TRY003).

-    except MemoryError:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise
91-95: Don't await close() while holding the pool lock. Release the lock before awaiting to avoid head-of-line blocking of other callers.

 async def close_all():
-    async with LOCK:
-        await asyncio.gather(*(c.close() for c in POOL.values()), return_exceptions=True)
-        POOL.clear(); LAST_USED.clear()
+    async with LOCK:
+        crawlers = list(POOL.values())
+        POOL.clear(); LAST_USED.clear()
+    await asyncio.gather(*(c.close() for c in crawlers), return_exceptions=True)
96-105: The janitor awaits under LOCK; restructure to close outside the lock. This prevents blocking get_crawler and other pool operations during close.

 async def janitor():
     while True:
         await asyncio.sleep(60)
         now = time.time()
-        async with LOCK:
-            for sig, crawler in list(POOL.items()):
-                if now - LAST_USED[sig] > IDLE_TTL:
-                    with suppress(Exception): await crawler.close()
-                    POOL.pop(sig, None); LAST_USED.pop(sig, None)
+        async with LOCK:
+            expired = [(sig, POOL[sig]) for sig, ts in list(LAST_USED.items()) if now - ts > IDLE_TTL]
+        for sig, crawler in expired:
+            with suppress(Exception):
+                await crawler.close()
+            async with LOCK:
+                POOL.pop(sig, None); LAST_USED.pop(sig, None)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (2)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py
- 51-51: Abstract raise to an inner function (TRY301)
- 51-51: Avoid specifying long messages outside the exception class (TRY003)
- 68-68: Avoid specifying long messages outside the exception class (TRY003)
- 78-78: Do not catch blind exception: Exception (BLE001)
- 79-79: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 79-79: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (2)
deploy/docker/crawler_pool.py (2)
6-6: Use a package-relative import to avoid runtime import errors. Switch to a relative import so this works when deploy/docker is a package.

-from utils import load_config
+from .utils import load_config

81-89: Good: pool mutations guarded with LOCK in finally. Prevents races when updating LAST_USED or cleaning up after failures.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py
- 59-59: Abstract raise to an inner function (TRY301)
- 59-59: Avoid specifying long messages outside the exception class (TRY003)
- 59-59: String contains ambiguous – (EN DASH); did you mean - (HYPHEN-MINUS)? (RUF001)
- 77-77: Avoid specifying long messages outside the exception class (TRY003)
- 87-87: Do not catch blind exception: Exception (BLE001)
- 88-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 88-88: Avoid specifying long messages outside the exception class (TRY003)
    if crawler_strategy is not None:
        payload["strategy"] = crawler_strategy.__class__.__name__
Include strategy configuration in the pool signature.
Keying pooled crawlers only by crawler_strategy.__class__.__name__ collapses all instances of the same strategy class into a single slot. If two callers supply the same strategy class with different constructor arguments (e.g., PDFCrawlerStrategy(ocr=True) vs PDFCrawlerStrategy(ocr=False)), the second caller will be handed the already-started crawler that still holds the first instance, so their requested configuration is silently ignored. That is a functional bug for any strategy with mutable/per-instance configuration. Please extend the signature to encode the strategy's effective settings (e.g., the module-qualified name plus a stable serialization of its config) so distinct instances don't collide.
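A compact sketch of one way to build such a key (sha256 over a sorted JSON payload, as suggested earlier in this review; the to_dict() calls are assumptions about the config and strategy APIs):

import hashlib
import json


def _sig(cfg, crawler_strategy=None) -> str:
    payload = cfg.to_dict() if hasattr(cfg, "to_dict") else dict(vars(cfg))
    if crawler_strategy is not None:
        cls = crawler_strategy.__class__
        payload["strategy"] = f"{cls.__module__}.{cls.__qualname__}"
        to_dict = getattr(crawler_strategy, "to_dict", None)
        payload["strategy_config"] = to_dict() if callable(to_dict) else repr(crawler_strategy)
    # default=repr keeps non-JSON values (enums, paths) from breaking the dump
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=repr)
    return hashlib.sha256(blob.encode()).hexdigest()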
🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around lines 25 to 27, the pool signature only
keys strategies by crawler_strategy.__class__.__name__, causing different
instances of the same class with different constructor arguments to collide; fix
this by including a stable, deterministic encoding of the instance configuration
in the payload key: derive the module-qualified class name (e.g.,
strategy.__class__.__module__ + "." + strategy.__class__.__name__) and a
serializable representation of the strategy's effective settings (prefer a
to_dict()/asdict() method on the strategy or fall back to
vars()/inspect.getfullargspec/init args), then JSON-serialize that config with
sort_keys=True (and avoid non-serializable members) and add it to payload (e.g.,
payload["strategy"] = "<module.ClassName>", payload["strategy_config"] =
"<stable-json>") so distinct configured instances do not collide.
deploy/docker/crawler_pool.py
Outdated
        except Exception as e:
            fut.set_exception(e)
            raise RuntimeError("Failed to start browser") from e
        finally:
            async with LOCK:
                CREATING.pop(sig, None)

        # Await the shared result (crawler or the same exception)
        return await fut

    except MemoryError:
        raise
    except Exception as e:
        raise RuntimeError(f"Failed to start browser: {e}")
Let cancellations propagate instead of masking them.
Both except Exception handlers convert asyncio.CancelledError into a RuntimeError, so a cancelled request/worker can no longer abort cleanly: it shows up as an internal failure and keeps the pool bookkeeping alive when it should unwind. We need to special-case CancelledError (cancel the shared future if we're the creator and re-raise immediately) before the generic handlers.
- if creator:
- # Only the creator actually creates/starts the crawler
- try:
+ if creator:
+ # Only the creator actually creates/starts the crawler
+ try:
crawler = AsyncWebCrawler(
config=cfg, thread_safe=False, crawler_strategy=crawler_strategy
)
await crawler.start()
async with LOCK:
POOL[sig] = crawler
LAST_USED[sig] = time.time()
fut.set_result(crawler)
- except Exception as e:
+ except asyncio.CancelledError:
+ if not fut.done():
+ fut.cancel()
+ raise
+ except Exception as e:
fut.set_exception(e)
raise RuntimeError("Failed to start browser") from e
finally:
async with LOCK:
CREATING.pop(sig, None)
# Await the shared result (crawler or the same exception)
return await fut
except MemoryError:
raise
+ except asyncio.CancelledError:
+ raise
except Exception as e:
- raise RuntimeError(f"Failed to start browser: {e}")
+ raise RuntimeError("Failed to start browser") from e
🧰 Tools
🪛 Ruff (0.13.1)
- 77-77: Avoid specifying long messages outside the exception class (TRY003)
- 87-87: Do not catch blind exception: Exception (BLE001)
- 88-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 88-88: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around lines 75 to 88, the generic except
blocks are converting asyncio.CancelledError into RuntimeError; change this by
adding a specific except asyncio.CancelledError handler before the generic
Exception handlers: if the current task is the creator (i.e., you set fut in
this scope) then cancel the shared future (fut.cancel()) or set its exception
accordingly, ensure CREATING cleanup still runs, and immediately re-raise the
CancelledError so cancellation propagates; then keep the existing MemoryError
and generic Exception handlers after that.
This reverts commit bc3b3d1.
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/examples/docker/demo_docker_api.py (1)
1285-1295: Reuse shared helpers for payload logging & output consistency. Every other demo routes through make_request, which already prints the payload, formats the response summary, and centralizes error handling. Dropping to raw client.post here duplicates some of that logic and results in mixed console.print output in the demo. Suggest delegating to make_request (and returning early if it yields None) so the PDF flow behaves like the rest.

-    resp = await client.post("/crawl", json=payload)
-    resp.raise_for_status()
-    data = resp.json()
-    print("=== Demo: PDF Crawl ===")
-    print("Success:", data.get("success"))
-    print("Number of results:", len(data.get("results", [])))
-    if data.get("results"):
-        first = data["results"][0]
-        text_snippet = (first.get("text") or "")[:500]
-        print("Extracted text (first 500 chars):")
-        print(text_snippet)
+    results = await make_request(client, "/crawl", payload, "Demo 10: PDF Crawl")
+    if not results:
+        return
+
+    first = results[0]
+    text_snippet = (first.get("text") or "")[:500]
+    console.print("[cyan]Extracted text (first 500 chars):[/]")
+    console.print(text_snippet)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deploy/docker/crawler_pool.py (3 hunks)
- docs/examples/docker/demo_docker_api.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py
- 40-40: Statement ends with an unnecessary semicolon; remove it (E703)
- 43-43: Abstract raise to an inner function (TRY301)
- 43-43: Avoid specifying long messages outside the exception class (TRY003)
- 43-43: String contains ambiguous – (EN DASH); did you mean - (HYPHEN-MINUS)? (RUF001)
🔇 Additional comments (2)
deploy/docker/crawler_pool.py (2)
24-30: Include strategy configuration in the pool signature. Line 27 only keys on crawler_strategy.__class__.__name__, so two callers using the same strategy class with different parameters will collide and get a crawler that was started with someone else's settings. That breaks correctness for any strategy with per-instance config (e.g., PDF crawlers with different OCR flags) and will be extremely hard to debug. Please extend the signature payload to encode both the module-qualified class name and a stable serialization of the strategy's effective configuration (e.g., via a to_dict()/vars() fallback) before hashing.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
-
-    json_payload = json.dumps(payload, sort_keys=True, separators=(",", ":"))
+    if crawler_strategy is not None:
+        strategy_cls = crawler_strategy.__class__
+        payload["strategy"] = f"{strategy_cls.__module__}.{strategy_cls.__qualname__}"
+        if hasattr(crawler_strategy, "to_dict"):
+            payload["strategy_config"] = crawler_strategy.to_dict()
+        elif hasattr(crawler_strategy, "__dict__"):
+            payload["strategy_config"] = {
+                k: v
+                for k, v in vars(crawler_strategy).items()
+                if not callable(v) and not k.startswith("_")
+            }
+        else:
+            payload["strategy_config"] = repr(crawler_strategy)
+
+    json_payload = json.dumps(
+        payload,
+        sort_keys=True,
+        separators=(",", ":"),
+        default=lambda value: repr(value),
+    )
53-59: Guard pool bookkeeping with the lock. Lines 53-59 mutate POOL/LAST_USED without holding LOCK, so concurrent requests can race the cleanup and either resurrect stale entries or throw KeyError. We already require the lock for all other pool touches; this block needs the same protection.

-    finally:
-        if sig and sig in POOL:
-            LAST_USED[sig] = time.time()
-        else:
-            # If we failed to start the browser, we should remove it from the pool
-            if sig:
-                POOL.pop(sig, None)
-                LAST_USED.pop(sig, None)
+    finally:
+        if sig:
+            async with LOCK:
+                if sig in POOL:
+                    LAST_USED[sig] = time.time()
+                else:
+                    POOL.pop(sig, None)
+                    LAST_USED.pop(sig, None)
Summary
This PR updates the crawl request handling to correctly apply the PDFCrawlerStrategy and PDFContentScrapingStrategy only when the URLs point to PDF files. Previously, all URLs were treated as web pages by default, which caused PDFs to be misprocessed.
Fixes issues where PDFs were downloaded but not processed correctly, e.g., missing content extraction.
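For readers skimming the diff, a minimal sketch of what such a detector can look like (an illustration only, not the exact utils.is_pdf_url added by this PR): check the URL suffix first, then fall back to a HEAD request for the Content-Type.

import httpx


async def is_pdf_url_sketch(url: str) -> bool:
    """Best-effort async check for whether a URL points at a PDF."""
    if url.lower().split("?", 1)[0].endswith(".pdf"):
        return True
    try:
        async with httpx.AsyncClient(follow_redirects=True, timeout=10.0) as client:
            resp = await client.head(url)
        return "application/pdf" in resp.headers.get("content-type", "").lower()
    except httpx.HTTPError:
        # If the probe fails, treat the URL as a regular web page.
        return False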
List of files changed and why
- api.py – Modified handle_crawl_request and handle_stream_crawl_request to dynamically apply PDFCrawlerStrategy and PDFContentScrapingStrategy for PDF URLs.
- __init__.py – Export PDFCrawlerStrategy, PDFContentScrapingStrategy.
- crawler_pool.py – Edit _sig() and get_crawler to handle crawler_strategy.
- demo_docker_api.py – Added examples/demos to illustrate PDF crawling usage.
- utils.py – Added is_pdf_url to be able to update the strategies.

How Has This Been Tested?
Checklist: