Feat/docker PDF strategy #1518
base: main
Conversation
Walkthrough
Adds PDF-aware crawling: exposes PDF crawler/scraping strategies; the API detects PDF URLs, selects PDF-specific strategies, and encodes PDF content for JSON; the crawler pool becomes strategy-aware and gains lifecycle helpers; adds an async PDF detection utility, demo examples, and end-to-end PDF tests.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Client
participant API as API (/crawl)
participant Utils as utils.is_pdf_url
participant Pool as crawler_pool.get_crawler
participant Crawler as AsyncWebCrawler
participant Encoder as jsonable_encoder
Client->>API: POST /crawl { urls, scraping_strategy? }
API->>Utils: is_pdf_url(urls...)
Utils-->>API: PDFs present? (bool)
alt PDF detected
API->>API: set crawler_strategy = PDFCrawlerStrategy
API->>API: default scraping_strategy = PDFContentScrapingStrategy (if none)
else Not PDF
API->>API: use provided/default non-PDF strategy
end
API->>Pool: get_crawler(browser_config, crawler_strategy?)
Pool-->>API: AsyncWebCrawler instance
API->>Crawler: crawl(urls, scraping_strategy, options)
Crawler-->>API: results (list/dict, PDFs as bytes)
API->>API: base64-encode PDF bytes in results
API->>Encoder: jsonable_encoder(results)
API-->>Client: JSON response
sequenceDiagram
autonumber
participant Client
participant API as API (/stream/crawl)
participant Utils as utils.is_pdf_url
participant Pool as crawler_pool.get_crawler
participant Crawler as AsyncWebCrawler
participant Encoder as jsonable_encoder
Client->>API: POST /stream/crawl { urls, ... }
API->>Utils: is_pdf_url(urls...)
Utils-->>API: PDFs present? (bool)
API->>Pool: get_crawler(browser_config, crawler_strategy?)
Pool-->>API: AsyncWebCrawler
API->>Crawler: stream_crawl(...)
loop per event
Crawler-->>API: partial result
API->>API: base64-encode PDF bytes (if any)
API->>Encoder: jsonable_encoder(partial)
API-->>Client: event chunk
end
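In code, the non-streaming flow above boils down to roughly the following sketch, assembled from names used in this PR (is_pdf_url, get_crawler, PDFCrawlerStrategy, PDFContentScrapingStrategy); the actual handler in deploy/docker/api.py differs in validation, dispatching, and error handling.

import asyncio
from base64 import b64encode

from fastapi.encoders import jsonable_encoder

from crawl4ai import PDFCrawlerStrategy, PDFContentScrapingStrategy
from crawler_pool import get_crawler   # strategy-aware pool from deploy/docker
from utils import is_pdf_url           # async PDF detector from deploy/docker


async def crawl_pdf_aware(urls, browser_config, crawler_config):
    # Detect PDFs up front; the API rejects mixed PDF/non-PDF batches.
    flags = await asyncio.gather(*(is_pdf_url(u) for u in urls))
    if any(flags) and not all(flags):
        raise ValueError("Mix of PDF and non-PDF URLs in one batch")

    crawler_strategy = None
    if flags and all(flags):
        crawler_strategy = PDFCrawlerStrategy()
        # Only default the scraping strategy when the caller did not set one.
        if getattr(crawler_config, "scraping_strategy", None) is None:
            crawler_config.scraping_strategy = PDFContentScrapingStrategy()

    crawler = await get_crawler(browser_config, crawler_strategy)
    results = await crawler.arun_many(urls=urls, config=crawler_config)

    processed = []
    for result in results:
        d = result.model_dump()
        if d.get("pdf") is not None:
            d["pdf"] = b64encode(d["pdf"]).decode("utf-8")  # raw PDF bytes -> JSON-safe
        processed.append(jsonable_encoder(d))
    return {"success": True, "results": processed}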
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
docs/examples/docker/demo_docker_api.py (1)
21-26: Fix duplicate assignments of BASE_URL and SIMPLE_URL. The second assignments override the first and can confuse users.

-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")
-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")
+BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")
@@
-SIMPLE_URL = "https://example.com"  # For demo purposes
-SIMPLE_URL = "https://httpbin.org/html"
+SIMPLE_URL = "https://httpbin.org/html"

deploy/docker/api.py (3)
475-501: Single-URL path returns a non-iterable; serialization shape changed unexpectedly.
- arun returns a single result (not iterable); iterating over it is a bug.
- Wrapping with to_serializable_dict changes the response structure ({"type": "dict", "value": ...}); streaming still returns plain dicts, so the API is inconsistent.
Apply this diff to fix both:

-    results = await partial_func()
+    results = await partial_func()
+    results_list = results if isinstance(results, list) else [results]
@@
-    for result in results:
+    for result in results_list:
         result_dict = result.model_dump()
         # If PDF exists, encode it to base64
         if result_dict.get('pdf') is not None:
             result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
-
-        processed_results.append(to_serializable_dict(result_dict))
+        # Keep response shape consistent with streaming (plain JSON-serializable dict)
+        processed_results.append(jsonable_encoder(result_dict))
559-564: Respect the rate limiter "enabled" flag (parity with non-streaming). The streaming path always enables the rate limiter; the non-streaming path honors the config flag.

-    dispatcher = MemoryAdaptiveDispatcher(
-        memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
-        rate_limiter=RateLimiter(
-            base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
-        )
-    )
+    dispatcher = MemoryAdaptiveDispatcher(
+        memory_threshold_percent=config["crawler"]["memory_threshold_percent"],
+        rate_limiter=RateLimiter(
+            base_delay=tuple(config["crawler"]["rate_limiter"]["base_delay"])
+        ) if config["crawler"]["rate_limiter"]["enabled"] else None
+    )
541-546: Normalize URLs (add scheme) in the streaming path. Non-streaming adds https:// when missing; streaming does not, causing failures.

     browser_config = BrowserConfig.load(browser_config)
@@
     crawler_config = CrawlerRunConfig.load(crawler_config)
     crawler_config.stream = True
+
+    # Normalize URLs to include a scheme (match non-streaming behavior)
+    urls = [('https://' + url) if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")) else url for url in urls]
🧹 Nitpick comments (11)
deploy/docker/crawler_pool.py (4)
29-31: Use a non-deprecated hash and avoid security-linter noise. sha1 is flagged as insecure; even though it is only used as a cache key, prefer sha256 for hygiene.

-    return hashlib.sha1(json_payload.encode()).hexdigest()
+    return hashlib.sha256(json_payload.encode()).hexdigest()
41-47: Normalize message punctuation and avoid duplicate wrapping.
- Replace the EN DASH with a hyphen to avoid linter noise.
- The except MemoryError re-wrap adds a redundant prefix.

-    if psutil.virtual_memory().percent >= MEM_LIMIT:
-        raise MemoryError("RAM pressure – new browser denied")
+    if psutil.virtual_memory().percent >= MEM_LIMIT:
+        raise MemoryError("RAM pressure - new browser denied")
@@
-    except MemoryError as e:
-        raise MemoryError(f"RAM pressure – new browser denied: {e}")
+    except MemoryError as e:
+        raise
37-43: Avoid heavy work under the global lock (scalability). Creating/starting the crawler under LOCK blocks all callers. Use a two-phase approach (check/insert a placeholder under the lock, create/start outside it, finalize under the lock), or store an asyncio.Task/Future in POOL to dedupe concurrent requests for the same signature.
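A rough sketch of the future-based variant (a hypothetical rewrite of get_crawler, reusing this module's POOL/LAST_USED/LOCK names and an assumed _sig helper): the lock only guards dictionary bookkeeping, browser startup happens outside it, and concurrent callers with the same signature share one creation future.

import asyncio
import time
from typing import Dict

from crawl4ai import AsyncWebCrawler

POOL: Dict[str, AsyncWebCrawler] = {}
LAST_USED: Dict[str, float] = {}
CREATING: Dict[str, asyncio.Future] = {}
LOCK = asyncio.Lock()


async def get_crawler_sketch(cfg, crawler_strategy=None) -> AsyncWebCrawler:
    sig = _sig(cfg, crawler_strategy)  # signature helper assumed from this module
    async with LOCK:
        if sig in POOL:
            LAST_USED[sig] = time.time()
            return POOL[sig]
        fut = CREATING.get(sig)
        creator = fut is None
        if creator:
            fut = asyncio.get_running_loop().create_future()
            CREATING[sig] = fut

    if not creator:
        # Another caller is already starting this crawler; share its result.
        return await fut

    try:
        crawler = AsyncWebCrawler(config=cfg, thread_safe=False, crawler_strategy=crawler_strategy)
        await crawler.start()  # heavy work happens outside the lock
        async with LOCK:
            POOL[sig] = crawler
            LAST_USED[sig] = time.time()
        fut.set_result(crawler)
        return crawler
    except Exception as exc:
        fut.set_exception(exc)
        raise
    finally:
        async with LOCK:
            CREATING.pop(sig, None)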
38-38: Remove the stray semicolon. Style/lint nit.

-    LAST_USED[sig] = time.time();
+    LAST_USED[sig] = time.time()

docs/examples/docker/demo_docker_api.py (3)
1289-1296: Align the PDF demo output with the API payload shape. The PDF result typically exposes extracted_content; text may be nested. Use a safe fallback.

-    if data.get("results"):
-        first = data["results"][0]
-        text_snippet = (first.get("text") or "")[:500]
-        print("Extracted text (first 500 chars):")
-        print(text_snippet)
+    if data.get("results"):
+        first = data["results"][0]
+        text = first.get("extracted_content") or first.get("text") or ""
+        if isinstance(text, dict):
+            text = text.get("text") or ""
+        print("Extracted text (first 500 chars):")
+        print((text or "")[:500])
1311-1317: Comment contradicts the code. You're explicitly setting a PDF strategy; remove "Default strategy if not set" from the comment.
30-30: Nit: add a .pdf suffix for clarity (optional). The detector handles content-type; adding .pdf can sometimes avoid an extra HEAD/GET.
tests/docker/test_rest_api_pdf_crawl.py (2)
15-16: Remove the duplicate BASE_URL assignment. The second line overrides the first; keep one.

-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:11235")  # If server is running in Docker, use the host's IP
-BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")  # If server is running in dev debug mode
+BASE_URL = os.getenv("CRAWL4AI_TEST_URL", "http://localhost:8020")
99-103: Be robust to string vs dict extracted_content. The API may return extracted_content as a string; guard before calling .get.

-    extracted_text = result["extracted_content"].get("text", "")
-    assert isinstance(extracted_text, str)
-    assert len(extracted_text) > 0
+    content = result.get("extracted_content")
+    extracted_text = content.get("text", "") if isinstance(content, dict) else (content or "")
+    assert isinstance(extracted_text, str) and len(extracted_text) > 0
447-454: Default PDF scraping strategies differ between non-stream and stream. Non-stream sets extract_images=False; streaming sets True. If unintentional, align the defaults.
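If alignment is the intent, one low-effort option is a single default factory used by both handlers; a sketch, assuming PDFContentScrapingStrategy accepts the keyword arguments shown in the demo payloads:

from crawl4ai import PDFContentScrapingStrategy


def default_pdf_scraping_strategy() -> PDFContentScrapingStrategy:
    # One place to define the server-side PDF scraping defaults, shared by
    # handle_crawl_request and handle_stream_crawl_request.
    return PDFContentScrapingStrategy(extract_images=False, save_images_locally=False)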
492-500: Consider using jsonable_encoder for nested non-JSON types. If model_dump contains datetimes/enums, jsonable_encoder ensures proper conversion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- crawl4ai/__init__.py (2 hunks)
- deploy/docker/api.py (7 hunks)
- deploy/docker/crawler_pool.py (3 hunks)
- deploy/docker/utils.py (2 hunks)
- docs/examples/docker/demo_docker_api.py (3 hunks)
- tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
deploy/docker/api.py (4)
- deploy/docker/utils.py (1): is_pdf_url (130-155)
- crawl4ai/processors/pdf/__init__.py (2): PDFCrawlerStrategy (10-29), PDFContentScrapingStrategy (31-192)
- crawl4ai/async_configs.py (1): to_serializable_dict (44-115)
- deploy/docker/crawler_pool.py (1): get_crawler (33-57)
crawl4ai/__init__.py (1)
- crawl4ai/processors/pdf/__init__.py (2): PDFCrawlerStrategy (10-29), PDFContentScrapingStrategy (31-192)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
tests/docker/test_rest_api_pdf_crawl.py
- 28-28: Consider moving this statement to an else block (TRY300)
deploy/docker/utils.py
- 152-152: Do not catch blind exception: Exception (BLE001)
deploy/docker/crawler_pool.py
- 30-30: Probable use of insecure hash function in hashlib: sha1 (S324)
- 38-38: Statement ends with an unnecessary semicolon; remove it (E703)
- 41-41: Abstract raise to an inner function (TRY301)
- 41-41: Avoid specifying long messages outside the exception class (TRY003)
- 41-41: String contains ambiguous – (EN DASH); did you mean - (HYPHEN-MINUS)? (RUF001)
🔇 Additional comments (2)
crawl4ai/__init__.py (2)
14-15: LGTM: publicly exporting PDF strategies. The export additions look correct and match usages in the API and demos.
Please confirm crawl4ai/processors/pdf/__init__.py is part of the installable package (MANIFEST/pyproject) so these symbols are always available.
134-136: LGTM: __all__ includes the new PDF symbols. Consistent with the imports.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
tests/docker/test_rest_api_pdf_crawl.py
- 28-28: Consider moving this statement to an else block (TRY300)
Actionable comments posted: 4
🧹 Nitpick comments (7)
deploy/docker/utils.py (1)
121-129: Avoid a blind exception catch in verify_email_domain. Catching Exception hides real issues; narrow to dns.resolver.* and log.

-    except Exception as e:
-        return False
+    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN, dns.resolver.NoNameservers, dns.exception.DNSException):
+        return False

docs/examples/docker/demo_docker_api.py (1)
1287-1297: Use the rich Console for output to stay consistent with the rest of the demo. Replace the print calls with console.print panels/snippets.
tests/docker/test_rest_api_pdf_crawl.py (1)
101-116: Add a test for mixed URL types (should return 400). The API now rejects mixed PDF/non-PDF batches; add coverage.
Proposed test:

@pytest.mark.asyncio
async def test_pdf_mixed_urls_rejected(async_client: httpx.AsyncClient):
    payload = {
        "urls": [PDF_TEST_URL, "https://example.com"],
        "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
        "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "BYPASS"}}
    }
    resp = await async_client.post("/crawl", json=payload)
    assert resp.status_code == 400
    detail = resp.json().get("detail", "")
    assert "Mix of PDF and non-PDF URLs" in detail

deploy/docker/crawler_pool.py (2)
18-31: Use a fully-qualified strategy identifier in the pool signature. The class name alone can collide across modules; hash module + qualname.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        cls = crawler_strategy.__class__
+        payload["strategy"] = f"{cls.__module__}.{cls.__qualname__}"
33-50: Exception-handling nits: remove the unused variable, preserve the traceback. Tidy per Ruff BLE001/B904 and avoid semicolons.

-        POOL[sig] = crawler; LAST_USED[sig] = time.time()
+        POOL[sig] = crawler
+        LAST_USED[sig] = time.time()
         return crawler
-    except MemoryError as e:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise
+    except Exception as e:
+        raise RuntimeError(f"Failed to start browser: {e}") from e
443-452: Compute is_pdf with all(...) for clarity and correctness. You already reject mixed sets; using all() communicates intent and avoids confusion.

-    is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
-    is_pdf = any(is_pdf_flags)
+    is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
+    is_pdf = all(is_pdf_flags)
     if any(is_pdf_flags) and not all(is_pdf_flags):
554-566: Apply all(...) in the streaming path too, to mirror the non-streaming logic. Keeps behavior consistent.

-    is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
-    is_pdf = any(is_pdf_flags)
+    is_pdf_flags = await asyncio.gather(*(is_pdf_url(url) for url in urls))
+    is_pdf = all(is_pdf_flags)
     if any(is_pdf_flags) and not all(is_pdf_flags):
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- deploy/docker/api.py (7 hunks)
- deploy/docker/crawler_pool.py (2 hunks)
- deploy/docker/utils.py (2 hunks)
- docs/examples/docker/demo_docker_api.py (3 hunks)
- tests/docker/test_rest_api_pdf_crawl.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/docker/api.py (3)
- deploy/docker/utils.py (1): is_pdf_url (130-157)
- crawl4ai/processors/pdf/__init__.py (2): PDFCrawlerStrategy (10-29), PDFContentScrapingStrategy (31-192)
- deploy/docker/crawler_pool.py (1): get_crawler (33-59)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/api.py
- 446-449: Abstract raise to an inner function (TRY301)
- 560-563: Abstract raise to an inner function (TRY301)
deploy/docker/crawler_pool.py
- 42-42: Abstract raise to an inner function (TRY301)
- 42-42: Avoid specifying long messages outside the exception class (TRY003)
- 45-45: Multiple statements on one line (semicolon) (E702)
- 47-47: Local variable e is assigned to but never used; remove the assignment (F841)
- 49-49: Do not catch blind exception: Exception (BLE001)
- 50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 50-50: Avoid specifying long messages outside the exception class (TRY003)
tests/docker/test_rest_api_pdf_crawl.py
- 30-30: Consider moving this statement to an else block (TRY300)
🔇 Additional comments (1)
deploy/docker/api.py (1)
498-508: LGTM: JSON-safe result shaping with base64 PDF bytes. Good normalization; the use of jsonable_encoder maintains streaming/non-streaming parity.
Please confirm no nested Pydantic models remain in result_dict that would need model_dump before encoding.
Actionable comments posted: 0
🧹 Nitpick comments (6)
docs/examples/docker/demo_docker_api.py (6)
28-28: Make PDF_URL configurable via env for reliability. Allow overriding the demo PDF URL without code edits (useful behind firewalls or when the remote file changes).
Apply:

-PDF_URL = "https://arxiv.org/pdf/2310.06825.pdf"  # For PDF demo
+PDF_URL = os.getenv("PDF_URL", "https://arxiv.org/pdf/2310.06825.pdf")  # For PDF demo
1265-1283: Optionally showcase backend auto-detection by omitting scraping_strategy. Since the API now selects PDF strategies for .pdf URLs when no custom strategy is provided, add a variant (or toggle) that relies on auto-detection to demonstrate the new default behavior.
Apply:

     "crawler_config": {
         "type": "CrawlerRunConfig",
         "params": {
             "cache_mode": "BYPASS",
-            "scraping_strategy": {
-                "type": "PDFContentScrapingStrategy",
-                "params": {
-                    "extract_images": False,
-                    "save_images_locally": False,
-                    "batch_size": 2
-                }
-            }
         }
     }
1285-1298: Reuse make_request for consistency and richer error output. Unifies logging, timing, and error handling with the other demos.
Apply:

-    resp = await client.post("/crawl", json=payload)
-    resp.raise_for_status()
-    data = resp.json()
-    print("=== Demo: PDF Crawl ===")
-    print("Success:", data.get("success"))
-    print("Number of results:", len(data.get("results", [])))
-    if data.get("results"):
-        first = data["results"][0]
-        text = first.get("extracted_content") or first.get("text") or ""
-        if isinstance(text, dict):
-            text = text.get("text") or ""
-        print("Extracted text (first 500 chars):")
-        print((text or "")[:500])
+    results = await make_request(client, "/crawl", payload, "Demo PDF: Crawl")
+    if results:
+        first = results[0]
+        text = first.get("extracted_content") or first.get("text") or ""
+        if isinstance(text, dict):
+            text = text.get("text") or ""
+        console.print(Panel((text or "")[:500], title="Extracted text (first 500 chars)",
+                            border_style="cyan", expand=False))
1306-1324: Optionally rely on auto-detection in the streaming demo too. Mirror the non-streaming suggestion to demonstrate default PDF handling.
Apply:

     "params": {
         "stream": True,
         "cache_mode": "BYPASS",
-        "scraping_strategy": {
-            "type": "PDFContentScrapingStrategy",
-            "params": {
-                "extract_images": False,
-                "save_images_locally": False,
-                "batch_size": 2
-            }
-        }
     }
1326-1331: Enhance streaming UX: show text chunks when present. Extend stream_request to print short snippets when a streamed line includes extracted text (common for PDFs).
You can update stream_request (outside this hunk) along these lines:

# inside the `async for line in response.aiter_lines()` loop, after parsing `data`
if isinstance(data, dict):
    text_chunk = data.get("extracted_content") or data.get("text")
    if isinstance(text_chunk, dict):
        text_chunk = text_chunk.get("text")
    if isinstance(text_chunk, str) and text_chunk.strip():
        snippet = text_chunk[:200].replace("\n", " ")
        console.print(f"  [cyan]Text chunk:[/] {snippet}...")
1366-1369: Gate the PDF demos with an env toggle instead of comments. Keeps them off by default but easy to enable in CI/local runs.
Apply:

-    # await demo_pdf_crawl_stream(client)
-    # await demo_pdf_crawl(client)
+    if os.getenv("RUN_PDF_DEMOS") == "1":
+        await demo_pdf_crawl_stream(client)
+        await demo_pdf_crawl(client)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deploy/docker/utils.py (2 hunks)
- docs/examples/docker/demo_docker_api.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/docker/utils.py
🔇 Additional comments (3)
docs/examples/docker/demo_docker_api.py (3)
1265-1283: browser_config now present; resolves the prior validation issue. This addresses the earlier 422 risk and aligns with other demos.
1306-1324: browser_config present in the streaming payload; resolved. This fixes the previously flagged schema mismatch for streaming.
1265-1283: Ignore the addition of crawler_strategy in the demo payload. The REST handler doesn't accept a crawler_strategy field; it is determined server-side from the PDF flag. Likely an incorrect or invalid review comment.
Actionable comments posted: 2
🧹 Nitpick comments (6)
deploy/docker/crawler_pool.py (6)
18-30: Make the strategy fingerprint more robust to avoid pool key collisions. Hashing only the class name may collide across differently configured instances of the same strategy. Prefer a stable per-strategy signature or config when available, with the class name as a fallback.
Apply this diff:

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        # Prefer a stable, user-provided signature/config if present
+        strat_sig = getattr(crawler_strategy, "signature", None)
+        if callable(strat_sig):
+            try:
+                strat_sig = strat_sig()
+            except Exception:
+                strat_sig = None
+        if strat_sig is None:
+            strat_cfg = getattr(crawler_strategy, "to_dict", None)
+            if callable(strat_cfg):
+                with suppress(Exception):
+                    strat_sig = strat_cfg()
+        payload["strategy"] = strat_sig or crawler_strategy.__class__.__name__
47-51: Tighten exception handling and preserve the original cause. Remove the unused variable, avoid the blind catch if possible, and chain the original error.
Apply this diff:

-    except MemoryError as e:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise
+    except Exception as e:
+        # Consider narrowing the exception types if feasible.
+        raise RuntimeError("Failed to start browser") from e
45-45: Split the statements; fix Ruff E702. Avoid multiple statements on one line.
Apply this diff:

-    POOL[sig] = crawler; LAST_USED[sig] = time.time()
+    POOL[sig] = crawler
+    LAST_USED[sig] = time.time()
60-64: Close crawlers outside the lock to avoid blocking get_crawler. Awaiting .close() while holding LOCK can stall all callers.
Apply this diff:

-async def close_all():
-    async with LOCK:
-        await asyncio.gather(*(c.close() for c in POOL.values()), return_exceptions=True)
-        POOL.clear(); LAST_USED.clear()
+async def close_all():
+    async with LOCK:
+        crawlers = list(POOL.values())
+        POOL.clear()
+        LAST_USED.clear()
+    await asyncio.gather(*(c.close() for c in crawlers), return_exceptions=True)
65-73: Avoid awaiting close() inside the janitor's critical section. Collect stale entries under LOCK, release it for the close awaits, then clean the maps under LOCK. Also fix the semicolons.
Apply this diff:

 async def janitor():
     while True:
         await asyncio.sleep(60)
         now = time.time()
-        async with LOCK:
-            for sig, crawler in list(POOL.items()):
-                if now - LAST_USED[sig] > IDLE_TTL:
-                    with suppress(Exception): await crawler.close()
-                    POOL.pop(sig, None); LAST_USED.pop(sig, None)
+        to_close = []
+        async with LOCK:
+            for sig, crawler in list(POOL.items()):
+                if now - LAST_USED.get(sig, 0) > IDLE_TTL:
+                    to_close.append((sig, crawler))
+        for sig, crawler in to_close:
+            with suppress(Exception):
+                await crawler.close()
+        async with LOCK:
+            for sig, _ in to_close:
+                POOL.pop(sig, None)
+                LAST_USED.pop(sig, None)
33-33: Type hint crawler_strategy to the expected protocol/type. Optional[object] is too broad; prefer the concrete AsyncCrawlerStrategy (or a minimal Protocol exposing signature()/to_dict()) to improve correctness and editor support.
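One possible shape for that hint; the Protocol below is a hypothetical name, only useful if importing the concrete AsyncCrawlerStrategy into this module is undesirable:

from typing import Optional, Protocol, runtime_checkable

from crawl4ai import AsyncWebCrawler, BrowserConfig


@runtime_checkable
class PoolableStrategy(Protocol):
    """Minimal surface the pool relies on when fingerprinting a strategy (illustrative)."""

    def to_dict(self) -> dict: ...


async def get_crawler(
    cfg: BrowserConfig,
    crawler_strategy: Optional[PoolableStrategy] = None,
) -> AsyncWebCrawler:
    ...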
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py
- 42-42: Abstract raise to an inner function (TRY301)
- 42-42: Avoid specifying long messages outside the exception class (TRY003)
- 45-45: Multiple statements on one line (semicolon) (E702)
- 47-47: Local variable e is assigned to but never used; remove the assignment (F841)
- 49-49: Do not catch blind exception: Exception (BLE001)
- 50-50: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 50-50: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (1)
deploy/docker/crawler_pool.py (1)
52-59: Good fix for the prior UnboundLocal/locking concerns. Initializing sig and guarding POOL/LAST_USED updates under LOCK resolves the earlier crash/race noted in the previous review.
from typing import Dict, Optional
from crawl4ai import AsyncWebCrawler, BrowserConfig
from typing import Dict
from utils import load_config
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Verify import path reliability for utils.
The absolute import from utils import load_config may break if deploy/docker isn't on sys.path. Prefer a package-relative import, or confirm deploy/docker is a package with __init__.py and that the entrypoint runs from this directory.
Run:
🏁 Script executed:
#!/bin/bash
# Check if deploy/docker is a package and whether utils is importable.
set -euo pipefail
if [ -f "deploy/docker/__init__.py" ]; then
echo "deploy/docker is a package."
else
echo "deploy/docker is NOT a package; relative imports won't work."
fi
echo "Listing files next to crawler_pool.py:"
ls -la deploy/docker || true
echo "Searching for demo entrypoint:"
fd -a --exact-path demo_docker_api.py || true
Length of output: 1558
Convert the utils import to package-relative
- Add an empty deploy/docker/__init__.py to make it a package.
- In deploy/docker/crawler_pool.py, change from utils import load_config to from .utils import load_config.
🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around line 6, the import uses a top-level
module import "from utils import load_config" which breaks package-relative
resolution; add an empty deploy/docker/__init__.py to make the directory a
package and change the import to use a relative import (from .utils import
load_config) so the module is resolved within the package namespace.
Actionable comments posted: 1
🧹 Nitpick comments (4)
deploy/docker/crawler_pool.py (4)
18-30: The pool key should include the strategy module and config, not only the class name. The class name alone can collide; different strategy instances/configs will be coalesced.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
+    if crawler_strategy is not None:
+        strategy_cls = crawler_strategy.__class__
+        payload["strategy"] = f"{strategy_cls.__module__}.{strategy_cls.__name__}"
+        # Include config when available to avoid reusing incompatible crawlers
+        if hasattr(crawler_strategy, "to_dict"):
+            payload["strategy_config"] = crawler_strategy.to_dict()
+        else:
+            payload["strategy_repr"] = repr(crawler_strategy)
76-79: Avoid double-wrapping exceptions; preserve the original traceback. The outer catch re-wraps errors already handled inside; also flagged by Ruff (BLE001/B904/TRY003).

-    except MemoryError:
-        raise
-    except Exception as e:
-        raise RuntimeError(f"Failed to start browser: {e}")
+    except MemoryError:
+        raise
91-95: Don't await close() while holding the pool lock. Release the lock before awaiting to avoid head-of-line blocking of other callers.

 async def close_all():
-    async with LOCK:
-        await asyncio.gather(*(c.close() for c in POOL.values()), return_exceptions=True)
-        POOL.clear(); LAST_USED.clear()
+    async with LOCK:
+        crawlers = list(POOL.values())
+        POOL.clear(); LAST_USED.clear()
+    await asyncio.gather(*(c.close() for c in crawlers), return_exceptions=True)
96-105: The janitor awaits under LOCK; restructure to close outside the lock. This prevents blocking get_crawler and other pool operations during close.

 async def janitor():
     while True:
         await asyncio.sleep(60)
         now = time.time()
-        async with LOCK:
-            for sig, crawler in list(POOL.items()):
-                if now - LAST_USED[sig] > IDLE_TTL:
-                    with suppress(Exception): await crawler.close()
-                    POOL.pop(sig, None); LAST_USED.pop(sig, None)
+        async with LOCK:
+            expired = [(sig, POOL[sig]) for sig, ts in list(LAST_USED.items()) if now - ts > IDLE_TTL]
+        for sig, crawler in expired:
+            with suppress(Exception):
+                await crawler.close()
+            async with LOCK:
+                POOL.pop(sig, None); LAST_USED.pop(sig, None)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (2)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py
- 51-51: Abstract raise to an inner function (TRY301)
- 51-51: Avoid specifying long messages outside the exception class (TRY003)
- 68-68: Avoid specifying long messages outside the exception class (TRY003)
- 78-78: Do not catch blind exception: Exception (BLE001)
- 79-79: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 79-79: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (2)
deploy/docker/crawler_pool.py (2)
6-6: Use a package-relative import to avoid runtime import errors. Switch to a relative import so this works when deploy/docker is a package.

-from utils import load_config
+from .utils import load_config

81-89: Good: pool mutations guarded with LOCK in finally. Prevents races when updating LAST_USED or cleaning up after failures.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deploy/docker/crawler_pool.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py
- 59-59: Abstract raise to an inner function (TRY301)
- 59-59: Avoid specifying long messages outside the exception class (TRY003)
- 59-59: String contains ambiguous – (EN DASH); did you mean - (HYPHEN-MINUS)? (RUF001)
- 77-77: Avoid specifying long messages outside the exception class (TRY003)
- 87-87: Do not catch blind exception: Exception (BLE001)
- 88-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 88-88: Avoid specifying long messages outside the exception class (TRY003)
    if crawler_strategy is not None:
        payload["strategy"] = crawler_strategy.__class__.__name__
Include strategy configuration in the pool signature.
Keying pooled crawlers only by crawler_strategy.__class__.__name__ collapses all instances of the same strategy class into a single slot. If two callers supply the same strategy class with different constructor arguments (e.g., PDFCrawlerStrategy(ocr=True) vs PDFCrawlerStrategy(ocr=False)), the second caller will be handed the already-started crawler that still holds the first instance, so their requested configuration is silently ignored. That is a functional bug for any strategy with mutable/per-instance configuration. Please extend the signature to encode the strategy's effective settings (e.g., the module-qualified name plus a stable serialization of its config) so distinct instances don't collide.
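A compact sketch of one way to build such a key (sha256 over a sorted JSON payload, as suggested earlier in this review; the to_dict() calls are assumptions about the config and strategy APIs):

import hashlib
import json


def _sig(cfg, crawler_strategy=None) -> str:
    payload = cfg.to_dict() if hasattr(cfg, "to_dict") else dict(vars(cfg))
    if crawler_strategy is not None:
        cls = crawler_strategy.__class__
        payload["strategy"] = f"{cls.__module__}.{cls.__qualname__}"
        to_dict = getattr(crawler_strategy, "to_dict", None)
        payload["strategy_config"] = to_dict() if callable(to_dict) else repr(crawler_strategy)
    # default=repr keeps non-JSON values (enums, paths) from breaking the dump
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=repr)
    return hashlib.sha256(blob.encode()).hexdigest()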
🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around lines 25 to 27, the pool signature only
keys strategies by crawler_strategy.__class__.__name__, causing different
instances of the same class with different constructor arguments to collide; fix
this by including a stable, deterministic encoding of the instance configuration
in the payload key: derive the module-qualified class name (e.g.,
strategy.__class__.__module__ + "." + strategy.__class__.__name__) and a
serializable representation of the strategy's effective settings (prefer a
to_dict()/asdict() method on the strategy or fall back to
vars()/inspect.getfullargspec/init args), then JSON-serialize that config with
sort_keys=True (and avoid non-serializable members) and add it to payload (e.g.,
payload["strategy"] = "<module.ClassName>", payload["strategy_config"] =
"<stable-json>") so distinct configured instances do not collide.
deploy/docker/crawler_pool.py
Outdated
        except Exception as e:
            fut.set_exception(e)
            raise RuntimeError("Failed to start browser") from e
        finally:
            async with LOCK:
                CREATING.pop(sig, None)

        # Await the shared result (crawler or the same exception)
        return await fut

    except MemoryError:
        raise
    except Exception as e:
        raise RuntimeError(f"Failed to start browser: {e}")
Let cancellations propagate instead of masking them.
Both except Exception handlers convert asyncio.CancelledError into a RuntimeError, so a cancelled request/worker can no longer abort cleanly: it shows up as an internal failure and keeps the pool bookkeeping alive when it should unwind. We need to special-case CancelledError (cancel the shared future if we're the creator and re-raise immediately) before the generic handlers.
- if creator:
- # Only the creator actually creates/starts the crawler
- try:
+ if creator:
+ # Only the creator actually creates/starts the crawler
+ try:
crawler = AsyncWebCrawler(
config=cfg, thread_safe=False, crawler_strategy=crawler_strategy
)
await crawler.start()
async with LOCK:
POOL[sig] = crawler
LAST_USED[sig] = time.time()
fut.set_result(crawler)
- except Exception as e:
+ except asyncio.CancelledError:
+ if not fut.done():
+ fut.cancel()
+ raise
+ except Exception as e:
fut.set_exception(e)
raise RuntimeError("Failed to start browser") from e
finally:
async with LOCK:
CREATING.pop(sig, None)
# Await the shared result (crawler or the same exception)
return await fut
except MemoryError:
raise
+ except asyncio.CancelledError:
+ raise
except Exception as e:
- raise RuntimeError(f"Failed to start browser: {e}")
+ raise RuntimeError("Failed to start browser") from e
🧰 Tools
🪛 Ruff (0.13.1)
- 77-77: Avoid specifying long messages outside the exception class (TRY003)
- 87-87: Do not catch blind exception: Exception (BLE001)
- 88-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
- 88-88: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
In deploy/docker/crawler_pool.py around lines 75 to 88, the generic except
blocks are converting asyncio.CancelledError into RuntimeError; change this by
adding a specific except asyncio.CancelledError handler before the generic
Exception handlers: if the current task is the creator (i.e., you set fut in
this scope) then cancel the shared future (fut.cancel()) or set its exception
accordingly, ensure CREATING cleanup still runs, and immediately re-raise the
CancelledError so cancellation propagates; then keep the existing MemoryError
and generic Exception handlers after that.
This reverts commit bc3b3d1.
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/examples/docker/demo_docker_api.py (1)
1285-1295: Reuse shared helpers for payload logging & output consistency. Every other demo routes through make_request, which already prints the payload, formats the response summary, and centralizes error handling. Dropping to raw client.post here duplicates some of that logic and results in mixed console.print output in the demo. Suggest delegating to make_request (and returning early if it yields None) so the PDF flow behaves like the rest.

-    resp = await client.post("/crawl", json=payload)
-    resp.raise_for_status()
-    data = resp.json()
-    print("=== Demo: PDF Crawl ===")
-    print("Success:", data.get("success"))
-    print("Number of results:", len(data.get("results", [])))
-    if data.get("results"):
-        first = data["results"][0]
-        text_snippet = (first.get("text") or "")[:500]
-        print("Extracted text (first 500 chars):")
-        print(text_snippet)
+    results = await make_request(client, "/crawl", payload, "Demo 10: PDF Crawl")
+    if not results:
+        return
+
+    first = results[0]
+    text_snippet = (first.get("text") or "")[:500]
+    console.print("[cyan]Extracted text (first 500 chars):[/]")
+    console.print(text_snippet)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deploy/docker/crawler_pool.py (3 hunks)
- docs/examples/docker/demo_docker_api.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/docker/crawler_pool.py (3)
- crawl4ai/async_webcrawler.py (1): AsyncWebCrawler (53-852)
- crawl4ai/async_configs.py (1): BrowserConfig (329-621)
- deploy/docker/utils.py (1): load_config (23-41)
🪛 Ruff (0.13.1)
deploy/docker/crawler_pool.py
- 40-40: Statement ends with an unnecessary semicolon; remove it (E703)
- 43-43: Abstract raise to an inner function (TRY301)
- 43-43: Avoid specifying long messages outside the exception class (TRY003)
- 43-43: String contains ambiguous – (EN DASH); did you mean - (HYPHEN-MINUS)? (RUF001)
🔇 Additional comments (2)
deploy/docker/crawler_pool.py (2)
24-30: Include strategy configuration in the pool signature. Line 27 only keys on crawler_strategy.__class__.__name__, so two callers using the same strategy class with different parameters will collide and get a crawler that was started with someone else's settings. That breaks correctness for any strategy with per-instance config (e.g., PDF crawlers with different OCR flags) and will be extremely hard to debug. Please extend the signature payload to encode both the module-qualified class name and a stable serialization of the strategy's effective configuration (e.g., via a to_dict()/vars() fallback) before hashing.

-    if crawler_strategy is not None:
-        payload["strategy"] = crawler_strategy.__class__.__name__
-
-    json_payload = json.dumps(payload, sort_keys=True, separators=(",", ":"))
+    if crawler_strategy is not None:
+        strategy_cls = crawler_strategy.__class__
+        payload["strategy"] = f"{strategy_cls.__module__}.{strategy_cls.__qualname__}"
+        if hasattr(crawler_strategy, "to_dict"):
+            payload["strategy_config"] = crawler_strategy.to_dict()
+        elif hasattr(crawler_strategy, "__dict__"):
+            payload["strategy_config"] = {
+                k: v
+                for k, v in vars(crawler_strategy).items()
+                if not callable(v) and not k.startswith("_")
+            }
+        else:
+            payload["strategy_config"] = repr(crawler_strategy)
+
+    json_payload = json.dumps(
+        payload,
+        sort_keys=True,
+        separators=(",", ":"),
+        default=lambda value: repr(value),
+    )
53-59: Guard pool bookkeeping with the lock. Lines 53-59 mutate POOL/LAST_USED without holding LOCK, so concurrent requests can race the cleanup and either resurrect stale entries or throw KeyError. We already require the lock for all other pool touches; this block needs the same protection.

-    finally:
-        if sig and sig in POOL:
-            LAST_USED[sig] = time.time()
-        else:
-            # If we failed to start the browser, we should remove it from the pool
-            if sig:
-                POOL.pop(sig, None)
-                LAST_USED.pop(sig, None)
+    finally:
+        if sig:
+            async with LOCK:
+                if sig in POOL:
+                    LAST_USED[sig] = time.time()
+                else:
+                    POOL.pop(sig, None)
+                    LAST_USED.pop(sig, None)
Summary
This PR updates the crawl request handling to correctly apply the PDFCrawlerStrategy and PDFContentScrapingStrategy only when the URLs point to PDF files. Previously, all URLs were treated as web pages by default, which caused PDFs to be misprocessed.
Fixes issues where PDFs were downloaded but not processed correctly, e.g., missing content extraction.
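For readers skimming the diff, a minimal sketch of what such a detector can look like (an illustration only, not the exact utils.is_pdf_url added by this PR): check the URL suffix first, then fall back to a HEAD request for the Content-Type.

import httpx


async def is_pdf_url_sketch(url: str) -> bool:
    """Best-effort async check for whether a URL points at a PDF."""
    if url.lower().split("?", 1)[0].endswith(".pdf"):
        return True
    try:
        async with httpx.AsyncClient(follow_redirects=True, timeout=10.0) as client:
            resp = await client.head(url)
        return "application/pdf" in resp.headers.get("content-type", "").lower()
    except httpx.HTTPError:
        # If the probe fails, treat the URL as a regular web page.
        return False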
List of files changed and why
- api.py – Modified handle_crawl_request and handle_stream_crawl_request to dynamically apply PDFCrawlerStrategy and PDFContentScrapingStrategy for PDF URLs.
- __init__.py – Export PDFCrawlerStrategy, PDFContentScrapingStrategy.
- crawler_pool.py – Edit _sig() and get_crawler to handle crawler_strategy.
- demo_docker_api.py – Added examples/demos to illustrate PDF crawling usage.
- utils.py – Added is_pdf_url to be able to update the strategies.

How Has This Been Tested?
Checklist: