docs: make example of using a Filter with DefaultMarkdownGenerator directly runnable #1465
base: main
Conversation
…directly runnable with asyncio
Walkthrough

Documentation examples were updated to show an asynchronous usage pattern: imports now include asyncio and CacheMode; logic is wrapped in an async main() with asyncio.run(main()); AsyncWebCrawler is used within an async context; CrawlerRunConfig demonstrates cache_mode=CacheMode.BYPASS. No library APIs changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Script as Script (__main__)
    participant Crawler as AsyncWebCrawler
    participant Site as Target Website
    User->>Script: Run `python script.py`
    Script->>Script: asyncio.run(main())
    Script->>Crawler: async with AsyncWebCrawler()
    Crawler->>Site: Fetch & process content (async)
    Site-->>Crawler: HTML/Content
    Crawler-->>Script: Result (raw/fit markdown)
    Script->>User: Print lengths and exit
```
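To make the described flow concrete, here is a minimal sketch of that wrapper pattern, stripped of the filter setup (the URL is a placeholder; the imports and calls mirror the snippets reviewed below):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    # CacheMode.BYPASS forces a fresh fetch instead of serving cached content.
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # AsyncWebCrawler is an async context manager, so it must run inside a coroutine.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print("Success:", result.success)

if __name__ == "__main__":
    # asyncio.run(main()) is the piece the old snippet was missing.
    asyncio.run(main())
```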
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/docker/c4ai-doc-context.md (1)

4895-4899: Add a success check and guard `fit_markdown` to avoid None errors

If the crawl fails, or `fit_markdown` is None, the current prints can raise. A tiny guard improves resilience without distracting from the example.
```diff
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun("https://news.ycombinator.com", config=config)
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://news.ycombinator.com", config=config)
+        if result.success and result.markdown:
+            print("Raw Markdown length:", len(result.markdown.raw_markdown))
+            fit = result.markdown.fit_markdown or ""
+            print("Fit Markdown length:", len(fit))
+        else:
+            print(f"Error: {result.error_message}")
```
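Assembled for convenience, a runnable sketch of the full example with this guard applied; it assumes the crawl4ai API exactly as used in the snippets in this review (result.success, result.error_message, and the markdown fields):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Filter boilerplate out of the generated markdown.
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        if result.success and result.markdown:
            print("Raw Markdown length:", len(result.markdown.raw_markdown))
            # fit_markdown can be None when the filter produces no output.
            fit = result.markdown.fit_markdown or ""
            print("Fit Markdown length:", len(fit))
        else:
            print(f"Error: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```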
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- deploy/docker/c4ai-doc-context.md (1 hunks)
- docs/md_v2/core/quickstart.md (1 hunks)
🔇 Additional comments (2)
docs/md_v2/core/quickstart.md (1)

62-80: Good async pattern and config usage.

Wrapping in `async def main()` and using `asyncio.run(main())` is correct; imports and the `CacheMode.BYPASS` example look consistent with prior notes.

deploy/docker/c4ai-doc-context.md (1)

4880-4902: Runnable async example looks good

Importing asyncio, wrapping in async main(), and using asyncio.run(main()) makes the snippet copy-paste runnable. Using CacheMode.BYPASS here is also appropriate for fresh content.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```
💡 Verification agent
🧩 Analysis chain
Make the snippet resilient to both string and structured `result.markdown` shapes.

Elsewhere in the doc, `result.markdown` is treated like a string, but here it's accessed as an object with `raw_markdown`/`fit_markdown`. To avoid version drift and keep this runnable across releases, compute lengths via a `getattr` fallback.

Apply this diff:
- print("Raw Markdown length:", len(result.markdown.raw_markdown))
- print("Fit Markdown length:", len(result.markdown.fit_markdown))
+ md = result.markdown
+ raw = getattr(md, "raw_markdown", md)
+ fit = getattr(md, "fit_markdown", md)
+ print("Raw Markdown length:", len(raw))
+ print("Fit Markdown length:", len(fit))
Also consider aligning the earlier “Your First Crawl” example to use the same shape (or add a short note clarifying the return type), so readers don’t get conflicting guidance. I can open a follow-up PR if you’d like.
🏁 Script executed:

```bash
#!/bin/bash
set -e
echo "Searching for AsyncWebCrawler class definition..."
grep -R "class AsyncWebCrawler" -n .
echo "Searching for arun method definition..."
grep -R "async def arun" -n .
echo "Searching for Result class definitions..."
grep -R "class .*Result" -n .
echo "Searching for .markdown usage in code (in crawl4ai folder if exists)..."
if [ -d "crawl4ai" ]; then
  grep -R "\.markdown" -n crawl4ai || echo "No .markdown in crawl4ai folder"
else
  echo "No crawl4ai directory, searching entire repo..."
  grep -R "\.markdown" -n .
fi
echo "Locating 'Your First Crawl' in docs..."
grep -R "Your First Crawl" -n docs/md_v2/core/quickstart.md
echo "Locating 'result.markdown' in docs..."
grep -R "result.markdown" -n docs/md_v2/core/quickstart.md
```
Length of output: 282964
Make the snippet resilient to both string and structured `result.markdown`

File: docs/md_v2/core/quickstart.md (lines 100-122). CrawlResult.markdown can be a plain/str-subclass or a MarkdownGenerationResult object; use a getattr fallback.
- print("Raw Markdown length:", len(result.markdown.raw_markdown))
- print("Fit Markdown length:", len(result.markdown.fit_markdown))
+ md = result.markdown
+ raw = getattr(md, "raw_markdown", md)
+ fit = getattr(md, "fit_markdown", md)
+ print("Raw Markdown length:", len(raw))
+ print("Fit Markdown length:", len(fit))
Also align the earlier “Your First Crawl” example (or add a short note clarifying the return shape) to avoid conflicting guidance.
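For illustration only, one hypothetical shape that alignment could take (the surrounding "Your First Crawl" code is assumed, since that snippet is not shown in this review):

```python
# Hypothetical helper, not part of crawl4ai: normalize result.markdown,
# which may be a plain string or a MarkdownGenerationResult object.
def markdown_text(result):
    md = result.markdown
    return getattr(md, "raw_markdown", md)

# Usage in the earlier example would then be:
# print(markdown_text(result))
```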
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        md = result.markdown
        raw = getattr(md, "raw_markdown", md)
        fit = getattr(md, "fit_markdown", md)
        print("Raw Markdown length:", len(raw))
        print("Fit Markdown length:", len(fit))

if __name__ == "__main__":
    asyncio.run(main())
```
🤖 Prompt for AI Agents
In docs/md_v2/core/quickstart.md around lines 100 to 122, the example assumes result.markdown is always a MarkdownGenerationResult object and fails if it's a plain string; update the snippet to handle both shapes by using a getattr fallback (e.g., obtain raw_markdown = getattr(result.markdown, "raw_markdown", result.markdown) and fit_markdown = getattr(result.markdown, "fit_markdown", result.markdown)) before printing lengths, and also adjust the earlier "Your First Crawl" example or add a short note clarifying that CrawlResult.markdown may be either a string or a MarkdownGenerationResult object so readers aren't given conflicting guidance.
Summary
Update the example of using a Filter with `DefaultMarkdownGenerator` so it can be copy-pasted and run directly. Previously, the snippet was missing `import asyncio` and `asyncio.run(main())`.

List of files changed and why

- `deploy/docker/c4ai-doc-context.md` – fixed the example by wrapping it in a `main()` function and adding `asyncio.run(main())` to make it runnable.
- `docs/md_v2/core/quickstart.md` – same as above.

How Has This Been Tested?

Ran `python example.py`.

Checklist: