@fardhanrasya fardhanrasya commented Sep 2, 2025

Summary

Update the example of using a Filter with DefaultMarkdownGenerator so it can be copy-pasted and run directly.
Previously, the snippet was missing `import asyncio` and the `asyncio.run(main())` entry point.
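For context, the failure mode being fixed is that `await` is only valid inside a coroutine, so a snippet without an entry point cannot run as a plain script. A minimal stdlib-only sketch of the pattern the fix applies (`fetch_length` here is a hypothetical stand-in for the crawl logic, not part of crawl4ai):

```python
import asyncio

async def fetch_length(url: str) -> int:
    # Stand-in for an async crawl: yield control once, as real I/O would,
    # then return something measurable.
    await asyncio.sleep(0)
    return len(url)

async def main():
    length = await fetch_length("https://news.ycombinator.com")
    print("Length:", length)

if __name__ == "__main__":
    # asyncio.run creates the event loop, drives main() to completion,
    # and closes the loop -- this is the line the original snippet lacked.
    asyncio.run(main())
```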

List of files changed and why

  • deploy/docker/c4ai-doc-context.md – fixed example by wrapping in main() function and adding asyncio.run(main()) to make it runnable.
  • docs/md_v2/core/quickstart.md – same as above

How Has This Been Tested?

  • Ran the updated snippet locally with python example.py.
  • Verified that it successfully crawls Hacker News and prints raw/fit Markdown lengths.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code (N/A – documentation only)
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests (N/A – documentation only)
  • New and existing unit tests pass locally with my changes (not applicable for docs)

Summary by CodeRabbit

  • Documentation
    • Updated examples to use an asynchronous main coroutine with asyncio.run for clearer execution flow.
    • Demonstrated configuring cache behavior via a public enum in the sample configuration.
    • Consolidated creation of the markdown generator and crawler config inside the async function.
    • Showed async context usage for the crawler and updated invocation pattern.
    • Added sample output of Raw Markdown and Fit Markdown lengths.
    • No changes to public APIs; improvements are example-only.


coderabbitai bot commented Sep 2, 2025

Walkthrough

Documentation examples were updated to show an asynchronous usage pattern: imports now include asyncio and CacheMode; logic is wrapped in an async main() with asyncio.run(main()); AsyncWebCrawler is used within an async context; CrawlerRunConfig demonstrates cache_mode=CacheMode.BYPASS. No library APIs changed.

Changes

Cohort / File(s) Summary of Changes
Docs: Async example update
docs/md_v2/core/quickstart.md
Rewrote example to use async main(), asyncio.run entrypoint, AsyncWebCrawler in async context, and CacheMode in CrawlerRunConfig. Structural/indentation adjustments only; no API changes.
Deploy docs: Async example update
deploy/docker/c4ai-doc-context.md
Updated example to async pattern with asyncio, async main(), async with AsyncWebCrawler, explicit CacheMode.BYPASS in config, and added prints for raw/fit markdown lengths. No API changes.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Script as Script (__main__)
  participant Crawler as AsyncWebCrawler
  participant Site as Target Website

  User->>Script: Run `python script.py`
  Script->>Script: asyncio.run(main())
  Script->>Crawler: async with AsyncWebCrawler()
  Crawler->>Site: Fetch & process content (async)
  Site-->>Crawler: HTML/Content
  Crawler-->>Script: Result (raw/fit markdown)
  Script->>User: Print lengths and exit
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I hop through loops of async night,
Awaiting awaits in moonlit byte,
With CacheMode set to swift bypass,
I nibble docs and nose the class—
When main() runs, I thump with cheer,
Async burrows are crystal clear! 🐇✨

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
deploy/docker/c4ai-doc-context.md (1)

4895-4899: Add a success check and guard fit_markdown to avoid None errors

If the crawl fails, or fit_markdown is None, the current prints can raise. A tiny guard improves resilience without distracting from the example.

```diff
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun("https://news.ycombinator.com", config=config)
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://news.ycombinator.com", config=config)
+        if result.success and result.markdown:
+            print("Raw Markdown length:", len(result.markdown.raw_markdown))
+            fit = result.markdown.fit_markdown or ""
+            print("Fit Markdown length:", len(fit))
+        else:
+            print(f"Error: {result.error_message}")
```
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e651e04 and 1274a81.

📒 Files selected for processing (2)
  • deploy/docker/c4ai-doc-context.md (1 hunks)
  • docs/md_v2/core/quickstart.md (1 hunks)
🔇 Additional comments (2)
docs/md_v2/core/quickstart.md (1)

62-80: Good async pattern and config usage.

Wrapping in async def main() and using asyncio.run(main()) is correct; imports and CacheMode.BYPASS example look consistent with prior notes.

deploy/docker/c4ai-doc-context.md (1)

4880-4902: Runnable async example looks good

Importing asyncio, wrapping in async main(), and using asyncio.run(main()) makes the snippet copy-paste runnable. Using CacheMode.BYPASS here is also appropriate for fresh content.

Comment on lines +100 to 122

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```
💡 Verification agent

🧩 Analysis chain

Make the snippet resilient to both string and structured result.markdown shapes.

Elsewhere in the doc, result.markdown is treated like a string, but here it’s accessed as an object with raw_markdown/fit_markdown. To avoid version drift and keep this runnable across releases, compute lengths via getattr fallback.

Apply this diff:

```diff
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+        md = result.markdown
+        raw = getattr(md, "raw_markdown", md)
+        fit = getattr(md, "fit_markdown", md)
+        print("Raw Markdown length:", len(raw))
+        print("Fit Markdown length:", len(fit))
```

Also consider aligning the earlier “Your First Crawl” example to use the same shape (or add a short note clarifying the return type), so readers don’t get conflicting guidance. I can open a follow-up PR if you’d like.


🏁 Script executed:

```bash
#!/bin/bash
set -e

echo "Searching for AsyncWebCrawler class definition..."
grep -R "class AsyncWebCrawler" -n .

echo "Searching for arun method definition..."
grep -R "async def arun" -n .

echo "Searching for Result class definitions..."
grep -R "class .*Result" -n .

echo "Searching for .markdown usage in code (in crawl4ai folder if exists)..."
if [ -d "crawl4ai" ]; then
  grep -R "\.markdown" -n crawl4ai || echo "No .markdown in crawl4ai folder"
else
  echo "No crawl4ai directory, searching entire repo..."
  grep -R "\.markdown" -n .
fi

echo "Locating 'Your First Crawl' in docs..."
grep -R "Your First Crawl" -n docs/md_v2/core/quickstart.md

echo "Locating 'result.markdown' in docs..."
grep -R "result.markdown" -n docs/md_v2/core/quickstart.md
```

Length of output: 282964


Make snippet resilient to both string and structured result.markdown

File: docs/md_v2/core/quickstart.md (lines 100-122) — CrawlResult.markdown can be a plain/str-subclass or a MarkdownGenerationResult object; use getattr fallback.

```diff
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+        md = result.markdown
+        raw = getattr(md, "raw_markdown", md)
+        fit = getattr(md, "fit_markdown", md)
+        print("Raw Markdown length:", len(raw))
+        print("Fit Markdown length:", len(fit))
```

Also align the earlier “Your First Crawl” example (or add a short note clarifying the return shape) to avoid conflicting guidance.
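The getattr fallback works because `getattr(obj, name, default)` returns the default when the attribute is absent, so plain strings pass through unchanged while structured results expose their fields. A small stdlib-only sketch of the pattern (the `MarkdownResult` class here is a hypothetical stand-in for crawl4ai's `MarkdownGenerationResult`):

```python
from dataclasses import dataclass

@dataclass
class MarkdownResult:
    # Hypothetical stand-in for a structured markdown result object.
    raw_markdown: str
    fit_markdown: str

def markdown_lengths(md):
    # Works whether md is a plain string or a structured result:
    # getattr falls back to md itself when the attribute is missing.
    raw = getattr(md, "raw_markdown", md)
    fit = getattr(md, "fit_markdown", md)
    return len(raw), len(fit)

print(markdown_lengths("# hello"))                      # plain string shape
print(markdown_lengths(MarkdownResult("# raw", "# f"))) # structured shape
```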

📝 Committable suggestion


Suggested change

Current:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```

Suggested:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        md = result.markdown
        raw = getattr(md, "raw_markdown", md)
        fit = getattr(md, "fit_markdown", md)
        print("Raw Markdown length:", len(raw))
        print("Fit Markdown length:", len(fit))

if __name__ == "__main__":
    asyncio.run(main())
```
