Open
Labels
🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
Description
crawl4ai version
0.6.3
Expected Behavior
Only the preferred content (in my case, markdown) and the URL should be sent for extraction.
Current Behavior
https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py#L610
When the input is raw HTML, the url variable at the line linked above contains the full HTML content.
Earlier in the same code, a _url variable is assigned that does take raw HTML into account. I suspect that _url should be used here instead of url.
As a result, much more content (5x in my case) is sent to the LLM.
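The suspected fix can be illustrated with a small standalone sketch. This is not crawl4ai's actual code: `safe_url_label`, `build_extraction_payload`, the `"Raw HTML"` placeholder, and the assumption that raw inputs are marked with a `raw:` prefix are all hypothetical names chosen for illustration. The idea is just that the value passed along as the "URL" should be a short label when the input is raw HTML, not the HTML itself.

```python
# Hypothetical sketch, not crawl4ai's implementation.
# Assumption: raw HTML inputs are passed as a string starting with "raw:".
RAW_PREFIX = "raw:"


def safe_url_label(url: str, placeholder: str = "Raw HTML") -> str:
    """Return a short placeholder for raw-HTML inputs, else the URL unchanged."""
    return placeholder if url.startswith(RAW_PREFIX) else url


def build_extraction_payload(url: str, content: str) -> dict:
    """Pair the content with a URL label that never carries the full HTML."""
    return {"url": safe_url_label(url), "content": content}


if __name__ == "__main__":
    raw_input = RAW_PREFIX + "<html><body>" + "x" * 10_000 + "</body></html>"
    payload = build_extraction_payload(raw_input, "# Title\n\nShort markdown")
    # The URL field stays small even though the raw input is large.
    print(len(raw_input), len(payload["url"]))
```

With a guard like this, the size of what is sent for extraction depends only on the chosen content (markdown), not on the size of the raw HTML input.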
Is this reproducible?
Yes
Inputs Causing the Bug
Use a relatively large raw HTML string as input.
Steps to Reproduce
Compare the LLM token usage to the markdown size (I estimated tokens roughly as string length / 4).
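The rough comparison can be sketched as follows. The 4-characters-per-token figure is only a rule of thumb, and `estimate_tokens` and `inflation_ratio` are hypothetical helper names, not part of crawl4ai or any LLM client. `reported_prompt_tokens` stands for whatever prompt token count your LLM provider reports.

```python
# Rough heuristic only: assumes about 4 characters per token.
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Estimate the token count from the string length."""
    return len(text) // chars_per_token


def inflation_ratio(reported_prompt_tokens: int, markdown: str) -> float:
    """Ratio of reported prompt tokens to the estimate for the markdown alone."""
    expected = max(estimate_tokens(markdown), 1)  # avoid division by zero
    return reported_prompt_tokens / expected


if __name__ == "__main__":
    markdown = "a" * 400  # stand-in for the generated markdown
    print(estimate_tokens(markdown))       # 100
    print(inflation_ratio(500, markdown))  # 5.0
```

A ratio well above 1 suggests that more than the markdown is being sent to the LLM.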
Code snippets
OS
Linux
Python version
3.11
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
rbushria