[Bug]: Wrong URL variable used for extraction of raw html #1116

@djl0

Description

crawl4ai version

0.6.3

Expected Behavior

Extraction should send the preferred content (markdown, in my case) and the URL to the extraction step.

Current Behavior

https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py#L610
When the input is raw HTML, the `url` variable at this line contains the full HTML content.

Earlier in the same function, the `_url` variable is assigned and does take raw HTML into account. I suspect that `_url` should be used here instead of `url`.

As a result, much more content (about 5x in my case) is sent to the LLM.
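To illustrate the suspected mix-up, here is a minimal, self-contained sketch. It is not the crawl4ai source: the helper names, the `raw:` prefix check, and the `"Raw HTML"` label are assumptions made for illustration. It only shows how passing `url` instead of a sanitized `_url` inflates what reaches the extraction step.

```python
RAW_PREFIX = "raw:"  # assumed marker for raw-HTML input (illustrative)


def display_url(url: str) -> str:
    """Return a short label for raw-HTML input, or the URL unchanged."""
    return "Raw HTML" if url.startswith(RAW_PREFIX) else url


def build_extraction_input(url: str, markdown: str, use_display_url: bool) -> dict:
    """Hypothetical stand-in for the payload handed to the extraction step."""
    _url = display_url(url)
    # Suspected bug: forwarding `url` (the whole HTML document) instead of `_url`.
    return {"url": _url if use_display_url else url, "content": markdown}


def payload_chars(payload: dict) -> int:
    """Total number of characters the extraction step would receive."""
    return len(payload["url"]) + len(payload["content"])


# A large raw-HTML input and the much smaller markdown derived from it.
raw_input = RAW_PREFIX + "<html><body>" + "<p>hello</p>" * 1000 + "</body></html>"
markdown = "hello\n" * 1000

buggy = build_extraction_input(raw_input, markdown, use_display_url=False)
fixed = build_extraction_input(raw_input, markdown, use_display_url=True)

print("buggy payload chars:", payload_chars(buggy))
print("fixed payload chars:", payload_chars(fixed))
```

In the buggy variant the payload carries the entire HTML document in the URL field on top of the markdown, which would explain the inflated LLM usage.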

Is this reproducible?

Yes

Inputs Causing the Bug

Use a relatively large raw HTML document as input.

Steps to Reproduce

Compare the LLM token usage with the markdown size (I estimated tokens roughly as string length / 4).
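The rough comparison described above can be sketched as follows. The function names are hypothetical, and the 4-characters-per-token figure is the same rule of thumb used in this report, not an exact tokenizer count.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: string length divided by ~4 characters per token."""
    return len(text) // chars_per_token


def usage_ratio(reported_prompt_tokens: int, markdown: str) -> float:
    """How many times larger the reported usage is than the markdown estimate."""
    expected = estimate_tokens(markdown)
    if expected == 0:
        raise ValueError("markdown is too short to estimate tokens")
    return reported_prompt_tokens / expected
```

A ratio close to 1 suggests only the markdown was sent; a ratio well above 1 suggests extra content, such as the raw HTML, was included in the prompt.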

Code snippets

OS

Linux

Python version

3.11

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Metadata

Assignees

No one assigned

Labels

🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
