Commits (changes shown from 57 of 89 commits)
1e1c887
fix(docker-api): migrate to modern datetime library API
emmanuel-ferdman May 13, 2025
8e3c411
Merge branch 'main' into main
emmanuel-ferdman Jul 29, 2025
7a8190e
Fix examples in README.md
NezarAli Aug 6, 2025
be63c98
feat(docker): add user-provided hooks support to Docker API
ntohidi Aug 11, 2025
88a9fbb
fix(deep-crawl): BestFirst priority inversion; remove pre-scoring tru…
ntohidi Aug 11, 2025
ecbe5ff
docs: Update URL seeding examples to use proper async context managers
SohamKukreti Aug 13, 2025
f4a4328
fix(crawler): Removed the incorrect reference in browser_config varia…
ntohidi Aug 18, 2025
dad7c51
Merge pull request #1398 from unclecode/fix/update-url-seeding-docs
ntohidi Aug 18, 2025
9447054
docs: update Docker instructions to use the latest release tag
ntohidi Aug 18, 2025
f4206d6
Merge pull request #1369 from NezarAli/main
ntohidi Aug 18, 2025
ef174a4
Merge pull request #1104 from emmanuel-ferdman/main
ntohidi Aug 20, 2025
69961cf
Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …
ntohidi Aug 20, 2025
9505102
fix(docker): Fix LLM API key handling for multi-provider support
ntohidi Aug 21, 2025
8bb0e68
Merge pull request #1422 from unclecode/fix/docker-llmEnvFile
ntohidi Aug 21, 2025
90af453
Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …
ntohidi Aug 21, 2025
c09a576
docs: update adaptive crawler docs and cache defaults; remove depreca…
SohamKukreti Aug 21, 2025
40ab287
fix(utils): Improve URL normalization by avoiding quote/unquote to pr…
ntohidi Aug 22, 2025
b1dff5a
feat: Add comprehensive website to API example with frontend
SohamKukreti Aug 24, 2025
f2da460
fix(dependencies): add cssselect to project dependencies
Thermofish Aug 25, 2025
102352e
fix(docker): resolve filter serialization and JSON encoding errors in…
ntohidi Aug 25, 2025
38f3ea4
fix(logger): ensure logger is a Logger instance in crawling strategie…
ntohidi Aug 26, 2025
159207b
feat(docker): Add temperature and base_url parameters for LLM configu…
ntohidi Aug 26, 2025
4fe2d01
Merge pull request #1440 from unclecode/feature/docker-llm-parameters
ntohidi Aug 26, 2025
cce3390
Merge pull request #1426 from unclecode/fix/update-quickstart-and-ada…
ntohidi Aug 26, 2025
2ad3fb5
feat(docker): improve docker error handling
SohamKukreti Aug 26, 2025
4e1c4bd
Merge pull request #1436 from unclecode/fix/docker-filter
ntohidi Aug 27, 2025
f7a3366
#1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig …
Ahmed-Tawfik94 Aug 28, 2025
4ed33fc
Remove deprecated test for 'proxy' parameter in BrowserConfig and upd…
Ahmed-Tawfik94 Aug 28, 2025
f566c5a
feat: add preserve_https_for_internal_links flag to maintain HTTPS du…
ntohidi Aug 28, 2025
bdacf61
feat: update documentation for preserve_https_for_internal_links. ref…
ntohidi Aug 28, 2025
70f473b
fix: drop Python 3.9 support and require Python >=3.10.
SohamKukreti Aug 28, 2025
9749e28
issue #1329 refactor(crawler): move unwanted properties to CrawlerRun…
nafeqq-1306 Aug 29, 2025
2de200c
Merge pull request #1433 from Thermofish/fix/excluded_selector
ntohidi Aug 29, 2025
6e72809
fix(auth): fixed Docker JWT authentication. ref #1442
ntohidi Sep 1, 2025
5e7fcb1
Merge pull request #1448 from unclecode/fix/https-reditrect
ntohidi Sep 1, 2025
af28e84
Merge pull request #1441 from unclecode/fix/improve-docker-error-hand…
ntohidi Sep 2, 2025
ae67d66
Merge pull request #1454 from nafeqq-1306/docstring-changes
ntohidi Sep 2, 2025
6772134
remove: delete unused yoyo snapshot subproject
ntohidi Sep 2, 2025
4878396
fix: raise error on last attempt failure in perform_completion_with_b…
ntohidi Sep 2, 2025
bc6d814
Merge pull request #1451 from unclecode/fix/remove-python3.9-version
ntohidi Sep 2, 2025
1eacea1
Merge pull request #1432 from unclecode/example/web2api-example
ntohidi Sep 3, 2025
6a3b3e9
Commit without API
Ahmed-Tawfik94 Sep 3, 2025
0482c1e
Merge pull request #1469 from unclecode/fix/docker-jwt
ntohidi Sep 4, 2025
1874a7b
fix: update option labels in request builder for clarity
Ahmed-Tawfik94 Sep 5, 2025
3bc56dd
fix: allow custom LLM providers for adaptive crawler embedding config…
ntohidi Sep 9, 2025
14b42b1
Merge pull request #1471 from unclecode/fix/adaptive-crawler-llm-config
ntohidi Sep 9, 2025
f8eaf01
Merge pull request #1467 from unclecode/fix/request-crawl-stream
ntohidi Sep 11, 2025
1717827
refactor(BrowserConfig): change deprecation warning for 'proxy' param…
Ahmed-Tawfik94 Sep 12, 2025
23431d8
Merge pull request #1389 from unclecode/fix/deep-crawl-scoring
ntohidi Sep 16, 2025
3899ac3
Merge pull request #1464 from unclecode/fix/proxy_deprecation
ntohidi Sep 16, 2025
77559f3
feat(StealthAdapter): fix stealth features for Playwright integration…
ntohidi Sep 18, 2025
d0eb5a6
Merge pull request #1501 from unclecode/fix/n-playwright-stealth
ntohidi Sep 19, 2025
a1950af
#1505 fix(api): update config handling to only set base config if not…
Ahmed-Tawfik94 Sep 22, 2025
69e8ca3
Merge pull request #1508 from unclecode/docker/base_config_overrides
ntohidi Sep 22, 2025
fef715a
Merge branch 'feature/docker-hooks' into develop
ntohidi Sep 25, 2025
3fe49a7
fix(docker-deployment): replace console.log with print for metadata e…
ntohidi Sep 25, 2025
361499d
Release v0.7.5: The Update
ntohidi Sep 29, 2025
70af81d
refactor(release): remove memory management section for cleaner docum…
ntohidi Sep 30, 2025
0d8d043
feat(docs): add brand book and page copy functionality
unclecode Sep 30, 2025
ef46df1
Update gitignore add local scripts folder
unclecode Sep 30, 2025
8d30662
fix: remove this import as it causes python to treat "json" as a vari…
Sjoeborg Oct 2, 2025
35dd206
fix: always return a list, even if we catch an exception
Sjoeborg Oct 2, 2025
408ad1b
feat(marketplace): Add Crawl4AI marketplace with secure configuration
unclecode Oct 2, 2025
749d200
fix(marketplace): Update URLs to use /marketplace path and relative A…
unclecode Oct 2, 2025
80aa6c1
Merge pull request #1530 from Sjoeborg/fix/arun-many-returns-none
ntohidi Oct 3, 2025
9292b26
Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …
ntohidi Oct 3, 2025
9900f63
Merge pull request #1531 from unclecode/develop
ntohidi Oct 3, 2025
5145d42
fix(docs): hide copy menu on non-markdown pages
unclecode Oct 3, 2025
8c62277
feat(marketplace): add sponsor logo uploads
unclecode Oct 6, 2025
d2c7f34
feat(docs): add chatgpt quick link to page actions
unclecode Oct 7, 2025
2c373f0
fix(marketplace): align admin api with backend endpoints
unclecode Oct 8, 2025
936397e
Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …
ntohidi Oct 9, 2025
611d48f
Merge branch 'develop' into release/v0.7.5
ntohidi Oct 9, 2025
5a4f21f
fix(marketplace): isolate api under marketplace prefix
unclecode Oct 9, 2025
abe8a92
fix(marketplace): resolve app detail page routing and styling issues
unclecode Oct 11, 2025
216019f
fix(marketplace): prevent hero image overflow and secondary card stre…
unclecode Oct 11, 2025
a3f057e
feat: Add hooks utility for function-based hooks with Docker client i…
ntohidi Oct 13, 2025
7dadb65
Merge branch 'develop' into release/v0.7.5
ntohidi Oct 13, 2025
4a04b85
feat: Add hooks utility for function-based hooks with Docker client i…
ntohidi Oct 13, 2025
aadab30
fix(docs): clarify Docker Hooks System with function-based API in README
ntohidi Oct 13, 2025
8fc1747
docs: Add demonstration files for v0.7.5 release, showcasing the new …
ntohidi Oct 13, 2025
c91b235
docs: Update 0.7.5 video walkthrough
ntohidi Oct 14, 2025
c7288dd
docs: add complete SDK reference documentation
unclecode Oct 19, 2025
749232b
feat: add AI assistant skill package for Crawl4AI
unclecode Oct 19, 2025
1bf85bc
fix: remove non-existent wiki link and clarify skill usage instructions
unclecode Oct 19, 2025
69d0ef8
fix: update Crawl4AI skill with corrected parameters and examples
unclecode Oct 19, 2025
c107617
fix: thoroughly verify and fix all Crawl4AI skill examples
unclecode Oct 19, 2025
6d1a398
feat(ci): split release pipeline and add Docker caching
unclecode Oct 21, 2025
f6a02c4
Merge branch 'develop' into release/v0.7.5
ntohidi Oct 21, 2025
2 changes: 1 addition & 1 deletion .gitignore
@@ -265,7 +265,7 @@ CLAUDE.md
tests/**/test_site
tests/**/reports
tests/**/benchmark_reports

test_scripts/
docs/**/data
.codecat/

10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,16 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- **πŸ”’ HTTPS Preservation for Internal Links**: New `preserve_https_for_internal_links` configuration flag
  - Maintains HTTPS scheme for internal links even when servers redirect to HTTP
  - Prevents security downgrades during deep crawling
  - Useful for security-conscious crawling and sites supporting both protocols
  - Fully backward compatible with opt-in flag (default: `False`)
  - Fixes issue #1410 where HTTPS URLs were being downgraded to HTTP
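
A minimal sketch of opting in, assuming the flag is accepted as a `CrawlerRunConfig` keyword:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Opt-in (default: False): keep internal links on HTTPS even when
    # the server redirects to HTTP during a deep crawl.
    config = CrawlerRunConfig(preserve_https_for_internal_links=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.success)

asyncio.run(main())
```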

## [0.7.3] - 2025-08-09

### Added
60 changes: 52 additions & 8 deletions README.md
@@ -27,11 +27,13 @@

Crawl4AI turns the web into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Fast, controllable, and battle-tested by a 50k+ star community.

[✨ Check out latest update v0.7.4](#-recent-updates)
[✨ Check out latest update v0.7.5](#-recent-updates)

✨ New in v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)
✨ New in v0.7.5: Docker Hooks System for pipeline customization, Enhanced LLM Integration with custom providers, HTTPS Preservation, and multiple community-reported bug fixes. [Release notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)

✨ Recent v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
✨ Recent v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)

✨ Previous v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)

<details>
<summary>πŸ€“ <strong>My Personal Story</strong></summary>
@@ -304,9 +306,9 @@ The new Docker implementation includes:
### Getting Started

```bash
# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.7.0
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0
# Pull and run the latest release
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# Visit the playground at http://localhost:11235/playground
```
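
A quick liveness check from Python; the `/health` endpoint and its JSON payload shape are assumptions based on the Docker server docs:

```python
import requests

# Assumes the container started above is listening on localhost:11235.
resp = requests.get("http://localhost:11235/health", timeout=10)
print(resp.status_code, resp.json())
```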
@@ -373,7 +375,7 @@ async def main():

async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://docs.micronaut.io/4.7.6/guide/",
url="https://docs.micronaut.io/4.9.9/guide/",
config=run_config
)
print(len(result.markdown.raw_markdown))
@@ -425,7 +427,7 @@ async def main():
"type": "attribute",
"attribute": "src"
}
}
]
}

extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
@@ -544,6 +546,48 @@ async def test_news_crawl():

## ✨ Recent Updates

<details>
<summary><strong>Version 0.7.5 Release Highlights - The Docker Hooks & Security Update</strong></summary>

- **πŸ”§ Docker Hooks System**: Complete pipeline customization with user-provided Python functions:
```python
import requests

# Real working hooks for httpbin.org
hooks_config = {
"on_page_context_created": """
async def hook(page, context, **kwargs):
print("Hook: Setting up page context")
# Block images to speed up crawling
await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
return page
""",
"before_goto": """
async def hook(page, context, url, **kwargs):
print(f"Hook: About to navigate to {url}")
# Add custom headers
await page.set_extra_http_headers({'X-Test-Header': 'crawl4ai-hooks-test'})
return page
"""
}

# Test with Docker API
payload = {
"urls": ["https://httpbin.org/html"],
"hooks": {"code": hooks_config, "timeout": 30}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
```
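
Continuing the snippet above, a minimal way to inspect the result; the exact response schema is an assumption here, so the sketch prints the top-level keys before drilling in:

```python
data = response.json()
print(list(data.keys()))  # discover the actual schema first

# "results" and its per-item fields are assumptions, not guaranteed:
for item in data.get("results", []):
    print(item.get("url"), item.get("success"))
```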

- **πŸ€– Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration (see the sketch after this list)
- **πŸ”’ HTTPS Preservation**: Secure internal link handling with `preserve_https_for_internal_links=True`
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
- **πŸ› οΈ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration

[Full v0.7.5 Release Notes β†’](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)

</details>

<details>
<summary><strong>Version 0.7.4 Release Highlights - The Intelligent Table Extraction & Performance Update</strong></summary>

2 changes: 1 addition & 1 deletion crawl4ai/__version__.py
@@ -1,7 +1,7 @@
# crawl4ai/__version__.py

# This is the version that will be used for stable releases
__version__ = "0.7.4"
__version__ = "0.7.5"

# For nightly builds, this gets set during build process
__nightly_version__ = None
78 changes: 60 additions & 18 deletions crawl4ai/adaptive_crawler.py
@@ -19,7 +19,7 @@
from pathlib import Path

from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig, LinkPreviewConfig
from crawl4ai.async_configs import CrawlerRunConfig, LinkPreviewConfig, LLMConfig
from crawl4ai.models import Link, CrawlResult
import numpy as np

@@ -178,7 +178,7 @@ class AdaptiveConfig:

# Embedding strategy parameters
embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
embedding_llm_config: Optional[Dict] = None # Separate config for embeddings
embedding_llm_config: Optional[Union[LLMConfig, Dict]] = None # Separate config for embeddings
n_query_variations: int = 10
coverage_threshold: float = 0.85
alpha_shape_alpha: float = 0.5
@@ -250,6 +250,30 @@ def validate(self):
assert 0 <= self.embedding_quality_max_confidence <= 1, "embedding_quality_max_confidence must be between 0 and 1"
assert self.embedding_quality_scale_factor > 0, "embedding_quality_scale_factor must be positive"
assert 0 <= self.embedding_min_confidence_threshold <= 1, "embedding_min_confidence_threshold must be between 0 and 1"

@property
def _embedding_llm_config_dict(self) -> Optional[Dict]:
"""Convert LLMConfig to dict format for backward compatibility."""
if self.embedding_llm_config is None:
return None

if isinstance(self.embedding_llm_config, dict):
# Already a dict - return as-is for backward compatibility
return self.embedding_llm_config

# Convert LLMConfig object to dict format
return {
'provider': self.embedding_llm_config.provider,
'api_token': self.embedding_llm_config.api_token,
'base_url': getattr(self.embedding_llm_config, 'base_url', None),
'temperature': getattr(self.embedding_llm_config, 'temperature', None),
'max_tokens': getattr(self.embedding_llm_config, 'max_tokens', None),
'top_p': getattr(self.embedding_llm_config, 'top_p', None),
'frequency_penalty': getattr(self.embedding_llm_config, 'frequency_penalty', None),
'presence_penalty': getattr(self.embedding_llm_config, 'presence_penalty', None),
'stop': getattr(self.embedding_llm_config, 'stop', None),
'n': getattr(self.embedding_llm_config, 'n', None),
}


class CrawlStrategy(ABC):
@@ -593,7 +617,7 @@ def _get_document_terms(self, crawl_result: CrawlResult) -> List[str]:
class EmbeddingStrategy(CrawlStrategy):
"""Embedding-based adaptive crawling using semantic space coverage"""

def __init__(self, embedding_model: str = None, llm_config: Dict = None):
def __init__(self, embedding_model: str = None, llm_config: Union[LLMConfig, Dict] = None):
self.embedding_model = embedding_model or "sentence-transformers/all-MiniLM-L6-v2"
self.llm_config = llm_config
self._embedding_cache = {}
@@ -605,14 +629,24 @@ def __init__(self, embedding_model: str = None, llm_config: Dict = None):
self._kb_embeddings_hash = None # Track KB changes
self._validation_embeddings_cache = None # Cache validation query embeddings
self._kb_similarity_threshold = 0.95 # Threshold for deduplication

def _get_embedding_llm_config_dict(self) -> Dict:
"""Get embedding LLM config as dict with fallback to default."""
if hasattr(self, 'config') and self.config:
config_dict = self.config._embedding_llm_config_dict
if config_dict:
return config_dict

# Fallback to default if no config provided
return {
'provider': 'openai/text-embedding-3-small',
'api_token': os.getenv('OPENAI_API_KEY')
}

async def _get_embeddings(self, texts: List[str]) -> Any:
"""Get embeddings using configured method"""
from .utils import get_text_embeddings
embedding_llm_config = {
'provider': 'openai/text-embedding-3-small',
'api_token': os.getenv('OPENAI_API_KEY')
}
embedding_llm_config = self._get_embedding_llm_config_dict()
return await get_text_embeddings(
texts,
embedding_llm_config,
@@ -679,8 +713,20 @@ async def map_query_semantic_space(self, query: str, n_synthetic: int = 10) -> A
Return as a JSON array of strings."""

# Use the LLM for query generation
provider = self.llm_config.get('provider', 'openai/gpt-4o-mini') if self.llm_config else 'openai/gpt-4o-mini'
api_token = self.llm_config.get('api_token') if self.llm_config else None
# Convert LLMConfig to dict if needed
llm_config_dict = None
if self.llm_config:
if isinstance(self.llm_config, dict):
llm_config_dict = self.llm_config
else:
# Convert LLMConfig object to dict
llm_config_dict = {
'provider': self.llm_config.provider,
'api_token': self.llm_config.api_token
}

provider = llm_config_dict.get('provider', 'openai/gpt-4o-mini') if llm_config_dict else 'openai/gpt-4o-mini'
api_token = llm_config_dict.get('api_token') if llm_config_dict else None

Comment on lines +728 to 730

⚠️ Potential issue | πŸ”΄ Critical

Hardcoded query variations (fried rice) are a blocker.

map_query_semantic_space returns a static, irrelevant set of queries. This breaks adaptive crawling for any real query.

Immediate fix: call the LLM when configured; fall back to simple perturbations otherwise.

-        # response = perform_completion_with_backoff(
-        #     provider=provider,
-        #     prompt_with_variables=prompt,
-        #     api_token=api_token,
-        #     json_response=True
-        # )
-        # variations = json.loads(response.choices[0].message.content)
-        # # Mock data with more variations for split
-        variations ={'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', 'can you provide a quick recipe for vegetable fried rice?', 'what cooking techniques are essential for perfect fried rice with vegetables?', 'how to add flavor to vegetable fried rice?', 'are there any tips for making healthy fried rice with vegetables?']}
+        variations = None
+        if llm_config_dict:
+            try:
+                response = perform_completion_with_backoff(
+                    provider=llm_config_dict.get('provider', 'openai/gpt-4o-mini'),
+                    prompt_with_variables=prompt,
+                    api_token=llm_config_dict.get('api_token'),
+                    json_response=True
+                )
+                variations = json.loads(response.choices[0].message.content)
+            except Exception:
+                variations = None
+        if not variations or 'queries' not in variations:
+            # Lightweight deterministic fallback
+            base = query.strip().rstrip("?")
+            variations = {'queries': [
+                base,
+                f"{base} overview",
+                f"{base} tutorial",
+                f"{base} examples",
+                f"{base} best practices",
+                f"{base} troubleshooting",
+                f"{base} advanced",
+                f"{base} quick start",
+                f"{base} guide",
+                f"{base} faq",
+                f"{base} tips"
+            ]}
@@
-        other_queries = variations['queries'].copy()
+        other_queries = [q for q in variations['queries'] if q != query]

Also applies to: 741-744, 765-781

🧰 Tools
πŸͺ› Ruff (0.13.1)

728-728: Local variable `provider` is assigned to but never used; remove assignment to unused variable `provider` (F841)

729-729: Local variable `api_token` is assigned to but never used; remove assignment to unused variable `api_token` (F841)

# response = perform_completion_with_backoff(
# provider=provider,
@@ -843,10 +889,7 @@ async def select_links_for_expansion(

# Batch embed only uncached links
if texts_to_embed:
embedding_llm_config = {
'provider': 'openai/text-embedding-3-small',
'api_token': os.getenv('OPENAI_API_KEY')
}
embedding_llm_config = self._get_embedding_llm_config_dict()
new_embeddings = await get_text_embeddings(texts_to_embed, embedding_llm_config, self.embedding_model)

# Cache the new embeddings
@@ -1184,10 +1227,7 @@ async def update_state(self, state: CrawlState, new_results: List[CrawlResult])
return

# Get embeddings for new texts
embedding_llm_config = {
'provider': 'openai/text-embedding-3-small',
'api_token': os.getenv('OPENAI_API_KEY')
}
embedding_llm_config = self._get_embedding_llm_config_dict()
new_embeddings = await get_text_embeddings(new_texts, embedding_llm_config, self.embedding_model)

# Deduplicate embeddings before adding to KB
@@ -1256,10 +1296,12 @@ def _create_strategy(self, strategy_name: str) -> CrawlStrategy:
if strategy_name == "statistical":
return StatisticalStrategy()
elif strategy_name == "embedding":
return EmbeddingStrategy(
strategy = EmbeddingStrategy(
embedding_model=self.config.embedding_model,
llm_config=self.config.embedding_llm_config
)
strategy.config = self.config # Pass config to strategy
return strategy
else:
raise ValueError(f"Unknown strategy: {strategy_name}")

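Taken together, these changes let `embedding_llm_config` accept either a plain dict (backward compatible) or an `LLMConfig` object. A minimal sketch; the `strategy` field name is an assumption based on `_create_strategy` above:

```python
import os

from crawl4ai.adaptive_crawler import AdaptiveConfig
from crawl4ai.async_configs import LLMConfig

# New style: pass an LLMConfig object.
config_from_object = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config=LLMConfig(
        provider="openai/text-embedding-3-small",
        api_token=os.getenv("OPENAI_API_KEY"),
    ),
)

# Old style: a plain dict still works.
config_from_dict = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        "provider": "openai/text-embedding-3-small",
        "api_token": os.getenv("OPENAI_API_KEY"),
    },
)
```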