#1268 fix: update redirected_url to current page URL and enhance normalize_url function #1470
Summary
Fixes #1268 - URL resolution after JavaScript redirects
The issue was that when JavaScript redirects occurred during crawling, the crawler was capturing the original URL instead of the final redirected URL. This caused:
Root cause: The crawler was capturing `url` (initial URL) instead of `page.url` (final URL after JavaScript navigation).
Solution: Updated the redirect URL capture timing to use the final `page.url` after JavaScript navigation completes, ensuring all link processing works correctly with the actual final URL.
List of files changed and why
`crawl4ai/utils.py` - Enhanced URL processing functions:
- `normalize_url()` with better empty href handling, port normalization, and validation
- `is_external_url()` for proper www subdomain and port handling in domain comparison
`crawl4ai/async_crawler_strategy.py` - Fixed redirect URL capture timing:
- `page.url` after JavaScript navigation
`crawl4ai/async_webcrawler.py` - Updated URL parameter passing:
- `redirected_url` is correctly passed to scraping strategy
`crawl4ai/content_scraping_strategy.py` - Minor adjustments for URL handling consistency:
- `redirected_url` parameter when available for link processing
How Has This Been Tested?
Regression Tests
Performance Testing
Checklist:
My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
`async_crawler_strategy.py` for the redirect URL change
I have made corresponding changes to the documentation
I have added/updated unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
The fix is production-ready and addresses the exact issue described in #1268 while maintaining full backward compatibility.
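To illustrate why the base URL matters, here is a minimal standalone sketch using only Python's standard library. It is not the code from this PR: `resolve_link` and `comparable_host` are hypothetical helpers, and the example URLs are made up. It shows how the same relative href resolves differently against the initial URL and the final URL after a redirect, and one possible way to compare hosts while ignoring a leading `www.` and default ports.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit

# Default ports per scheme, so "example.com:443" and "example.com" compare equal over https.
DEFAULT_PORTS = {"http": 80, "https": 443}


def resolve_link(href: str, base_url: str) -> Optional[str]:
    """Resolve href against base_url; return None for empty hrefs (hypothetical helper)."""
    href = (href or "").strip()
    if not href:
        return None
    return urljoin(base_url, href)


def comparable_host(url: str) -> str:
    """Lowercased host without a leading 'www.' or a default port (hypothetical helper)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    port = parts.port
    # Keep the port only when it is explicit and not the scheme's default.
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{port}"
    return host


initial_url = "https://example.com/start"          # URL passed to the crawler (assumed)
final_url = "https://example.com/app/dashboard/"   # where a JavaScript redirect might land (assumed)

# The same relative href yields different absolute URLs depending on the base.
print(resolve_link("settings", initial_url))
print(resolve_link("settings", final_url))
```

Resolving against the initial URL would give `https://example.com/settings`, while resolving against the final URL gives `https://example.com/app/dashboard/settings`, which is the kind of mismatch the summary describes.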