Skip to content

Conversation

Ahmed-Tawfik94
Copy link
Collaborator

Summary

Fixes #1268 - URL resolution after JavaScript redirects

The issue was that when JavaScript redirects occurred during crawling, the crawler was capturing the original URL instead of the final redirected URL. This caused:

  • Incorrect link resolution for internal navigation
  • Broken relative URL handling after redirects
  • Inaccurate external link detection

Root cause: The crawler was capturing url (initial URL) instead of page.url (final URL after JavaScript navigation).

Solution: Updated the redirect URL capture timing to use the final page.url after JavaScript navigation completes, ensuring all link processing works correctly with the actual final URL.

List of files changed and why

  • crawl4ai/utils.py - Enhanced URL processing functions:

    • Improved normalize_url() with better empty href handling, port normalization, and validation
    • Enhanced is_external_url() for proper www subdomain and port handling in domain comparison
  • crawl4ai/async_crawler_strategy.py - Fixed redirect URL capture timing:

    • Changed from capturing initial URL to final page.url after JavaScript navigation
    • Added comment explaining the change for future maintainers
  • crawl4ai/async_webcrawler.py - Updated URL parameter passing:

    • Ensured redirected_url is correctly passed to scraping strategy
    • Updated markdown generation to use redirected URL as base URL
  • crawl4ai/content_scraping_strategy.py - Minor adjustments for URL handling consistency:

    • Updated to use redirected_url parameter when available for link processing

How Has This Been Tested?

Regression Tests

  • All existing URL normalization tests still pass (20/20) ✅
  • No breaking changes to existing functionality

Performance Testing

  • URL functions execute efficiently (~0.01-0.02 seconds for 500 operations) ✅

Checklist:

  • My code follows the style guidelines of this project

    • Uses proper Python conventions and type hints
    • Consistent with existing codebase patterns
    • Proper error handling and validation
  • I have performed a self-review of my own code

    • Changes are surgical and focused on the specific issue
    • No unnecessary modifications
    • Code is readable and maintainable
  • I have commented my code, particularly in hard-to-understand areas

    • Added explanatory comment in async_crawler_strategy.py for the redirect URL change
    • Functions have proper docstrings with parameter descriptions
    • Complex logic is explained where needed
  • I have made corresponding changes to the documentation

    • No documentation changes needed (internal bug fix)
    • Existing documentation remains accurate
  • I have added/updated unit tests that prove my fix is effective or that my feature works

    • Added comprehensive unit tests covering edge cases
    • Added integration test for full workflow validation
    • Tests cover both positive and negative scenarios
  • New and existing unit tests pass locally with my changes

    • All new unit tests pass ✅
    • All existing URL normalization tests pass ✅
    • Integration test passes ✅
    • No regressions introduced ✅

The fix is production-ready and addresses the exact issue described in #1268 while maintaining full backward compatibility.

Copy link
Contributor

coderabbitai bot commented Sep 8, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/relative_url

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants