Skip to content

Fix URL replacement for percent-encoded URLs#4152

Open
dkehne wants to merge 1 commit intodevelopfrom
fix/url-replace-percent-encoded
Open

Fix URL replacement for percent-encoded URLs#4152
dkehne wants to merge 1 commit intodevelopfrom
fix/url-replace-percent-encoded

Conversation

@dkehne
Copy link
Copy Markdown
Collaborator

@dkehne dkehne commented Feb 17, 2026

Summary

  • Fixes URL replacement function not working correctly #4129: The linkchecker's "Replace URL" button silently fails for URLs containing non-ASCII characters (e.g., Cyrillic) due to an encoding mismatch between lxml's rewrite_links() callback and the linkcheck Url.url dict keys
  • Normalizes URL encoding in replace_urls() by building a lookup dict that maps both raw and unquote()-decoded URLs to their replacements
  • Re-enables previously commented-out test assertions that verify replacement results

Test plan

  • Run tools/test.sh tests/cms/views/link_replace/test_link_actions.py to verify existing and re-enabled assertions pass
  • Manually test replacing a URL containing non-ASCII characters (e.g., Cyrillic) via the linkchecker's pencil icon

🤖 Generated with Claude Code

Normalize URL encoding in replace_urls() so that percent-encoded URLs
in HTML match against decoded dict keys from linkcheck's Url.url.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hannaseithe hannaseithe self-assigned this Feb 23, 2026
@hannaseithe hannaseithe self-requested a review February 23, 2026 09:40
@dkehne
Copy link
Copy Markdown
Collaborator Author

dkehne commented Mar 10, 2026

Why this is a bug

rewrite_links() from lxml extracts URLs exactly as they appear in the HTML source — for non-ASCII URLs that means percent-encoded form, e.g. https://example.com/%D0%B3%D0%BE%D1%80%D0%BE%D0%B4 (Cyrillic "город"). The urls_to_replace dict, however, is built from Url.url database entries, which store the decoded form: https://example.com/город. So dict.get("%D0%B3%D0%BE%D1%80%D0%BE%D0%B4") returns None, the lambda passes the URL through unchanged, and the replacement silently does nothing — no error, no log message.

This is also why two test assertions verifying that the new URL count increased were commented out: they reliably failed because the replacement never actually happened.

How the fix works

Instead of a direct dict lookup, resolve_url() builds a normalized map that contains both the raw and unquote()-decoded form of every key, then checks both representations of the incoming URL. This covers the four possible combinations (stored raw/decoded × extracted raw/decoded) without changing any data at rest.

Copy link
Copy Markdown
Contributor

@hannaseithe hannaseithe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to say that I struggle with reviewing this quite a bit, for the following reasons:

From the original bug, I cannot tell for sure what the actual issue was:

  • did using a decoded url for replacing not work or a encoded?
    When working on an issue the bug description should clearly describe how to reproduce the bug.

When trying to reproduce I also run into the problem on local that Link objects are not always created on save, and so far I could not figure out why. I was able to create link objects when adding a link on an arabic page, but not on an Ukrainian. See: https://chat.tuerantuer.org/digitalfabrik/pl/z1btgf91m388ddratay3u9ac4r

Regarding your code, as far as I can tell for now:
Unquoting LTR languages might actually work the way you suggested it, but at least with RTL languages like arabic this will not solve it, as unquoting a percentage encoded url gives you the arabic letters in the wrong (LTR) order. Furthermore I am not sure if this is maybe also Unicode normalization issue (NFC vs NFD)

@dkehne
Copy link
Copy Markdown
Collaborator Author

dkehne commented Mar 13, 2026

@hannaseithe we had this meeting on 2nd February where Salua screenshared the problem occuring here and therefore the issue was created. it was a decoded url hat couldnt' be replaced, a url that occured at multiple (20+ places) in Integreat.

urllib.parse.unquote() decodes percent-encoded bytes to Unicode regardless of texirection. RTL (Arabic, Hebrew, etc.) is a rendering concern, not an encoding concern. The concern about "wrong character order" for RTL seems to conflate visual display order with Unicode storage. unquote('%D8%B9%D8%B1%D8%A8%D9%8A') correctly returns 'عربي'.

On the test assertions: Looking at the current test file, lines 93-95 and 187-192 are still wrapped in triple-quoted strings (dead code, not actual assertions). The PR reportedly re-enables them worth verifying that they become proper assert statements and not just docstrings on a stray expression.

To the "link objects not being created" observation: This is a separate issue from what the PR fixes. The linkcheck library uses Django post_save signal listeners to create Link objects. The tests explicitly need with enable_listeners(): (line 70/162 in the test file), meaning they're disabled by default in tests and likely also need explicit activation in the local dev environment. The intermittent creation after the 7th save is suspicious and may relate to the linkcheck scanner running asynchronously, not the replace_urls() code path.

The actual PR bug/fix is clean — the encoding mismatch between lxml (which percent-encodes non-ASCII URLs when extracting from HTML) and the DB-stored decoded Unicode form is real, and building a normalized lookup dict covering both forms is the right approach. Unicode normalization (NFC/NFD) is a separate orthogonal concern and doesn't need to block this fix.

@hannaseithe
Copy link
Copy Markdown
Contributor

I am sorry, but I cannot reproduce the original bug (as I still do not really understand it). I am taking myself off the reviewer list, since I have already invested several hours and did not manage to progress. I hope someone else has deeper insights and feels comfortable reviewing.

@hannaseithe hannaseithe removed their assignment Mar 16, 2026
@dkehne
Copy link
Copy Markdown
Collaborator Author

dkehne commented Mar 16, 2026

I have asked internally to deliver a testcase so that we are able to reproduce it. If no one can provide such an example i will close this issue with the assumption that it is fixed.

@hannaseithe hannaseithe added this to the Next milestone Mar 17, 2026
@jarlhengstmengel jarlhengstmengel removed this from the Next milestone Mar 17, 2026
@dkehne
Copy link
Copy Markdown
Collaborator Author

dkehne commented Mar 26, 2026

@nassabay do you have any example cases at the moment that would enable the cms-team to test? in the original issue (#4129) you already delivered some cases, could you deliver more (current ones)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

URL replacement function not working correctly

4 participants