Fix URL replacement for percent-encoded URLs by dkehne · Pull Request #4152 · digitalfabrik/integreat-cms

dkehne · 2026-02-17T01:28:38Z

Summary

Fixes URL replacement function not working correctly #4129: The linkchecker's "Replace URL" button silently fails for URLs containing non-ASCII characters (e.g., Cyrillic) due to an encoding mismatch between lxml's rewrite_links() callback and the linkcheck Url.url dict keys
Normalizes URL encoding in replace_urls() by building a lookup dict that maps both raw and unquote()-decoded URLs to their replacements
Re-enables previously commented-out test assertions that verify replacement results

Test plan

Run tools/test.sh tests/cms/views/link_replace/test_link_actions.py to verify existing and re-enabled assertions pass
Manually test replacing a URL containing non-ASCII characters (e.g., Cyrillic) via the linkchecker's pencil icon

🤖 Generated with Claude Code

Normalize URL encoding in replace_urls() so that percent-encoded URLs in HTML match against decoded dict keys from linkcheck's Url.url. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dkehne · 2026-03-10T12:18:49Z

Why this is a bug

rewrite_links() from lxml extracts URLs exactly as they appear in the HTML source — for non-ASCII URLs that means percent-encoded form, e.g. https://example.com/%D0%B3%D0%BE%D1%80%D0%BE%D0%B4 (Cyrillic "город"). The urls_to_replace dict, however, is built from Url.url database entries, which store the decoded form: https://example.com/город. So dict.get("%D0%B3%D0%BE%D1%80%D0%BE%D0%B4") returns None, the lambda passes the URL through unchanged, and the replacement silently does nothing — no error, no log message.

This is also why two test assertions verifying that the new URL count increased were commented out: they reliably failed because the replacement never actually happened.

How the fix works

Instead of a direct dict lookup, resolve_url() builds a normalized map that contains both the raw and unquote()-decoded form of every key, then checks both representations of the incoming URL. This covers the four possible combinations (stored raw/decoded × extracted raw/decoded) without changing any data at rest.

hannaseithe

I have to say that I struggle with reviewing this quite a bit, for the following reasons:

From the original bug, I cannot tell for sure what the actual issue was:

did using a decoded url for replacing not work or a encoded?
When working on an issue the bug description should clearly describe how to reproduce the bug.

When trying to reproduce I also run into the problem on local that Link objects are not always created on save, and so far I could not figure out why. I was able to create link objects when adding a link on an arabic page, but not on an Ukrainian. See: https://chat.tuerantuer.org/digitalfabrik/pl/z1btgf91m388ddratay3u9ac4r

Regarding your code, as far as I can tell for now:
Unquoting LTR languages might actually work the way you suggested it, but at least with RTL languages like arabic this will not solve it, as unquoting a percentage encoded url gives you the arabic letters in the wrong (LTR) order. Furthermore I am not sure if this is maybe also Unicode normalization issue (NFC vs NFD)

dkehne · 2026-03-13T12:53:56Z

@hannaseithe we had this meeting on 2nd February where Salua screenshared the problem occuring here and therefore the issue was created. it was a decoded url hat couldnt' be replaced, a url that occured at multiple (20+ places) in Integreat.

urllib.parse.unquote() decodes percent-encoded bytes to Unicode regardless of texirection. RTL (Arabic, Hebrew, etc.) is a rendering concern, not an encoding concern. The concern about "wrong character order" for RTL seems to conflate visual display order with Unicode storage. unquote('%D8%B9%D8%B1%D8%A8%D9%8A') correctly returns 'عربي'.

On the test assertions: Looking at the current test file, lines 93-95 and 187-192 are still wrapped in triple-quoted strings (dead code, not actual assertions). The PR reportedly re-enables them worth verifying that they become proper assert statements and not just docstrings on a stray expression.

To the "link objects not being created" observation: This is a separate issue from what the PR fixes. The linkcheck library uses Django post_save signal listeners to create Link objects. The tests explicitly need with enable_listeners(): (line 70/162 in the test file), meaning they're disabled by default in tests and likely also need explicit activation in the local dev environment. The intermittent creation after the 7th save is suspicious and may relate to the linkcheck scanner running asynchronously, not the replace_urls() code path.

The actual PR bug/fix is clean — the encoding mismatch between lxml (which percent-encodes non-ASCII URLs when extracting from HTML) and the DB-stored decoded Unicode form is real, and building a normalized lookup dict covering both forms is the right approach. Unicode normalization (NFC/NFD) is a separate orthogonal concern and doesn't need to block this fix.

hannaseithe · 2026-03-16T09:40:35Z

I am sorry, but I cannot reproduce the original bug (as I still do not really understand it). I am taking myself off the reviewer list, since I have already invested several hours and did not manage to progress. I hope someone else has deeper insights and feels comfortable reviewing.

dkehne · 2026-03-16T09:49:38Z

I have asked internally to deliver a testcase so that we are able to reproduce it. If no one can provide such an example i will close this issue with the assumption that it is fixed.

dkehne · 2026-03-26T08:12:42Z

@nassabay do you have any example cases at the moment that would enable the cms-team to test? in the original issue (#4129) you already delivered some cases, could you deliver more (current ones)?

Fix URL replacement not working for percent-encoded URLs (#4129)

b6aeec9

Normalize URL encoding in replace_urls() so that percent-encoded URLs in HTML match against decoded dict keys from linkcheck's Url.url. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

hannaseithe added the needs-reviewer label Feb 17, 2026

hannaseithe self-assigned this Feb 23, 2026

hannaseithe self-requested a review February 23, 2026 09:40

hannaseithe reviewed Mar 12, 2026

View reviewed changes

hannaseithe removed their assignment Mar 16, 2026

hannaseithe added this to the Next milestone Mar 17, 2026

jarlhengstmengel removed this from the Next milestone Mar 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix URL replacement for percent-encoded URLs#4152

Fix URL replacement for percent-encoded URLs#4152
dkehne wants to merge 1 commit intodevelopfrom
fix/url-replace-percent-encoded

dkehne commented Feb 17, 2026

Uh oh!

dkehne commented Mar 10, 2026

Uh oh!

hannaseithe left a comment

Uh oh!

dkehne commented Mar 13, 2026

Uh oh!

hannaseithe commented Mar 16, 2026

Uh oh!

dkehne commented Mar 16, 2026

Uh oh!

dkehne commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

dkehne commented Feb 17, 2026

Summary

Test plan

Uh oh!

dkehne commented Mar 10, 2026

Why this is a bug

How the fix works

Uh oh!

hannaseithe left a comment

Choose a reason for hiding this comment

Uh oh!

dkehne commented Mar 13, 2026

Uh oh!

hannaseithe commented Mar 16, 2026

Uh oh!

dkehne commented Mar 16, 2026

Uh oh!

dkehne commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants