Normalize URL encoding in replace_urls() so that percent-encoded URLs in HTML match against decoded dict keys from linkcheck's Url.url. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Why this is a bug
This is also why two test assertions verifying that the new URL count increased were commented out: they reliably failed because the replacement never actually happened.
How the fix works
Instead of a direct dict lookup,
hannaseithe
left a comment
I have to say that I struggle with reviewing this quite a bit, for the following reasons:
From the original bug, I cannot tell for sure what the actual issue was:
- Did the replacement fail for a decoded URL or for an encoded one?
When working on an issue the bug description should clearly describe how to reproduce the bug.
When trying to reproduce it locally, I also ran into the problem that Link objects are not always created on save, and so far I could not figure out why. I was able to create Link objects when adding a link on an Arabic page, but not on a Ukrainian one. See: https://chat.tuerantuer.org/digitalfabrik/pl/z1btgf91m388ddratay3u9ac4r
Regarding your code, as far as I can tell for now:
Unquoting URLs in LTR languages might actually work the way you suggested, but at least with RTL languages like Arabic this will not solve it, as unquoting a percent-encoded URL gives you the Arabic letters in the wrong (LTR) order. Furthermore, I am not sure whether this is also a Unicode normalization issue (NFC vs NFD).
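To make the NFC vs NFD question concrete, here is a minimal sketch using only the Python standard library. The example strings are hypothetical and not taken from Integreat data; the snippet only illustrates that two visually identical strings can percent-encode differently when their normalization forms differ.

```python
import unicodedata
from urllib.parse import quote

# Hypothetical example strings: the same visible text in two normalization forms.
nfc = "caf\u00e9"    # precomposed "é" (NFC)
nfd = "cafe\u0301"   # "e" followed by a combining acute accent (NFD)

# The strings render identically but compare unequal and encode differently.
same_text = nfc == nfd
encoded_nfc = quote(nfc)
encoded_nfd = quote(nfd)

# Normalizing both sides to one form makes them comparable again.
normalized_equal = unicodedata.normalize("NFC", nfd) == nfc
```

If normalization turned out to matter here, normalizing both the dict keys and the extracted URLs to the same form before comparing would be one option; whether the codebase needs that is not established in this thread.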
|
@hannaseithe we had a meeting on 2nd February where Salua screen-shared the problem occurring here, and the issue was created as a result. It was a decoded URL that couldn't be replaced, a URL that occurred in many places (20+) in Integreat.

urllib.parse.unquote() decodes percent-encoded bytes to Unicode regardless of text direction. RTL (Arabic, Hebrew, etc.) is a rendering concern, not an encoding concern. The concern about "wrong character order" for RTL seems to conflate visual display order with Unicode storage order. unquote('%D8%B9%D8%B1%D8%A8%D9%8A') correctly returns 'عربي'.

On the test assertions: looking at the current test file, lines 93-95 and 187-192 are still wrapped in triple-quoted strings (dead code, not actual assertions). The PR reportedly re-enables them; it is worth verifying that they become proper assert statements and not just string literals left as stray expressions.

On the "Link objects not being created" observation: this is a separate issue from what the PR fixes. The linkcheck library uses Django post_save signal listeners to create Link objects. The tests explicitly need with enable_listeners(): (lines 70 and 162 in the test file), meaning the listeners are disabled by default in tests and likely also need explicit activation in the local dev environment. The intermittent creation after the 7th save is suspicious and may relate to the linkcheck scanner running asynchronously, not to the replace_urls() code path.

The actual PR bug/fix is clean: the encoding mismatch between lxml (which percent-encodes non-ASCII URLs when extracting them from HTML) and the decoded Unicode form stored in the DB is real, and building a normalized lookup dict covering both forms is the right approach. Unicode normalization (NFC/NFD) is a separate, orthogonal concern and doesn't need to block this fix.
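The two points above (unquote() returning characters in logical order, and a lookup dict covering both URL forms) can be sketched as follows. This is a minimal illustration, not the PR's actual code: build_lookup, resolve, and the example.com URLs are hypothetical names chosen for the sketch, and the real replace_urls() may be structured differently.

```python
from urllib.parse import unquote


def build_lookup(url_replacements: dict[str, str]) -> dict[str, str]:
    """Map both the raw and the unquote()-decoded form of each old URL to its replacement."""
    lookup: dict[str, str] = {}
    for old_url, new_url in url_replacements.items():
        lookup[old_url] = new_url
        # Do not overwrite an exact key with a decoded alias.
        lookup.setdefault(unquote(old_url), new_url)
    return lookup


def resolve(href: str, lookup: dict[str, str]) -> str:
    """Return the replacement for href, trying the raw form first, then the decoded form."""
    if href in lookup:
        return lookup[href]
    return lookup.get(unquote(href), href)


# Hypothetical data: the stored key is decoded, the HTML attribute is percent-encoded.
DECODED = "https://example.com/\u0639\u0631\u0628\u064a"   # path is the Arabic word from the comment
ENCODED = "https://example.com/%D8%B9%D8%B1%D8%A8%D9%8A"
NEW = "https://example.com/new"

lookup = build_lookup({DECODED: NEW})
replaced = resolve(ENCODED, lookup)
```

Here the first code point of unquote('%D8%B9%D8%B1%D8%A8%D9%8A') is U+0639, the first letter of the word in logical (storage) order; the right-to-left appearance is applied only when the string is rendered.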
|
I am sorry, but I cannot reproduce the original bug (as I still do not really understand it). I am taking myself off the reviewer list, since I have already invested several hours and did not manage to progress. I hope someone else has deeper insights and feels comfortable reviewing.
|
I have asked internally for a test case so that we are able to reproduce it. If no one can provide such an example, I will close this issue on the assumption that it is fixed.
Summary
- rewrite_links() callback and the linkcheck Url.url dict keys
- replace_urls() by building a lookup dict that maps both raw and unquote()-decoded URLs to their replacements
Test plan
- tools/test.sh tests/cms/views/link_replace/test_link_actions.py to verify existing and re-enabled assertions pass
🤖 Generated with Claude Code