Skip to content

Conversation

@twm
Copy link
Contributor

@twm twm commented May 28, 2025

I stumbled across a feed with an <img src="..." srcset="..."> where the src was busted, but the srcset is good, so here we are: this teaches the relative URL resolver how to resolve relative URLs within the srcset attribute.

I also added srcset to the list of allowed attributes in the sanitizer so that the test would work. There's a caveat to this: the width descriptor form of srcset require the sizes attribute too, but that attribute isn't allowed by the sanitizer (and it contains CSS media queries, so that'd be nontrivial to add). A browser will ignore the srcset in that case, so I think the end result is status quo.

I wasn't sure how many tests to add, or exactly where to put them. Please advise!

twm and others added 4 commits May 27, 2025 22:22
normalize_attributes always does a copy when the list is non-empty, so
there's no need to copy again.
@kurtmckee
Copy link
Owner

Hi Tom!

I don't want to add to feedparser's custom HTML sanitizing code, as feedparser needs to contract its responsibilities down to just parsing feeds to become maintainable again. Instead, I'd prefer to see a PR that eliminates feedparser's HTML sanitizing code and replaces it with a Python library that is dedicated to that purpose. You may remember #257, where I thought that bleach might be a package that could achieve this goal.

Could html-sanitizer be an alternative?

@twm
Copy link
Contributor Author

twm commented May 30, 2025

I am skeptical of html-sanitizer — it depends on lxml which is ultimately backed by libxml2 and libxslt. Those projects are in a rough spot right now so I'd hesitate to move in that direction. (Also, it depends on beautifulsoup4. Perhaps that library has improved on its molasses-slow predecessors but I'd benchmark before adopting it.)

On my own project I'm still using html5lib + lxml for sanitization and manipulation, but I've had my eye out for alternatives. I haven't seen anything else as flexible yet.

For sanitization alone, I've used nh3 in the past. This is backed by the Rust ammonia and html5ever libraries, so less likely to blow your hand off than C libxml2. Last I checked (admittedly, a few years ago) html5everalso passed more of the html5lib-tests than html5lib itself, so more compliant. Caveat: html5ever is part of the Servo project, which while revived in 2023 has had fallow periods. It still seems safer for feedparser's users than lxml. I think that the main risk in adopting nh3 is that it stops producing wheels for new Pythons in the future.

For this PR, I didn't want to touch sanitization at all (again, I don't use it!) but it is applied by default to the test vectors, so it would drop the srcset attribute that I need to assert on. I could drop this test vector and write an equivalent test that only applies to the URL resolution logic. Does that sound reasonable to you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants