Resolve relative `srcset` URLs #509

twm · 2025-05-28T06:04:39Z

I stumbled across a feed with an <img src="..." srcset="..."> where the src was busted, but the srcset is good, so here we are: this teaches the relative URL resolver how to resolve relative URLs within the srcset attribute.

I also added srcset to the list of allowed attributes in the sanitizer so that the test would work. There's a caveat to this: the width descriptor form of srcset require the sizes attribute too, but that attribute isn't allowed by the sanitizer (and it contains CSS media queries, so that'd be nontrivial to add). A browser will ignore the srcset in that case, so I think the end result is status quo.

I wasn't sure how many tests to add, or exactly where to put them. Please advise!

normalize_attributes always does a copy when the list is non-empty, so there's no need to copy again.

for more information, see https://pre-commit.ci

kurtmckee · 2025-05-28T12:34:22Z

Hi Tom!

I don't want to add to feedparser's custom HTML sanitizing code, as feedparser needs to contract its responsibilities down to just parsing feeds to become maintainable again. Instead, I'd prefer to see a PR that eliminates feedparser's HTML sanitizing code and replaces it with a Python library that is dedicated to that purpose. You may remember #257, where I thought that bleach might be a package that could achieve this goal.

Could html-sanitizer be an alternative?

twm · 2025-05-30T06:38:04Z

I am skeptical of html-sanitizer — it depends on lxml which is ultimately backed by libxml2 and libxslt. Those projects are in a rough spot right now so I'd hesitate to move in that direction. (Also, it depends on beautifulsoup4. Perhaps that library has improved on its molasses-slow predecessors but I'd benchmark before adopting it.)

On my own project I'm still using html5lib + lxml for sanitization and manipulation, but I've had my eye out for alternatives. I haven't seen anything else as flexible yet.

For sanitization alone, I've used nh3 in the past. This is backed by the Rust ammonia and html5ever libraries, so less likely to blow your hand off than C libxml2. Last I checked (admittedly, a few years ago) html5everalso passed more of the html5lib-tests than html5lib itself, so more compliant. Caveat: html5ever is part of the Servo project, which while revived in 2023 has had fallow periods. It still seems safer for feedparser's users than lxml. I think that the main risk in adopting nh3 is that it stops producing wheels for new Pythons in the future.

For this PR, I didn't want to touch sanitization at all (again, I don't use it!) but it is applied by default to the test vectors, so it would drop the srcset attribute that I need to assert on. I could drop this test vector and write an equivalent test that only applies to the URL resolution logic. Does that sound reasonable to you?

twm and others added 4 commits May 27, 2025 22:22

Avoid a copy of the attribute list

09e2c0f

normalize_attributes always does a copy when the list is non-empty, so there's no need to copy again.

Resolve relative srcset URLs

6558d8d

Add changefragment

ca863e5

[pre-commit.ci] auto fixes from pre-commit.com hooks

95ad1e5

for more information, see https://pre-commit.ci

twm mentioned this pull request May 28, 2025

Pass through pixel density srcset when sanitizing twm/yarrharr#1292

Merged

twm mentioned this pull request Jun 1, 2025

Support <img srcset> twm/yarrharr#1290

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Resolve relative `srcset` URLs #509

Resolve relative `srcset` URLs #509

Uh oh!

twm commented May 28, 2025

Uh oh!

kurtmckee commented May 28, 2025

Uh oh!

twm commented May 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Resolve relative srcset URLs #509

Are you sure you want to change the base?

Resolve relative srcset URLs #509

Uh oh!

Conversation

twm commented May 28, 2025

Uh oh!

kurtmckee commented May 28, 2025

Uh oh!

twm commented May 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Resolve relative `srcset` URLs #509

Resolve relative `srcset` URLs #509