Skip to content

Fix WordPress import image variant URL rewrites#911

Merged
ascorbic merged 3 commits intoemdash-cms:mainfrom
masonjames:fix/wp-import-size-variant-urls
May 6, 2026
Merged

Fix WordPress import image variant URL rewrites#911
ascorbic merged 3 commits intoemdash-cms:mainfrom
masonjames:fix/wp-import-size-variant-urls

Conversation

@masonjames
Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes WordPress import media URL rewriting for content that references WordPress-generated image size variants such as hero-1024x695.jpg when the media import urlMap only contains the original attachment URL, e.g. hero.jpg.

The matcher now preserves query-only base matching, then additionally checks a WordPress image-size-suffix candidate for media URLs. String rewrites also allow the same optional size suffix while guarding against prefix rewrites inside longer filenames such as hero-1024x695.jpg.webp.

Closes #645

Type of change

  • Bug fix
  • Feature (requires maintainer-approved Discussion)
  • Refactor (no behavior change)
  • Translation
  • Documentation
  • Performance improvement
  • Tests
  • Chore (dependencies, CI, tooling)

Checklist

  • I have read CONTRIBUTING.md
  • pnpm typecheck passes
  • pnpm lint passes
  • pnpm test passes (or targeted tests for my change)
  • pnpm format has been run
  • I have added/updated tests for my changes (if applicable)
  • User-visible strings in the admin UI are wrapped for translation (if applicable). Do not include messages.po changes except in translation PRs — a workflow extracts catalogs on merge to main.
  • I have added a changeset (if this PR changes a published package)
  • New features link to an approved Discussion: N/A — bug fix

AI-generated code disclosure

  • This PR includes AI-generated code — model/tool: GPT-5 Codex

Screenshots / test output

Not visual.

Novelty / overlap check:

gh issue view 645 --repo emdash-cms/emdash
=> #645 is open with no comments

gh search prs --repo emdash-cms/emdash --state open "645 OR size-variant OR \"rewrite-urls\" OR \"WordPress import\" \"image URLs\""
=> no open PRs matching this fix

Reviewed open #607 because it also touches WordPress import; it does not modify rewrite-urls.ts and covers broader menus/settings/custom-block image import work instead.

Local checks:

pnpm --silent lint:quick
=> { "diagnostics": [] }

cd packages/core && pnpm exec vitest run tests/unit/import/wordpress-rewrite-urls.test.ts
=> Test Files 1 passed; Tests 7 passed

pnpm --filter emdash typecheck
=> passed

pnpm format
=> passed

pnpm --silent lint:json | jq '.diagnostics | length'
=> 55 existing diagnostics before and after this change; none are in files touched by this PR.

Additional review:

Adversarial review cycle 1: found two URL-matching edge cases.
Fixes applied: preserve dimension-named original attachment URLs and reject prefix matches inside longer filenames.
Adversarial review cycle 2: found punctuation boundary regression for bare string URLs.
Fix applied: accept common prose delimiters while keeping the longer-filename guard.
Adversarial review cycle 3: No blockers found.

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 4, 2026

🦋 Changeset detected

Latest commit: aec3804

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 13 packages
Name Type
emdash Patch
@emdash-cms/cloudflare Patch
@emdash-cms/fixture-perf-site Patch
@emdash-cms/perf-demo-site Patch
@emdash-cms/cache-demo-site Patch
@emdash-cms/admin Patch
@emdash-cms/auth Patch
@emdash-cms/blocks Patch
@emdash-cms/gutenberg-to-portable-text Patch
@emdash-cms/x402 Patch
create-emdash Patch
@emdash-cms/auth-atproto Patch
@emdash-cms/plugin-embeds Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 4, 2026

Open in StackBlitz

@emdash-cms/admin

npm i https://pkg.pr.new/@emdash-cms/admin@911

@emdash-cms/auth

npm i https://pkg.pr.new/@emdash-cms/auth@911

@emdash-cms/blocks

npm i https://pkg.pr.new/@emdash-cms/blocks@911

@emdash-cms/cloudflare

npm i https://pkg.pr.new/@emdash-cms/cloudflare@911

emdash

npm i https://pkg.pr.new/emdash@911

create-emdash

npm i https://pkg.pr.new/create-emdash@911

@emdash-cms/gutenberg-to-portable-text

npm i https://pkg.pr.new/@emdash-cms/gutenberg-to-portable-text@911

@emdash-cms/x402

npm i https://pkg.pr.new/@emdash-cms/x402@911

@emdash-cms/plugin-ai-moderation

npm i https://pkg.pr.new/@emdash-cms/plugin-ai-moderation@911

@emdash-cms/plugin-atproto

npm i https://pkg.pr.new/@emdash-cms/plugin-atproto@911

@emdash-cms/plugin-audit-log

npm i https://pkg.pr.new/@emdash-cms/plugin-audit-log@911

@emdash-cms/plugin-color

npm i https://pkg.pr.new/@emdash-cms/plugin-color@911

@emdash-cms/plugin-embeds

npm i https://pkg.pr.new/@emdash-cms/plugin-embeds@911

@emdash-cms/plugin-forms

npm i https://pkg.pr.new/@emdash-cms/plugin-forms@911

@emdash-cms/plugin-webhook-notifier

npm i https://pkg.pr.new/@emdash-cms/plugin-webhook-notifier@911

commit: aec3804

@masonjames masonjames force-pushed the fix/wp-import-size-variant-urls branch from 2247601 to 03b12ba Compare May 4, 2026 02:32
@github-actions github-actions Bot added size/L and removed size/M labels May 4, 2026
@masonjames masonjames marked this pull request as ready for review May 4, 2026 02:51
Copilot AI review requested due to automatic review settings May 4, 2026 02:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes WordPress import URL rewriting so content that references WordPress-generated image size variants (e.g. hero-1024x695.jpg) is correctly rewritten even when the urlMap only contains the original attachment URL (e.g. hero.jpg).

Changes:

  • Extract URL rewrite helpers into a dedicated module and expand matching to handle WordPress -NNNxNNN size suffix variants.
  • Tighten string URL rewriting to avoid prefix rewrites inside longer filenames (e.g. ...jpg.webp) while still matching common punctuation boundaries.
  • Add focused unit tests covering variant matching, dimension-named originals, longer-filename guards, and prose punctuation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
packages/core/src/astro/routes/api/import/wordpress/rewrite-urls.ts Switches endpoint implementation to use extracted helper functions for URL matching/rewriting.
packages/core/src/astro/routes/api/import/wordpress/rewrite-url-helpers.ts New helper module implementing base URL matching + WordPress size-suffix handling + safer string matching regex.
packages/core/tests/unit/import/wordpress-rewrite-urls.test.ts New unit tests validating matching and rewrite behavior for variants and edge cases.
.changeset/wp-import-size-variant-urls.md Patch changeset documenting the fix for published package(s).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/core/src/astro/routes/api/import/wordpress/rewrite-urls.ts
@ascorbic
Copy link
Copy Markdown
Collaborator

ascorbic commented May 5, 2026

/review

Copy link
Copy Markdown
Collaborator

@ascorbic ascorbic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented May 5, 2026

Let me think about one more potential issue: the regex pattern uses [^./?#]+$ for the extension. What if the URL has a percent-encoded extension or unusual chars? For typical WP URLs (.jpg, .png, .webp, .gif) this is fine.

Let me also consider: stripWordPressImageSizeSuffix behavior when baseUrl has a ?: Wait, getBaseUrl already strips the query, so baseUrl passed to stripWordPressImageSizeSuffix won't have query. Good.

Let me check if there's a potential issue where the regex's terminator causes incorrect matching of CSS-style URLs:

In CSS: background-image: url(https://example.com/hero-1024x695.jpg); - after .jpg comes ). ) IS in the outer class. ✓

Markdown image: ![alt](https://example.com/hero-1024x695.jpg) - same. ✓

JSON in srcset: srcset="hero-300x200.jpg 300w, hero-600x400.jpg 600w" - first URL ends with space, then 300w. Space matches. Second URL ends with ". Matches. ✓

Overall this PR looks well-implemented. The author addressed the Copilot reviewer's two concerns:

  1. The > boundary case is handled (and tested at line 56-65).
  2. The PortableTextBlock type was exported and imported.

Let me also verify no double-counting issue is introduced as a regression. Looking at rewriteStringUrls:

for (const [baseUrl, newUrl] of baseMap.entries()) {
    const regex = buildBaseUrlMatchRegex(baseUrl);
    const matches = newValue.match(regex);
    if (matches) {
        for (const match of matches) {
            if (!exactMap[match]) {
                newValue = newValue.split(match).join(newUrl);
                ...
            }
        }
    }
}

If matches contains duplicates (same URL appearing multiple times in the string), the first iteration replaces all; subsequent iterations of duplicates do split.join which essentially no-ops, but urlsRewritten++ still increments. Pre-existing behavior, not new.

Actually wait, I want to double-check one thing: does the new regex's optional -NNNxNNN cause double-counting when both hero.jpg and hero-1024x695.jpg appear in the same string?

Say string is "hero.jpg and hero-1024x695.jpg" and urlMap is {hero.jpg: newUrl}.

  1. Exact match phase: includes("hero.jpg") → true. split("hero.jpg").join(newUrl)"newUrl and hero-1024x695.jpg" (because hero.jpg IS a substring of hero-1024x695.jpg!). Hmm, wait. "hero-1024x695.jpg" contains "hero.jpg"? No - hero-1024x695.jpg has hero-1... not hero.jpg. So no false match. split("hero.jpg") returns ["", " and hero-1024x695.jpg"]. Wait actually: Looking at the FULL string "hero.jpg and hero-1024x695.jpg", does it contain "hero.jpg"? Yes, at position 0 (and only there because hero-1024x695.jpg has a - after hero). So split.join produces "newUrl and hero-1024x695.jpg". ✓

  2. Base match phase: baseMap has hero.jpg→newUrl. Regex hero(?:-\d+x\d+)?\.jpg with terminators. Matches hero-1024x695.jpg. Replaces it. ✓

urlsRewritten = 2 (one from exact, one from base). Correct count! ✓

OK, looks good. One concern I want to flag though:

Potential issue: In the regex, the lookahead ["'\s<>)\],;:!?] includes ?. But a ? inside the character class is treated as literal. Hmm but URLs use ? to start query strings. The (\?...)? group BEFORE the lookahead consumes the query. So if URL is hero.jpg?q=1, it consumes ?q=1, then lookahead sees end-of-string or terminator. But what if there's something AFTER ?q=1 like hero.jpg?q=1?other? The query group is greedy [^"'\s]* which would consume everything until whitespace/quote. So ?q=1?other would all be consumed by the query group. Then lookahead checks next char. OK.

What about hero.jpg? alone (no query value)? Query group matches ?. Then lookahead checks next char (end or terminator). Whatever comes next.

Alright, I don't see critical bugs. The PR is well-structured: the helpers were extracted, tests were added, and adversarial review addressed found edge cases. The Copilot bot's two concerns were both addressed by the author.

Let me look one more time at the BASE_URL_EXTENSION regex which has ^(.+)(\.[^./?#]+)$. The greedy .+ means for https://example.com/path/hero.jpg, group 1 captures everything up to the last .. Good. But wait - what if the base URL has no path-final segment with a dot, e.g., https://example.com/feed/? Then no match, falls back to plain match. ✓

And the BASE_URL_EXTENSION regex doesn't anchor [^./?#]+ to the path component - but since getBaseUrl strips query and fragment, and there's no ? or # left, the only . in the URL would be in the host (e.g., example.com) and in the file extension. But wait - example.com.something/foo? URLs with dots in their domain. Hmm, getBaseUrl(https://example.com/foo.jpg)https://example.com/foo.jpg. The greedy .+ matches up to the last ., so group 1 = https://example.com/foo, group 2 = .jpg. ✓ The dot in example.com doesn't confuse it because .+ is greedy.

What about https://example.com/feed (no extension)? ^(.+)(\.[^./?#]+)$ requires a final .ext. No match. Falls back to plain escapeRegExp. ✓

What about https://example.com/foo (just domain + path, no extension, no slash terminator)? Same — no match, plain escape. ✓

What about malformed URL strings where the LAST . is in the domain like https://example.com? ^(.+)(\.[^./?#]+)$ - .+ greedy matches https://example, then .com matches as extension. So basePattern would be https://example(?:-\d+x\d+)?\.com. This is harmless because such URLs won't appear in actual content, but it's a slightly weird outcome. Not a real bug.

I think this PR is solid. The only minor observation worth a comment is about the unquoted attribute self-closing case <img src=...jpg/> (the / is not in the terminator class), but this is genuinely uncommon in WordPress content (which uses HTML, not XHTML), and is more of an edge note than an actionable concern.

Given the thorough adversarial review the author already performed, the comprehensive tests, the addressing of all Copilot feedback, and the focused scope of the change, I think this is good.

LGTM!

github run

@ascorbic ascorbic merged commit 9146931 into emdash-cms:main May 6, 2026
28 checks passed
@emdashbot emdashbot Bot mentioned this pull request May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WordPress import: size-variant image URLs in post content are not rewritten

3 participants