
Improved URL service and sitemap boot performance#26676

Closed
rob-ghost wants to merge 6 commits into main from chore/url-service-boot-optimisations

@rob-ghost
Contributor

The URL service is one of the slowest parts of Ghost's boot sequence. It fetches every published resource from the database, generates URLs, and then notifies the sitemap service one resource at a time via events. This PR tackles four independent bottlenecks in that pipeline.

First, the Urls data store was coupled to event emission — every call to Urls.add() fired a url.added event, which the sitemap service consumed one-by-one during boot. This meant N event dispatches, N moment.js date parses, and N XML node constructions before the server could start. The store is now a pure data structure, and the URL service emits a single bulk url.init event after all URLs are generated. The sitemap manager handles this bulk event to initialize all entries at once, while runtime additions/removals still use per-resource events emitted by the UrlGenerator.
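The split between a silent store and a single bulk event can be sketched as follows. This is a minimal illustration, not the PR's code: `UrlStore` and `SitemapIndex` are hypothetical stand-ins, and the event payload shapes are assumptions. Only the event names `url.init`, `url.added`, and `url.removed` come from the description above.

```javascript
const EventEmitter = require('node:events');

// Hypothetical stand-in for the Urls store: a plain data structure that emits nothing.
class UrlStore {
    constructor() {
        this.byId = new Map();
    }

    add({id, url, type}) {
        this.byId.set(id, {id, url, type});
    }

    remove(id) {
        this.byId.delete(id);
    }

    getAll() {
        return [...this.byId.values()];
    }
}

// Hypothetical sitemap consumer that counts how many events it had to handle.
class SitemapIndex {
    constructor(events) {
        this.entries = new Map();
        this.eventsHandled = 0;

        // Boot: one event carrying every URL (payload shape is an assumption).
        events.on('url.init', (all) => {
            this.eventsHandled += 1;
            for (const entry of all) {
                this.entries.set(entry.id, entry.url);
            }
        });

        // Runtime: per-resource events, as before.
        events.on('url.added', (entry) => {
            this.eventsHandled += 1;
            this.entries.set(entry.id, entry.url);
        });
        events.on('url.removed', ({id}) => {
            this.eventsHandled += 1;
            this.entries.delete(id);
        });
    }
}

const events = new EventEmitter();
const store = new UrlStore();
const sitemap = new SitemapIndex(events);

// Boot: fill the store silently, then notify once.
for (let i = 1; i <= 1000; i++) {
    store.add({id: i, url: `/post-${i}/`, type: 'posts'});
}
events.emit('url.init', store.getAll());

// Runtime: the generator (not the store) emits for a single published post.
const published = {id: 1001, url: '/post-1001/', type: 'posts'};
store.add(published);
events.emit('url.added', published);
```

In this sketch the sitemap handles two events in total for 1001 URLs, instead of one event per URL during boot.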

Second, the resource config used exclude lists that grew with every new schema column and still fetched ~20 columns per post when only ~10 are needed for URL generation, sitemap XML, and change detection. These are now explicit include lists, making the contract between the URL service and the database self-documenting and reducing query payload.
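As a rough sketch of what an include list buys, the snippet below turns a config into an explicit column selection and reuses the same list to decide whether a change matters. The config shape, the column names, and the helpers `selectColumns` and `hasRelevantChange` are all hypothetical; the actual lists and the change-detection logic in the PR may differ.

```javascript
// Hypothetical resource config. The column names are illustrative only,
// not the actual include list from this PR.
const postsConfig = {
    type: 'posts',
    tableName: 'posts',
    include: ['id', 'slug', 'canonical_url', 'feature_image', 'published_at', 'updated_at']
};

// Build the qualified column list a query would select.
function selectColumns({tableName, include}) {
    if (!Array.isArray(include) || include.length === 0) {
        throw new Error(`Resource config for ${tableName} needs a non-empty include list`);
    }
    return include.map(column => `${tableName}.${column}`);
}

// Assumed change-detection rule: a change only matters to the URL service
// if it touches a column the service actually fetched.
function hasRelevantChange(changedAttributes, include) {
    return Object.keys(changedAttributes).some(key => include.includes(key));
}

const columns = selectColumns(postsConfig);
const slugChanged = hasRelevantChange({slug: 'new-slug'}, postsConfig.include);
const htmlChanged = hasRelevantChange({html: '<p>edited</p>'}, postsConfig.include);
```

Because the list is explicit, adding a new column to the schema changes neither the query nor the change-detection behaviour unless the column is added to `include`.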

Third, raw-knex.js existed specifically to bypass Bookshelf ORM overhead, yet still called toJSON(), fixBools(), and fixDatesWhenFetch() through the Bookshelf prototype on every row. toJSON/serialize just shallow-copy attributes with no meaningful transformation in this context. fixDatesWhenFetch parses dates with moment.js redundantly since knex already returns JavaScript Date objects. The only necessary operation — boolean coercion — is now a pre-computed column loop without Bookshelf prototype lookups.
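The pre-computed boolean loop might look something like this. It is a sketch under assumptions: the `schema` object and the helper names are hypothetical, and it assumes the database driver returns boolean columns as `0`/`1`.

```javascript
// Hypothetical schema fragment; the real column types would come from
// Ghost's schema definition.
const schema = {
    posts: {
        id: {type: 'string'},
        featured: {type: 'boolean'},
        published_at: {type: 'dateTime'}
    }
};

// Computed once per table, not once per row.
function booleanColumnsFor(tableName, columns) {
    return columns.filter(column => schema[tableName][column]?.type === 'boolean');
}

// Coerce only the known boolean columns in place; everything else,
// including Date objects, is left untouched.
function coerceBooleans(rows, booleanColumns) {
    for (const row of rows) {
        for (const column of booleanColumns) {
            if (row[column] !== null && row[column] !== undefined) {
                row[column] = Boolean(row[column]);
            }
        }
    }
    return rows;
}

const publishedAt = new Date('2026-03-01T00:00:00Z');
const rows = [
    {id: 'a', featured: 1, published_at: publishedAt},
    {id: 'b', featured: 0, published_at: publishedAt}
];

const booleanColumns = booleanColumnsFor('posts', ['id', 'featured', 'published_at']);
coerceBooleans(rows, booleanColumns);
```

The inner loop runs only over the boolean columns, so a table with none pays almost nothing per row.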

Fourth, relation queries (tags, authors) used WHERE IN with every post ID materialized as a literal. For 10k posts that's 10k string values the query planner must parse. This is replaced with a subquery that mirrors the main query's NQL filter, letting the database use an index scan instead of parsing a literal list.
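To make the difference in query shape concrete, the sketch below builds both variants as parameterised SQL strings. The helper names are hypothetical, the `posts_tags` table and its columns are assumed for illustration, and the filter fragment stands in for whatever the NQL filter compiles to; the PR itself builds these queries with knex.

```javascript
// Old shape: every post ID is passed to the database individually.
function relationQueryWithIdList(postIds) {
    const placeholders = postIds.map(() => '?').join(', ');
    return {
        sql: `SELECT post_id, tag_id FROM posts_tags WHERE post_id IN (${placeholders})`,
        bindings: postIds
    };
}

// New shape: the database derives the IDs itself from the same filter
// the main query used.
function relationQueryWithSubquery(filter) {
    return {
        sql: `SELECT post_id, tag_id FROM posts_tags WHERE post_id IN (SELECT id FROM posts WHERE ${filter.sql})`,
        bindings: filter.bindings
    };
}

// Stand-in for the compiled main-query filter (an assumption for this example).
const mainFilter = {
    sql: 'posts.status = ? AND posts.type = ?',
    bindings: ['published', 'post']
};

const postIds = Array.from({length: 10000}, (_, i) => `post-${i}`);
const idListQuery = relationQueryWithIdList(postIds);
const subqueryQuery = relationQueryWithSubquery(mainFilter);
```

Here the first query carries 10,000 values while the second carries two, regardless of how many posts exist.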

An integration test suite (url-service-and-sitemap.test.js) was added first as a safety net, covering both default routing and custom multi-collection routing with custom fixtures. It verifies URL generation, sitemap content, canonical exclusions, and draft filtering end-to-end.

Safety-net test that boots the URL service with custom fixtures and
verifies both URL resolution and sitemap XML output from the same
entrypoint. It tests outcomes: correct paths per collection, canonical_url
exclusion, draft exclusion, orphan tag exclusion, feature_image in
sitemap image nodes, and multi-collection routing.

The Urls class emitted url.added/url.removed events on every add/remove,
coupling the data structure to notification. During boot this fires once
per resource — the only consumer is the sitemap service, which does
per-row moment() parsing and XML node construction for each event.

Urls is now a pure data structure. UrlGenerator emits url.added and
url.removed for runtime changes (post published, updated, deleted).
UrlService emits a single url.init event after boot completes, and
SiteMapManager handles it in bulk. This eliminates N event emissions
during init, replacing them with one.

The old config used exclude lists that grew with every new column added
to the schema. Include lists are explicit about what the URL service
needs: only fields used for URL generation (permalink patterns, NQL
filter evaluation), sitemap XML (dates, images, canonical_url), and
runtime change detection. This reduces the query payload for posts from
~20 columns to 10, and similarly for other resource types.

Also updated raw-knex.js to support `include` option for column
selection, and resources.js to derive ignored-properties for change
detection from the include list rather than the exclude list.

raw-knex.js existed specifically to bypass Bookshelf's per-row overhead,
but still called toJSON(), fixBools(), and fixDatesWhenFetch() through
the Bookshelf prototype on every row. toJSON/serialize just shallow-copy
attributes with no meaningful transformation. fixDatesWhenFetch parses
dates with moment.js but knex already returns JavaScript Date objects.
fixBools is the only necessary operation — replaced with a pre-computed
boolean column loop that runs without Bookshelf prototype lookups or
moment.js overhead.

Relation queries (tags, authors) used WHERE IN with every post ID
materialized as a literal — for 10k posts that's 10k string values the
query planner must parse and optimize. Replaced with a subquery that
mirrors the main query's NQL filter and shouldHavePosts conditions,
letting the database use an index scan instead.

@coderabbitai
Contributor

coderabbitai bot commented Mar 3, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@ErisDS
Member

ErisDS commented Mar 3, 2026

🤖 Velo CI Failure Analysis

Classification: 🟠 SOFT FAIL

  • Workflow: CI
  • Failed Step: Legacy tests
  • Run: View failed run
    What failed: Test assertion failure: expected 301 to equal 200 in frontend behavior tests
    Why: The assertion message "expected 301 to equal 200" indicates that a request returned HTTP 301 (a redirect) where the frontend behavior test expects 200. This is a code issue: the test validates the application's behavior, so the failure points to a problem in the code under test.
    Action:
    The author should investigate the frontend behavior tests and fix the issue that is causing the assertion failure. This is likely a bug in the application code that needs to be addressed.

// exclude fields if provided
if (exclude) {
// select only the fields needed
if (include) {
Member


Should there be a debug assertion that errors when both include and exclude are provided?

Contributor Author


FYI, this was a PoC that is being decomposed into more reviewable chunks. The first of these is here; it removes exclude, as it's not used by anything after this change.

@rob-ghost
Contributor Author

Closing in favour of decomposing this into more reviewable chunks; the first is here: #26689

@rob-ghost rob-ghost closed this Mar 4, 2026