Improved URL service and sitemap boot performance #26676
Safety-net test that boots the URL service with custom fixtures and verifies both URL resolution and sitemap XML output from the same entrypoint. Tests outcomes: correct paths per collection, canonical_url exclusion, draft exclusion, orphan tag exclusion, feature_image in sitemap image nodes, and multi-collection routing.
The Urls class emitted url.added/url.removed events on every add/remove, coupling the data structure to notification. During boot this fires once per resource — the only consumer is the sitemap service, which does per-row moment() parsing and XML node construction for each event. Urls is now a pure data structure. UrlGenerator emits url.added and url.removed for runtime changes (post published, updated, deleted). UrlService emits a single url.init event after boot completes, and SiteMapManager handles it in bulk. This eliminates N event emissions during init, replacing them with one.
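A minimal sketch of the boot-time versus runtime event split described above, using Node's built-in `EventEmitter`. The `UrlStore` class, the payload shapes, and the listener bodies are hypothetical stand-ins for illustration and are not Ghost's actual implementation; only the event names `url.init` and `url.added` come from the description.

```javascript
const {EventEmitter} = require('node:events');

// Hypothetical stand-in for the Urls store: a pure data structure, no events.
class UrlStore {
    constructor() {
        this.urls = new Map();
    }

    add(resource, url) {
        // Only records the entry; notification is the caller's job.
        this.urls.set(resource.id, {url, resource});
    }

    getAll() {
        return [...this.urls.values()];
    }
}

const events = new EventEmitter();
const store = new UrlStore();

// Hypothetical sitemap consumer: one bulk handler, one per-resource handler.
const sitemapEntries = [];
let bulkCalls = 0;

events.on('url.init', (entries) => {
    bulkCalls += 1;
    for (const {url} of entries) {
        sitemapEntries.push(url);
    }
});

events.on('url.added', ({url}) => {
    sitemapEntries.push(url);
});

// Boot: fill the store silently, then notify once (assumed payload: all entries).
for (let i = 1; i <= 3; i += 1) {
    store.add({id: i}, `/post-${i}/`);
}
events.emit('url.init', store.getAll());

// Runtime: a single publish still produces a per-resource event.
store.add({id: 4}, '/post-4/');
events.emit('url.added', {url: '/post-4/'});
```

In this sketch the sitemap listener runs once for the three boot-time resources and once more for the runtime addition, rather than once per resource during boot.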
The old config used exclude lists that grew with every new column added to the schema. Include lists are explicit about what the URL service needs: only fields used for URL generation (permalink patterns, NQL filter evaluation), sitemap XML (dates, images, canonical_url), and runtime change detection. This reduces the query payload for posts from ~20 columns to 10, and similarly for other resource types. Also updated raw-knex.js to support `include` option for column selection, and resources.js to derive ignored-properties for change detection from the include list rather than the exclude list.
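As a rough illustration of how an include list can drive both column selection and change detection, consider the sketch below. The config shape, the column names, and the helpers `toSelectColumns` and `hasRelevantChange` are assumptions made for this example; the actual column list and the logic in `raw-knex.js` and `resources.js` may differ.

```javascript
// Hypothetical resource config; the real include list for posts has 10 columns.
const postConfig = {
    tableName: 'posts',
    include: ['id', 'slug', 'published_at', 'updated_at', 'canonical_url', 'feature_image']
};

// Build the qualified column list a query builder could pass to SELECT.
function toSelectColumns({tableName, include}) {
    return include.map(column => `${tableName}.${column}`);
}

// Treat a change as relevant only if it touches an included column.
function hasRelevantChange(previous, next, include) {
    return include.some(column => previous[column] !== next[column]);
}

const columns = toSelectColumns(postConfig);

// A change to a column outside the include list is ignored.
const htmlOnlyChange = hasRelevantChange(
    {slug: 'hello', html: '<p>a</p>'},
    {slug: 'hello', html: '<p>b</p>'},
    postConfig.include
);

// A change to an included column is detected.
const slugChange = hasRelevantChange(
    {slug: 'hello'},
    {slug: 'hello-world'},
    postConfig.include
);
```

With this approach a newly added schema column is ignored by default, whereas an exclude list has to be extended for every new column.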
raw-knex.js existed specifically to bypass Bookshelf's per-row overhead, but still called toJSON(), fixBools(), and fixDatesWhenFetch() through the Bookshelf prototype on every row. toJSON/serialize just shallow-copy attributes with no meaningful transformation. fixDatesWhenFetch parses dates with moment.js but knex already returns JavaScript Date objects. fixBools is the only necessary operation — replaced with a pre-computed boolean column loop that runs without Bookshelf prototype lookups or moment.js overhead.
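The pre-computed boolean column loop could look roughly like the following. The `schema` object and the function names are hypothetical; the sketch assumes the database driver returns boolean columns as `0`/`1` and that the list of boolean columns can be computed once per table rather than once per row.

```javascript
// Hypothetical schema description; only the column types matter here.
const schema = {
    posts: {
        id: {type: 'string'},
        slug: {type: 'string'},
        featured: {type: 'boolean'}
    }
};

// Computed once per table, before iterating over any rows.
function getBooleanColumns(tableName) {
    const table = schema[tableName];
    return Object.keys(table).filter(column => table[column].type === 'boolean');
}

// Coerce 0/1 values to real booleans in place, with no ORM model involved.
function coerceBooleans(rows, booleanColumns) {
    for (const row of rows) {
        for (const column of booleanColumns) {
            // Skip columns that were not selected for this query.
            if (column in row) {
                row[column] = Boolean(row[column]);
            }
        }
    }
    return rows;
}

const booleanColumns = getBooleanColumns('posts');
const rows = coerceBooleans(
    [
        {id: 'a', slug: 'first', featured: 1},
        {id: 'b', slug: 'second', featured: 0}
    ],
    booleanColumns
);
```

Note that this sketch maps `null` to `false`; whether that is desirable depends on how nullable boolean columns are treated elsewhere.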
Relation queries (tags, authors) used WHERE IN with every post ID materialized as a literal — for 10k posts that's 10k string values the query planner must parse and optimize. Replaced with a subquery that mirrors the main query's NQL filter and shouldHavePosts conditions, letting the database use an index scan instead.
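To make the difference concrete, the sketch below builds both query shapes as plain SQL strings. The table and column names (`posts_tags`, `post_id`, `tag_id`), the filter fragment, and both helper functions are assumptions for illustration; the PR itself builds these queries through knex and derives the filter from NQL. The first helper uses placeholders rather than inlined literals, but the list still grows with the number of posts in the same way.

```javascript
// Old shape: one value per post ID, so the statement grows with the post count.
function relationQueryWithIdList(postIds) {
    const placeholders = postIds.map(() => '?').join(', ');
    return {
        sql: `SELECT post_id, tag_id FROM posts_tags WHERE post_id IN (${placeholders})`,
        bindings: postIds
    };
}

// New shape: the subquery repeats the main query's filter, so its size is constant.
function relationQueryWithSubquery(filterSql, filterBindings) {
    return {
        sql: `SELECT post_id, tag_id FROM posts_tags WHERE post_id IN (SELECT id FROM posts WHERE ${filterSql})`,
        bindings: filterBindings
    };
}

// 10k hypothetical post IDs, as in the example above.
const postIds = Array.from({length: 10000}, (_, i) => `post-${i}`);

const withIdList = relationQueryWithIdList(postIds);

// Hypothetical filter standing in for the compiled NQL filter of the main query.
const withSubquery = relationQueryWithSubquery('status = ? AND type = ?', ['published', 'post']);
```

The ID-list variant carries 10,000 bindings, while the subquery variant carries only the bindings of the filter itself, regardless of how many posts match.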
// exclude fields if provided
if (exclude) {
// select only the fields needed
if (include) {
Should there be a debug assertion that errors when both include and exclude are provided?
FYI, this was a PoC that is being decomposed into more reviewable chunks. The first of these removes exclude, as it's not used by anything after this change.
Closing in favour of decomposing into more reviewable chunks; the first is here: #26689
The URL service is one of the slowest parts of Ghost's boot sequence. It fetches every published resource from the database, generates URLs, and then notifies the sitemap service one resource at a time via events. This PR tackles four independent bottlenecks in that pipeline.
First, the `Urls` data store was coupled to event emission: every call to `Urls.add()` fired a `url.added` event, which the sitemap service consumed one-by-one during boot. This meant N event dispatches, N moment.js date parses, and N XML node constructions before the server could start. The store is now a pure data structure, and the URL service emits a single bulk `url.init` event after all URLs are generated. The sitemap manager handles this bulk event to initialize all entries at once, while runtime additions/removals still use per-resource events emitted by the `UrlGenerator`.
Second, the resource config used exclude lists that grew with every new schema column and still fetched ~20 columns per post when only ~10 are needed for URL generation, sitemap XML, and change detection. These are now explicit include lists, making the contract between the URL service and the database self-documenting and reducing query payload.
Third, `raw-knex.js` existed specifically to bypass Bookshelf ORM overhead, yet still called `toJSON()`, `fixBools()`, and `fixDatesWhenFetch()` through the Bookshelf prototype on every row. `toJSON`/`serialize` just shallow-copy attributes with no meaningful transformation in this context. `fixDatesWhenFetch` parses dates with moment.js redundantly, since knex already returns JavaScript Date objects. The only necessary operation, boolean coercion, is now a pre-computed column loop without Bookshelf prototype lookups.
Fourth, relation queries (tags, authors) used `WHERE IN` with every post ID materialized as a literal. For 10k posts that's 10k string values the query planner must parse. This is replaced with a subquery that mirrors the main query's NQL filter, letting the database use an index scan instead of parsing a literal list.
An integration test suite (`url-service-and-sitemap.test.js`) was added first as a safety net, covering both default routing and custom multi-collection routing with custom fixtures. It verifies URL generation, sitemap content, canonical exclusions, and draft filtering end-to-end.