JSON-LD Graph Manager#5829

Draft
hparra wants to merge 36 commits into
stagefrom
hgpa/jsonld-graph-manager
Conversation

@hparra

@hparra hparra commented Apr 18, 2026

Summary

The JSON-LD Graph Manager is a Milo feature that collects all the JSON-LD on a page and rewrites it as one canonical, linked @graph. This centralization enables the manager to automatically apply JSON-LD graph features that may improve search engine and LLM visibility, such as cross-entity @id linking and singleton enforcement for certain types.

Specification

See libs/utils/json-ld.md.

Testing

You can use the following URL query parameters with any AEM URL:

  • milolibs=hgpa-jsonld-graph-manager to load Milo from this branch
  • jsonld-graph-manager=true to enable the feature (off by default). This can also be done via page metadata.
  • jsonld-graph-manager-debug=true to enable console.debug logging. Enable the 'Verbose' level in the DevTools console to see the output.

Example URLs:

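Use the following JavaScript snippet in the DevTools console to inspect the page's JSON-LD. It returns the managed graph when present; otherwise it returns every parseable JSON-LD script on the page:

// Prefer the manager's output script; JSON.parse('null') yields null when it is absent.
JSON.parse(document.querySelector('script[data-milo-jsonld="graph"]')?.textContent ?? 'null')
  // Fall back to all JSON-LD scripts, skipping any that fail to parse.
  ?? [...document.querySelectorAll('script[type="application/ld+json"]')]
       .map(s => { try { return JSON.parse(s.textContent); } catch { return null; } })
       .filter(Boolean);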

hparra and others added 2 commits April 18, 2026 12:52
Group the flat 17-section layout into five titled parts (Motivation,
Architecture, Data Model & Validation, Operations, Reference) with
short intros, add a design-spec status banner, add TL;DR leads to the
densest sections, de-duplicate canonical-identity and producer-contract
discussion, and add a manager-vs-cohort comparison table.

Add five Operations sections promised but not previously specified:
Testing Strategy, Performance Considerations, Rollback And Coexistence,
Direct-Push API Surface, and Security Considerations. Open questions
are marked inline so reviewers can react to concrete text.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@aem-code-sync

aem-code-sync Bot commented Apr 18, 2026

Hello, I'm the AEM Code Sync Bot and I will run some actions to deploy your branch.
In case there are problems, just click the checkbox below to rerun the respective action.

  • Re-sync branch
Commits

@hparra hparra marked this pull request as draft April 18, 2026 20:24
@hparra hparra changed the title Add JSON-LD graph manager design doc JSON-LD Graph Manager Apr 18, 2026
@hparra hparra requested a review from Copilot April 18, 2026 20:26

Copilot AI left a comment

Pull request overview

Adds a design-specification document for a planned JsonLdGraphManager runtime, describing motivation, architecture/lifecycle, canonical graph/merge rules, operational concerns (logging/testing/perf/rollback), and reference examples.

Changes:

  • Introduces a comprehensive JSON-LD graph-manager design spec (feature-flagging, lifecycle, data contracts).
  • Defines normalization/merge/dedupe and provenance conventions for multi-producer JSON-LD aggregation.
  • Documents operational strategy (observability via Lana, testing levels, performance envelope, rollback).

Comment thread libs/utils/json-ld.md Outdated
Comment thread libs/utils/json-ld.md Outdated
Comment thread libs/utils/json-ld.md Outdated
Comment thread libs/utils/json-ld.md Outdated
hparra added 2 commits April 18, 2026 19:21
Second pass on the JsonLdGraphManager design doc focused on readability
and presentation flow for a broader audience.

- Restructure into 6 parts (add Part II Rollout) with italic dicta under
  each section heading to anchor the key idea
- Add Quickstart, "Who this is for" audience matrix, and Glossary
- Add Mermaid diagrams: 3-beat architecture flowchart, before/after
  comparison, initialization and mutation sequence diagrams, canonical
  editorial and product page graph shapes
- Annotate Appendix A examples with "What to notice" callouts
- Consolidate all Open Questions into Appendix B table
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager April 20, 2026 21:31 Inactive
Reorganize JsonLdGraphManager spec so the reading order follows the
systems/design-paper convention (why -> what -> does-it-work ->
how-we-ship -> caveats) instead of interleaving deployment before
design.

- Part I Introduction (Abstract, Scope, Problem, Before/After, Contributions)
- Part II Design (Decision, Architecture, Lifecycle, DOM & Output
  Contracts, Producer Integration, Direct-Push API, Normalization,
  Canonical Graph Model)
- Part III Evaluation (Validation Cohort, Testing, Performance)
- Part IV Deployment (Feature Flag, Rollout, Rollback, Observability)
- Part V Security Considerations (promoted to top-level, RFC convention)
- Part VI Related Work & Reference (Authoring Catalog, References,
  Appendices A-D; Glossary moved to appendix)

Specific moves:
- Design Decision moves from Motivation to opener of Design
- Before/After moves from Architecture to Introduction (motivation device)
- Direct-Push API moves from Operations to Design (it's a public interface)
- Validation Cohort + Testing + Performance grouped in Evaluation
- Security promoted from Operations subsection to top-level part
- Glossary moves to Appendix D
- Rename "Data Model And Contracts" -> "DOM And Output Contracts" to
  eliminate name collision with the data-model material in Part II
- Add bulleted Contributions list in Introduction

No content changes; only section relocations, one rename, and the new
Contributions list.
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager April 20, 2026 22:15 Inactive
@github-actions

This PR has not been updated recently and will be closed in 7 days if no action is taken. Please ensure all checks are passing, https://github.com/orgs/adobecom/discussions/997 provides instructions. If the PR is ready to be merged, please mark it with the "Ready for Stage" label.

@github-actions github-actions Bot added the Stale label Apr 28, 2026
Reframe the spec to point at the requirements sheet in
structured-data-json-ld.json as the machine-readable source of truth and
keep the markdown doc as rationale and contract. Remove sections that
restated rules now owned by the JSON sheet; remove provenance entirely
(debug mode is the appropriate place to surface per-source origin).

- Externalize: drop "DOM And Output Contracts" subsections, identity
  policy table, dedupe policy, governing-rules bullets, and the
  "Manager guarantees vs. cohort expectations" table; replace each
  with a one-line pointer to the requirements sheet.
- Provenance: remove the provenance contract subsection, the
  Provenance preservation security bullet, the Provenance glossary
  entry, and all producerName/producerType/ingestMode/discoveryPhase
  references in the Producer Integration Model, Direct-Push API,
  runtime lifecycle, sequence diagram, and testing strategy. Reframe
  observability so debug mode logs the original captured payload and
  DOM location rather than persisting a provenance record.
- Naming: rename section 3 from "Evaluation" to "Conformance" -- the
  doc covers conformance to the requirements spec, not empirical
  evaluation. Rename section 4 from "Deployment" to "Operations" so
  feature flagging and observability sit naturally together.
- Section numbering: collapse the 2.1->2.2->2.3->2.6 gap to a
  contiguous 2.1->2.6 sequence after the renames; add 3.1, 3.2.
- Out of scope: add a 3.2 "Out Of Scope" note clarifying that
  search-engine effectiveness measurement (bot-traffic logs, GSC URL
  Inspection API) is not gated by this spec.
- Cross-references: drop the broken anchor link on the canonical-graph
  section (target was renumbered); drop "direct graph-manager push"
  from the merge priority since the direct-push API is no longer
  specified in this doc; drop BreadcrumbList from Article.hasPart in
  the editorial diagram and Example 1 since it isn't a supplemental
  per the supplemental-linkage rule.
- Typos and grammar: paramater, eachother, this these, fo this,
  compelete, it's complexity, on on, speadsheet, awkward "JSON-LD on
  page meets" wording in the e2e testing bullet.

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


Comment thread libs/utils/json-ld.md Outdated
Comment thread libs/utils/json-ld.md Outdated
Comment thread libs/utils/json-ld.md Outdated
Comment thread libs/utils/json-ld.md Outdated
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager April 28, 2026 03:46 Inactive
Add a single-file ES module at libs/features/jsonld-graph-manager.js that
collects all per-page JSON-LD emitted by existing producers and rewrites it
as one canonical, linked @graph. Disabled by default; enabled per page via
the jsonld-graph-manager metadata flag or URL query parameter (string 'true',
case-insensitive).

The implementation is organized as pure helper functions plus a class, all
in one file, with named exports for unit-testability:

- RULES table encodes the requirements sheet (WebPage, Organization, Article,
  BreadcrumbList, SoftwareApplication, HowTo, FAQPage, VideoObject, Event,
  Product) — identity fragments, singleton flags, and default linkage edges.
- parsePayload: accepts object | array | { @graph } shapes; logs a Lana
  warning on parse failure.
- normalizeNode: strips per-node @context; rewrites @id to canonical
  page-scoped fragment (e.g. #article) or site-wide id (Organization).
- mergeNodes: resolves scalar conflicts by source priority (bootDom < runtime);
  unions reference arrays (hasPart, mainEntity, itemListElement) by @id.
- injectLinks: derives WebPage.mainEntity/breadcrumb/publisher and
  Article.isPartOf/mainEntityOfPage/publisher from the RULES table.
- JsonLdGraphManager class: boot scan of existing unmanaged scripts,
  MutationObserver on documentElement (childList + subtree), debounced
  rebuild queue, and rewrite() that synthesizes a minimal WebPage root
  when producers haven't provided one.
- init() default export: idempotent singleton stored on
  window.__jsonLdGraphManager.
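
The merge rule listed above (scalar conflicts resolved by source priority, reference arrays unioned by @id) might look roughly like the sketch below. It is an illustration only, not the code in libs/features/jsonld-graph-manager.js: the PRIORITY table, the REF_PROPS set, and the { node, source } contribution shape are assumptions made for the example.

```javascript
// Hypothetical priority table; the commit message only states bootDom < runtime.
const PRIORITY = { bootDom: 0, runtime: 1 };
// Reference-array properties named in the commit message.
const REF_PROPS = new Set(['hasPart', 'mainEntity', 'itemListElement']);

const toArray = (v) => (Array.isArray(v) ? v : [v]);

// Union two reference lists, de-duplicating entries that share an @id.
function unionByRef(a, b) {
  const seen = new Set();
  const out = [];
  for (const item of [...toArray(a), ...toArray(b)]) {
    const key = item && typeof item === 'object' && item['@id']
      ? item['@id']
      : JSON.stringify(item);
    if (!seen.has(key)) {
      seen.add(key);
      out.push(item);
    }
  }
  return out;
}

// Merge contributions ({ node, source }) that describe the same entity.
function mergeNodes(contributions) {
  // Lowest priority first, so higher-priority scalars overwrite later.
  const ordered = [...contributions].sort(
    (x, y) => PRIORITY[x.source] - PRIORITY[y.source],
  );
  const merged = {};
  for (const { node } of ordered) {
    for (const [key, value] of Object.entries(node)) {
      if (REF_PROPS.has(key) && key in merged) {
        merged[key] = unionByRef(merged[key], value);
      } else {
        merged[key] = value;
      }
    }
  }
  return merged;
}
```

In this sketch a runtime headline replaces a bootDom headline, while hasPart entries from both sources are kept once per @id.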

Boot wiring added to documentPostSectionLoading in libs/utils/utils.js —
placed before seotech/richresults so the MutationObserver is attached before
those producers append their scripts.

Tests (37, all passing) cover: flattenPayload, parsePayload (valid shapes +
invalid JSON → Lana warning), normalizeNode (canonical ids, context strip,
unknown type retention), unionByRef, mergeNodes (priority resolution, field
union, reference array union), injectLinks (forward/back links, no-overwrite),
boot scan, singleton enforcement, output contract (one managed script, no
per-node @context, WebPage-first ordering), MutationObserver pickup, and
three e2e pipeline fixtures (editorial, product, multi-producer priority).

What v1 does not include: direct-push producer API, runtime fetch of the
requirements sheet, provenance persistence, e2e cohort tests against live
URLs, search-effectiveness measurement.
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager April 28, 2026 04:17 Inactive
@github-actions

This PR does not qualify for the zero-impact label as it touches code outside of the allowed areas. The label is auto applied, do not manually apply the label.

hparra and others added 3 commits April 27, 2026 21:44
…le logging

Add ?jsonld-graph-manager-debug=true URL flag that emits console.debug output
at each queue lifecycle event: enqueue (source, DOM location, original payload),
rebuild (batch size, graph size), parsed (types, node count), removed from DOM,
and rewrite (node count, full expandable graph object). The graph object logged
on rewrite is the canonical @graph as produced, inspectable in DevTools without
a separate console snippet.

Debug output is gated entirely on the URL param and is independent of lanadebug
and the Lana endpoint -- these are high-volume success-path events that should
never be sent to Lana.
…ug flag doc

Organization synthesis:
- Always ensure a canonical Organization node is present in the graph.
  rewrite() synthesizes a minimal default if none is provided, or merges the
  default at graph-manager-generated priority (weight 2) so baseline fields
  (name, url, logo) always win over producer-supplied values while
  producer-only fields (e.g. sameAs) are preserved.
- Domain-aware: siteRoot() returns https://business.adobe.com for hostnames
  matching /business|bacom/i; defaults to https://www.adobe.com. defaultOrg()
  derives name ("Adobe" / "Adobe for Business"), url, logo, and @id from the
  site root. Both accept an optional hostname override for testability.
- 3-tier merge priority: generated (2) > runtime (1) > bootDom (0).
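
A minimal sketch of the domain-aware selection described above, assuming the hostname is passed in explicitly (the commit says the real functions accept it as an optional override). The @id shape used here is an assumption, and the logo field is omitted because the commit does not give its value.

```javascript
// Pick the site root from the hostname, per the regex quoted in the commit.
function siteRoot(hostname = '') {
  return /business|bacom/i.test(hostname)
    ? 'https://business.adobe.com'
    : 'https://www.adobe.com';
}

// Build a baseline Organization node from the site root.
function defaultOrg(hostname = '') {
  const root = siteRoot(hostname);
  const isBusiness = root === 'https://business.adobe.com';
  return {
    '@type': 'Organization',
    // Assumed id shape for illustration; the real canonical id may differ.
    '@id': `${root}/#organization`,
    name: isBusiness ? 'Adobe for Business' : 'Adobe',
    url: root,
  };
}
```

Taking the hostname as a parameter keeps both helpers pure, which is what makes the www/business/bacom cases easy to cover in unit tests.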

Inline entity extraction:
- extractInlineEntities() walks publisher, author, creator, provider, brand
  properties; hoists any inline typed object that lacks @id to a top-level
  graph node (via normalizeNode) and replaces the property value with an @id
  reference. Called during rebuild() after each node is normalized.
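
The hoisting step could be sketched as follows. This is not the shipped extractInlineEntities: the real function routes hoisted nodes through normalizeNode, whereas the fragmentFor helper and the { node, hoisted } return shape below are hypothetical stand-ins.

```javascript
// Properties walked for inline entities, as listed in the commit message.
const ENTITY_PROPS = ['publisher', 'author', 'creator', 'provider', 'brand'];

// Hypothetical id scheme for illustration only.
const fragmentFor = (pageUrl, type) => `${pageUrl}#${String(type).toLowerCase()}`;

// Replace inline typed objects that lack an @id with { @id } references
// and return the hoisted nodes alongside the rewritten host node.
function extractInlineEntities(node, pageUrl) {
  const hoisted = [];
  const out = { ...node };
  for (const prop of ENTITY_PROPS) {
    const value = out[prop];
    const isInlineTyped = Boolean(value)
      && typeof value === 'object'
      && !Array.isArray(value)
      && Boolean(value['@type'])
      && !value['@id'];
    if (isInlineTyped) {
      const id = fragmentFor(pageUrl, value['@type']);
      hoisted.push({ ...value, '@id': id });
      out[prop] = { '@id': id };
    }
  }
  return { node: out, hoisted };
}
```

Values that already carry an @id are left untouched, so existing references are not rewritten twice.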

Doc (libs/utils/json-ld.md):
- Summary: add one-line mention of jsonld-graph-manager-debug=true.
- §4.1: add debug flag entry alongside the feature flag.
- §4.2: replace vague "debug logging conventions" bullets with a concrete
  description of the five lifecycle events logged by the debug param; remove
  stale lanadebug reference.

Tests: 45 passing (8 new cases covering synthesis, precedence, domain
selection for www/business/bacom, inline extraction, and integration).
- Turn off no-continue globally in .eslintrc.js
- Add file-level no-use-before-define disable (lanaLog hoisted above parsePayload)
- Add inline no-nested-ternary disables for unionByRef coercions
- Add missing no-console disables for console.error/warn in lanaLog
- Rename _collect → collect (private method, underscore convention unnecessary)
- Rename window.__jsonLdGraphManager → window.miloJsonLdGraphManager
- Remove unused canonicalUrl import from test file
- Add no-promise-executor-return disable for test microtask flush

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager April 28, 2026 05:46 Inactive
Introduced by a907365: when mergeNodes promoted @type to a SoftwareApplication
subtype (WebApplication / MobileApplication / VideoGame), injectLinks() failed
two ways:

1. WebPage.mainEntity was never set, because the byType index keyed on the
   exact @type and the lookup was 'byType.Article ?? byType.SoftwareApplication'.
   With @type=WebApplication, byType.SoftwareApplication is undefined.

2. provider / isPartOf weren't auto-injected on the SA-subtype node, because
   the linksBack rule lookup was 'RULES[node["@type"]]' and RULES has no
   entry for the subtype.

Fix: introduce effectiveType(t) that maps SA subtypes to 'SoftwareApplication',
and apply it in two places:

- byType build: index the node under both its exact @type AND its effective
  parent (so byType.SoftwareApplication is populated when the node is a subtype)
- linksBack lookup: RULES[effectiveType(node['@type'])] so SA's linksBack
  rules apply to subtypes
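
The fix can be illustrated with a small sketch. The subtype list comes from the commit message; indexByType is a hypothetical name for the byType build, and keeping the first node seen per key is an assumption. The sketch also assumes @type is a single string.

```javascript
// SoftwareApplication subtypes named in the commit message.
const SA_SUBTYPES = new Set(['WebApplication', 'MobileApplication', 'VideoGame']);

// Map a subtype to its parent type; other types pass through unchanged.
const effectiveType = (t) => (SA_SUBTYPES.has(t) ? 'SoftwareApplication' : t);

// Index each node under its exact @type and its effective parent type.
function indexByType(nodes) {
  const byType = {};
  for (const node of nodes) {
    const exact = node['@type'];
    for (const key of new Set([exact, effectiveType(exact)])) {
      // Assumption: the first node seen for a key is kept.
      byType[key] = byType[key] ?? node;
    }
  }
  return byType;
}
```

With this index, a lookup such as byType.Article ?? byType.SoftwareApplication resolves even when the node's @type is WebApplication.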

Also extend the WebPage.mainEntity primary-type fallback to include
NewsArticle (richresults emits this and it should attach as mainEntity the
same way Article does).

Tests: 71/71 passing (68 + 3 new) covering mainEntity for WebApplication,
auto-provider on WebApplication, and mainEntity for NewsArticle.

Lint: clean.
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager May 8, 2026 01:32 Inactive
Add AggregateRating to the canonical graph as its own top-level node:

- New requirement aggregaterating-singleton (error): at most one
  AggregateRating per page, at the canonical @id
  '{canonicalPageURL}#aggregaterating'.
- New requirement aggregaterating-extraction (info): inline aggregateRating
  values on host entities (SoftwareApplication, Article, Product, etc.) are
  hoisted to the top-level @graph and replaced with { @id } references.
- New section 4.10 AggregateRating: schema.org hierarchy, Google rich-result
  citations (Software App, Product, Course, Review snippet), manager
  handling, known producers (review flow).

Implementation:
- Add AggregateRating: { idFragment: '#aggregaterating', singleton: true }
  to RULES so normalizeNode rewrites the @id.
- Add 'aggregateRating' to ENTITY_PROPS so extractInlineEntities hoists it.

Why singleton: every Adobe.com primary entity that exposes ratings has
exactly one canonical rating; multi-producer contributions describe the
same product (team-hardcoded snapshot vs. live review-block fetch) and
should merge. Source priority resolves freshness — runtime (review block)
wins over bootDom (team hardcode), so the freshest counts surface to
Google's software-app rich result.

Tests: 73/73 passing (71 + 2 new — extractInlineEntities hoisting,
end-to-end merge with bootDom + runtime contributions). One existing
end-to-end assertion updated to expect '{ @id }' instead of inline body.

Lint: clean.
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager May 8, 2026 01:43 Inactive
@github-actions

github-actions Bot commented May 8, 2026

This pull request is not passing all required checks. Please see this discussion for information on how to get all checks passing. Inconsistent checks can be manually retried. If a test absolutely can not pass for a good reason, please add a comment with an explanation to the PR.

@github-actions

This PR has not been updated recently and will be closed in 7 days if no action is taken. Please ensure all checks are passing, https://github.com/orgs/adobecom/discussions/997 provides instructions. If the PR is ready to be merged, please mark it with the "Ready for Stage" label.

@github-actions github-actions Bot added the Stale label May 15, 2026
hparra added 2 commits May 15, 2026 09:54
Suppress weak rating signal so consumers (Google rich-results, LLMs, search)
do not surface low-quality ratings under the Adobe brand:

- aggregaterating-min-rating-value (error): ratingValue MUST be >= 3.2
- aggregaterating-min-rating-count (error): ratingCount MUST be >= 100

These are Milo policy thresholds, not Google requirements. Google publishes
no documented minimums for ratingValue or ratingCount on Software App or
Review snippet rich results. The thresholds protect the brand from publishing
poor-but-real ratings (2.x stars) or noisy small-sample ratings (<100
reviewers, statistical accidents).

Implementation: rewrite() checks the canonical AggregateRating node before
serialization. If below threshold, removes the node from the graph AND
deletes any 'aggregateRating' reference from host entities so consumers do
not see a dangling @id.
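
A sketch of the threshold check and the cleanup it triggers, under stated assumptions: dropWeakRating is a hypothetical name for the step inside rewrite(), the graph is taken to be a flat array of nodes, and numeric strings are coerced with Number(), which may differ from the real implementation.

```javascript
// Milo policy thresholds from the commit message.
const MIN_RATING_VALUE = 3.2;
const MIN_RATING_COUNT = 100;

// True only when both fields are finite numbers at or above the thresholds.
function aggregateRatingMeetsThresholds(rating) {
  if (!rating || typeof rating !== 'object') return false;
  const value = Number(rating.ratingValue);
  const count = Number(rating.ratingCount);
  if (!Number.isFinite(value) || !Number.isFinite(count)) return false;
  return value >= MIN_RATING_VALUE && count >= MIN_RATING_COUNT;
}

// Remove a below-threshold AggregateRating node and every host reference
// to it, so no dangling { @id } remains in the graph.
function dropWeakRating(graph) {
  const rating = graph.find((n) => n['@type'] === 'AggregateRating');
  if (!rating || aggregateRatingMeetsThresholds(rating)) return graph;
  return graph
    .filter((n) => n !== rating)
    .map((n) => {
      const { aggregateRating, ...rest } = n;
      return aggregateRating === undefined ? n : rest;
    });
}
```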

Also capture softwareapplication-default-offer as an info-severity TODO
under section 3.8. The original framing ('inject default free Offer when
AggregateRating is displayed') is too narrow — Google's Software App
rich-result spec requires offers.price *unconditionally*, plus one of
aggregateRating or review. The TODO is widened to: 'synthesize a default
free Offer on any primary-entity SoftwareApplication that lacks one.' This
matches Google's actual rule and captures the AR case as a subset.

Tests: 80/80 passing (73 + 7 new): aggregateRatingMeetsThresholds unit
tests (pass case, low value, low count, missing/non-numeric, null) plus
three end-to-end cases (low-value drops + reference cleanup, low-count
drops, threshold-meeting rating emits normally).

Lint: clean.
…ication

Complete softwareapplication-default-offer (promoted from info TODO to
error severity): when the page's primary SoftwareApplication (or subtype)
has no offers (missing property or empty array), the manager synthesizes
a default Offer at the canonical @id with price='0', priceCurrency='USD',
availability='https://schema.org/InStock', source='generated'. SA.offers
is set to [{ @id }] reference to the synthesized node.
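
As a rough illustration of the synthesis rule, the sketch below adds a default Offer when the first SoftwareApplication-family node has no offers. The function name synthesizeDefaultOffer and the "first match is primary" selection are assumptions; the internal source='generated' bookkeeping mentioned above is left out.

```javascript
// SoftwareApplication and the subtypes treated as equivalent.
const SA_TYPES = new Set([
  'SoftwareApplication', 'WebApplication', 'MobileApplication', 'VideoGame',
]);

// Add a default free Offer when the application node has no offers.
function synthesizeDefaultOffer(graph, pageUrl) {
  // Assumption: the first matching node is the page's primary entity.
  const app = graph.find((n) => SA_TYPES.has(n['@type']));
  if (!app) return graph;
  const hasOffers = Array.isArray(app.offers)
    ? app.offers.length > 0
    : Boolean(app.offers);
  if (hasOffers) return graph;
  const offer = {
    '@type': 'Offer',
    '@id': `${pageUrl}#offer`,
    price: '0',
    priceCurrency: 'USD',
    availability: 'https://schema.org/InStock',
  };
  return [
    ...graph.map((n) => (n === app ? { ...n, offers: [{ '@id': offer['@id'] }] } : n)),
    offer,
  ];
}
```

Both a missing offers property and an empty array count as "no offers" here, matching the two cases called out in the commit.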

Why this rule: Google's Software App rich result requires offers.price
*unconditionally*, plus one of (aggregateRating, review). The earlier
framing 'inject when AggregateRating is displayed' is too narrow — it
only fixed the AR-conditional cell of Google's actual rule. Broader
framing also subsumes the SA-with-only-review case and matches the
Adobe.com norm (products are gateway-free with paid tiers; producers
needing non-free Offer supply their own).

Also fix a contradiction surfaced by the new test fixture:
normalizeNode was rewriting ALL Offer @id values to canonical '#offer',
even when the producer supplied a distinct fragment ('#paid',
'#free-trial'). This contradicted repeatable-types ('distinct @id values
from producers are required to materialize multiple instances') and
Appendix A.2 (which shows two offers with distinct fragments). Fix:
for repeatable types, when the producer-supplied fragment differs from
the rule's default fragment, preserve the producer fragment but
canonicalize the URL prefix to the current canonical page URL.
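
The fragment-preservation rule could be expressed along these lines. canonicalId and its parameter list are hypothetical; the real logic lives inside normalizeNode and reads the repeatable flag and default fragment from the RULES table.

```javascript
// Compute the canonical @id for a node.
// - Singleton types always get the rule's default fragment.
// - Repeatable types keep a distinct producer fragment, re-based onto
//   the canonical page URL.
function canonicalId(producerId, pageUrl, defaultFragment, repeatable) {
  const hashAt = typeof producerId === 'string' ? producerId.indexOf('#') : -1;
  const fragment = hashAt >= 0 ? producerId.slice(hashAt) : '';
  const keep = repeatable && fragment.length > 1 && fragment !== defaultFragment;
  return `${pageUrl}${keep ? fragment : defaultFragment}`;
}
```

For example, a producer id ending in #paid on a preview host keeps #paid but gains the production page URL as its prefix, so two offers with distinct fragments stay distinct.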

Tests: 86/86 passing (80 + 6 new):
- synthesis on bare SoftwareApplication
- synthesis on bare WebApplication (SA subtype)
- no synthesis when SA already has offers
- no synthesis when no SA on page
- synthesis when offers is empty array
- distinct producer fragments (#paid, #free-trial) preserved
  (codifies the repeatable-types fix)

Lint: clean.
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager May 15, 2026 16:59 Inactive
hparra and others added 5 commits May 15, 2026 12:39
Extend softwareapplication-default-offer: the synthesized free Offer now
carries category: 'Free Trial' so the node is self-describing and
disambiguates from any future producer-supplied paid Offers. Reflects the
Adobe.com norm — primary entry to a product is a free trial of the paid
tier.

Spec, implementation, and test updated in lockstep.

Tests: 86/86 passing.
Lint: clean.
The last breadcrumb is the current page and conventionally has no <a>
tag (UX convention: don't link to the page you're on). The SEO
generator's fallback at that branch used window.location.href, which
includes query parameters and #hash. On a managed-graph audit URL this
leaked our debug params (?milolibs=...&jsonld-graph-manager=...&...)
into structured data. In production it would still leak session params
(?cmd=, ?segment=, etc.) and any current #hash.

The fix prefers <link rel="canonical">, falling back to the bare
origin+pathname of the current URL (query and hash stripped). Every
other ListItem.item in the BreadcrumbList already resolves to the
AEM-authored canonical via <a href>, so the last item now matches.

This is the gnav-side complement to the changes in this branch — the
JSON-LD Graph Manager rewrites cross-page WebPage references to the
canonical page id; this commit fixes the producer that was emitting
non-canonical URLs in the first place.

Tests: 'should create a breadcrumb SEO element' was unintentionally
validating the bug (it asserted item === window.location.href, which
matched whatever the runtime URL happened to be including the wtr
session id query param). Updated both assertions to expect the stripped
origin+pathname form.

Pre-existing test failure 'should localize breadcrumb links' is
unrelated to this change (port 2000 vs 8000 fixture mismatch in test
harness; reproduces on origin code).
…oduction origin

The previous fix only handled the last crumb. Looking at a real audit
(express qr-code-generator on aem.live) revealed all crumbs were aem.live
URLs, not just the last — because the authored <a href> values in
express breadcrumbs are relative ('/express', '/express/feature'), so
link.href resolves against the current document base and gives aem.live.

To mimic production rendering in the structured data regardless of
environment, rewrite every item URL whose hostname matches the current
rendering origin to use the canonical link's origin instead. Always strip
query string and #hash too. External URLs (different hostname) are
preserved but still get query/hash stripped.

Specifically:
- Read <link rel="canonical"> once; derive its origin as the production
  origin.
- For each ListItem: if the URL is same-origin as the current page,
  swap origin to production and strip query/hash. Otherwise keep origin,
  strip query/hash.
- Last-crumb fallback (no <a>): canonical URL when available, otherwise
  current-page origin+pathname (stripped).

On the express qr-code-generator audit URL, every ListItem.item is now
https://www.adobe.com/... with no debug params, ratings token, or
session noise.

Tests: 13 passing (11 prior + 2 new — canonical origin rewrite, external
URL preservation). The pre-existing 'should localize breadcrumb links'
failure is unrelated (port 2000 vs 8000 fixture mismatch; reproduces on
origin code).

Lint: clean.
Revert 129bc95 and 7e16408. Canonicalization of breadcrumb item
URLs belongs in the JSON-LD Graph Manager, not in the gnav producer:
it matches the existing defensive-normalization pattern (e.g.,
'#org' -> '#organization', cross-page WebPage rewriting), and it
keeps the producer-side blast radius scoped to the manager's feature
flag.

A follow-up commit adds canonicalizeBreadcrumbItems to the manager
along with a normative requirement in section 3.7 of the design doc.

This reverts commit 129bc95fb1d8a92dca6e1bf4ee44a8a4e8db8ddc and
commit 7e1640880b5fa1f1ec1fbc05923f1a3b80e0a2b3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Defensive normalization that replaces the (now-reverted) gnav-side fix.
Producers emit what's natural — relative <a href> values that resolve to
the current rendering origin (aem.live, aem.page, branch URLs); on the
last crumb, window.location.href with full query string and #hash. The
manager normalizes these to the canonical production form when ingesting.

New requirement breadcrumblist-items-canonical-origin (info, section 3.7):
for each BreadcrumbList.itemListElement[*].item, the manager rewrites
same-origin URLs to the origin of <link rel='canonical'> and strips query
strings and #hash from every item URL. External-host URLs preserve their
origin. The rewrite is skipped when no canonical link is present.

Why this lives in the manager rather than the producer:
- Same architectural pattern as #org -> #organization canonicalization
  and cross-page WebPage rewriting.
- Behavior gated by the existing jsonld-graph-manager feature flag; no
  blast radius on pages where the manager is off.
- Producer-side fix-up is still preferred long-term; this is defensive.

Implementation:
- canonicalizeBreadcrumbItems(node) helper exported alongside the other
  canonicalize* functions.
- Called per-node in rebuild() alongside rewriteCrossPageRefs and
  canonicalizeReferences.
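
A minimal sketch of the breadcrumb rewrite, assuming the current origin and the canonical link's href are passed in as arguments (the real helper takes only the node and presumably reads both from the page). Only string-valued item fields are handled, and canonicalizeItemUrl is a hypothetical helper name.

```javascript
// Strip query and hash; swap same-origin URLs onto the canonical origin.
function canonicalizeItemUrl(itemUrl, currentOrigin, canonicalHref) {
  if (!canonicalHref) return itemUrl; // rewrite is skipped without a canonical link
  let url;
  try {
    url = new URL(itemUrl);
  } catch {
    return itemUrl; // assumption: unparseable values are left untouched
  }
  url.search = '';
  url.hash = '';
  if (url.origin === currentOrigin) {
    return `${new URL(canonicalHref).origin}${url.pathname}`;
  }
  return url.href; // external host: origin preserved
}

// Apply the rewrite to every ListItem.item of a BreadcrumbList node.
function canonicalizeBreadcrumbItems(node, currentOrigin, canonicalHref) {
  if (node['@type'] !== 'BreadcrumbList' || !Array.isArray(node.itemListElement)) {
    return node;
  }
  return {
    ...node,
    itemListElement: node.itemListElement.map((li) => (
      li && typeof li.item === 'string'
        ? { ...li, item: canonicalizeItemUrl(li.item, currentOrigin, canonicalHref) }
        : li
    )),
  };
}
```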

Tests: 91 passing (86 + 5 new — non-BC no-op, missing-canonical no-op,
same-origin rewrite with query/hash strip, external-host preservation,
end-to-end with a producer-emitted BreadcrumbList on a non-prod hostname).

Section 4.3 updated to mention this manager handling.

Lint: clean.
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager May 15, 2026 22:03 Inactive
@github-actions github-actions Bot removed the Stale label May 16, 2026
hparra and others added 4 commits May 16, 2026 11:47
…uery param

Add an escape hatch so callers can leave specific producer JSON-LD
untouched. When a producer script matches an entry on the ignore list,
the manager does not parse it for ingestion, does not remove it from
the DOM, and does not let it contribute to the managed graph. Off by
default.

Spec (sections 3.7, 3.4, 6.1):
- New normative requirement ignore-types-bypass (info) describing the
  match rules: case-insensitive lowercase comparison against schema.org
  @type values, plus a special pseudo-type 'graph' that matches any
  script whose top-level shape is { '@graph': [...] } regardless of its
  contents.
- Added a rule-interaction aside under section 3.4 documenting that
  webpage-canonical-singleton, organization-singleton, and
  breadcrumblist-singleton remain satisfied via baseline synthesis or
  'when applicable' semantics; required-primary-type is at risk if
  callers ignore their sole primary type.
- Section 6.1 extended with the new ignore-types flag and example.
- Section 6.2 lists a new 'ignored' debug event.

Implementation (libs/features/jsonld-graph-manager/jsonld-graph-manager.js):
- parseIgnoreParam(search) reads the comma-separated query parameter,
  trims, lowercases, and drops empty entries. Exported.
- shouldIgnoreScript(scriptEl, ignoreTypes) parses the script JSON,
  detects the @graph pseudo-type, walks top-level @type values, and
  returns true if any match. When a script has mixed types and only
  some match, the whole script is bypassed and a Lana warn is emitted
  recommending split-or-use-'graph'. Exported.
- enqueue() gates on shouldIgnoreScript before queueing. Applies to
  both bootDom and runtime entry paths (both end at enqueue).
- JsonLdGraphManager constructor accepts ignoreTypes option for
  testability; default falls back to module-level IGNORE_TYPES parsed
  from the URL.
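
The parameter parsing described above can be sketched in a few lines. The parameter name jsonld-graph-manager-ignore is the one used elsewhere in this PR; the rest is an illustrative reading of "comma-separated, trimmed, lowercased, empties dropped", not the exported function itself.

```javascript
// Parse the ignore list from a location.search-style string.
function parseIgnoreParam(search) {
  const raw = new URLSearchParams(search).get('jsonld-graph-manager-ignore') ?? '';
  return raw
    .split(',')
    .map((s) => s.trim().toLowerCase())
    .filter(Boolean);
}
```

A missing parameter yields an empty list, which keeps the feature off by default.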

Tests: 102 passing (91 + 11 new) — parseIgnoreParam empty/whitespace,
case-insensitive matching, @graph pseudo-type, unparseable JSON,
end-to-end script-in-DOM preservation, sibling-not-affected, @graph
bypass, mixed-type Lana warn, runtime/MutationObserver path.

Lint: clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
flattenPayload only handled the common case where a script's entire
content was { '@context', '@graph': [...] }. Two edge cases leaked
producer-side wrapper objects (or lost typed-wrapper fields) into the
managed output:

- Case A: a script whose top-level content is an array containing a
  graph wrapper, e.g. [{Article}, {'@graph': [Video]}]. The wrapper
  passed through normalizeNode without an @type, got keyed by
  JSON.stringify, and appeared as a node in the managed @graph carrying
  a residual '@graph' property. The Video inside was stranded.

- Case B: a typed object that also carried '@graph', e.g.
  { '@type': 'WebPage', name: 'X', '@graph': [...] }. The old code
  returned only the inner '@graph' contents and silently dropped the
  WebPage's name and other top-level fields.

Fix: make flattenPayload recursive. Arrays flatMap through flattenPayload.
Objects with '@graph' yield their inner contents (flattened), plus the
wrapper-minus-'@graph'-and-'@context' as a sibling iff it has '@type'.
Nested wrappers flatten to any depth. The managed @graph is now
guaranteed to contain no node carrying its own '@graph' property.
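
One way to write the recursive flatten described above is sketched below. The ordering of the typed wrapper relative to its inner nodes, and returning an empty list for non-object input, are assumptions not stated in the commit.

```javascript
// Recursively flatten a parsed JSON-LD payload into a flat list of nodes.
function flattenPayload(payload) {
  // Arrays: flatten each element (covers Case A).
  if (Array.isArray(payload)) return payload.flatMap((p) => flattenPayload(p));
  // Assumption: primitives and null contribute nothing.
  if (!payload || typeof payload !== 'object') return [];
  // Plain node without a graph wrapper.
  if (!('@graph' in payload)) return [payload];
  // Wrapper: split off '@graph' and '@context'; keep the remainder.
  const { '@graph': inner, '@context': _context, ...wrapper } = payload;
  const nodes = flattenPayload(inner);
  // Typed wrappers survive as a sibling node (covers Case B).
  return '@type' in wrapper ? [...nodes, wrapper] : nodes;
}
```

Because every '@graph' property is removed on the way down, no node in the result can carry an embedded '@graph'.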

Spec section 2.3 updated to make 'recursively flattened' explicit and
to document the typed-wrapper split.

Tests: 107 passing (102 + 5 new) — array-with-wrapper, typed-with-graph
preserving fields, nested wrappers, pure wrapper, plus an end-to-end
managed-graph assertion confirming no embedded '@graph' property leaks.

Lint: clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Real-world repro from the express qr-code-generator page: the team ships
a pre-baked graph as [{ @context, @graph: [WebPage, SoftwareApplication,
BreadcrumbList, FAQPage, ... ] }] — an array wrapping a single graph
container. With jsonld-graph-manager-ignore=graph the user expected the
whole script to be bypassed; instead, the manager ingested every node
inside the wrapper.

Cause: shouldIgnoreScript only treated the script as a graph container
when the parsed JSON itself was { '@graph': [...] }. The array branch
walked each element looking for a string @type, found none on the
wrapper, and returned false. The script entered the queue and the
recursive flatten unpacked every inner node into the managed graph.

Fix: rewrite shouldIgnoreScript around a unified 'match ids' model.
Each top-level item (the parsed content itself if it's an object, or
each element if it's an array) contributes up to two ids:
- 'graph' if the item has an @graph property
- lowercase @type if the item carries a string @type

A script is bypassed when any id is on the ignore list. Mixed cases —
some ids match, some don't — still bypass the whole script and emit the
existing Lana warning. The data-milo-jsonld='graph' attribute on the
manager's output script ensures consumers can always distinguish the
manager-emitted graph from a bypassed producer script in DOM.
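
A minimal sketch of the match-ids model as described in this commit, under stated assumptions: the signature of shouldIgnoreScript, the matchIds helper, the Set-based ignore list, and the warn callback (standing in for the Lana warning) are all hypothetical and are not taken from the PR's code.

```js
// Hypothetical helper: collect up to two ids per top-level item.
function matchIds(parsed) {
  const items = Array.isArray(parsed) ? parsed : [parsed];
  const ids = new Set();
  for (const item of items) {
    if (item === null || typeof item !== 'object') continue;
    // 'graph' pseudo-type for any item that carries an @graph property.
    if ('@graph' in item) ids.add('graph');
    // Lowercased @type, only when it is a plain string.
    if (typeof item['@type'] === 'string') ids.add(item['@type'].toLowerCase());
  }
  return ids;
}

// Assumed signature: ignoreList is a Set of lowercase ids; warn stands in
// for the Lana warning and receives the ids that did not match.
function shouldIgnoreScript(parsed, ignoreList, warn = () => {}) {
  const ids = [...matchIds(parsed)];
  const matched = ids.filter((id) => ignoreList.has(id));
  if (matched.length === 0) return false;
  const unmatched = ids.filter((id) => !ignoreList.has(id));
  // Mixed case: the whole script is still bypassed, but a warning is emitted.
  if (unmatched.length > 0) warn(unmatched);
  return true;
}
```

In this sketch, the array-wrapped graph from the repro contributes only the id 'graph', so ignore=graph bypasses the script, while a type name that appears only inside the wrapper does not.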

Semantic refinement: the @graph wrapper is no longer 'transparent' for
type matching; its inner @types do not satisfy a type-name ignore on
their own. Callers who want to bypass a script whose types appear only
inside an @graph wrapper should include 'graph' in the ignore list. This
brings shouldIgnoreScript in line with the existing spec language ('a
match is the pseudo-type graph') rather than the previous
implementation's leakier behavior. One test asserted the leaky behavior
(line 1401), but its inline comment already described the new, correct
behavior; the assertion now matches the comment.

Spec section 3.7 updated to describe the match-ids model and the
data-milo-jsonld='graph' marker.

Tests: 110 passing (107 + 3 new) — array-wrapped @graph match, @type+@graph
single object with mixed warning, end-to-end DA Express scenario with the
exact shape from the production repro.

Lint: clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous fix added array-wrapped-wrapper detection for the 'graph'
pseudo-type but only inspected top-level @type values for name-based
matching. That fell short of the user's intent: if 'breadcrumblist' is
on the ignore list and a producer's pre-baked graph contains a
BreadcrumbList nested inside an @graph wrapper, the script should be
bypassed — same as a free-standing BreadcrumbList script.

Fix: type-name matching now uses the same recursive flattenPayload pass
that ingestion uses. Every @type discovered at any depth (top level or
inside @graph wrappers, even nested wrappers) is considered for the
match. The 'graph' pseudo-type retains its short-circuit semantics —
if a wrapper exists at the top level and 'graph' is on the ignore list,
the script is bypassed immediately with no further analysis (and no
mixed-types warning).

When the recursive type set contains both matched and unmatched ids,
the whole script is still bypassed, but a Lana warning is logged that
lists the unmatched ids. Example: a producer ships [{@graph: [WebPage,
BreadcrumbList]}] and the user passes ignore=breadcrumblist; the script
is bypassed and the warning lists 'webpage' as also dropped.
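
The revised semantics might look roughly like the sketch below. It is not the PR's implementation: the commit says the real code reuses the recursive flattenPayload pass, whereas this sketch uses its own small walker (collectTypes, a hypothetical name) to stay self-contained. The signature, the Set-based ignore list, and the warn callback standing in for the Lana warning are assumptions as well.

```js
// Hypothetical walker: gather every string @type at any depth, lowercased.
function collectTypes(payload, out = new Set()) {
  if (Array.isArray(payload)) {
    payload.forEach((item) => collectTypes(item, out));
    return out;
  }
  if (payload === null || typeof payload !== 'object') return out;
  if (typeof payload['@type'] === 'string') out.add(payload['@type'].toLowerCase());
  // Descend into @graph wrappers, including nested ones.
  if ('@graph' in payload) collectTypes(payload['@graph'], out);
  return out;
}

// Assumed signature: ignoreList is a Set of lowercase ids; warn receives
// the type ids that were not on the ignore list.
function shouldIgnoreScript(parsed, ignoreList, warn = () => {}) {
  const top = Array.isArray(parsed) ? parsed : [parsed];
  const hasWrapper = top.some(
    (item) => item !== null && typeof item === 'object' && '@graph' in item,
  );
  // 'graph' short-circuit: bypass immediately, no mixed-types warning.
  if (hasWrapper && ignoreList.has('graph')) return true;

  const types = [...collectTypes(parsed)];
  if (!types.some((t) => ignoreList.has(t))) return false;
  const unmatched = types.filter((t) => !ignoreList.has(t));
  if (unmatched.length > 0) warn(unmatched);
  return true;
}
```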

Two earlier test assertions were renamed and flipped to match the new
recursive semantics — they had been validating the prior leaky behavior.

Spec section 3.7 updated: the description now distinguishes the 'graph'
short-circuit from recursive type-name matching and explicitly mentions
that nested @types in @graph wrappers count.

Tests: 111 passing (110 + 1 new) — added an end-to-end test that
reproduces the user's stated case (pre-baked array-wrapped graph + ignore
on a nested type) and verifies the bypass plus the mixed-types warning.

Lint: clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager May 16, 2026 19:27 Inactive
The manager's previous 'bootDom' source label only meant 'present when
the manager initialized' — by that point, block decoration had already
emitted runtime scripts that looked indistinguishable from HTML-authored
ones. Future policy work (e.g., 'leave HTML-authored scripts alone, only
manage runtime-emitted ones') needs an authoritative HTML-vs-runtime
signal.

Take a WeakSet snapshot of every JSON-LD script in <head> at the very
top of loadArea(document) — before setCountry, before checkForPageMods,
before any section decoration runs. Anything in the snapshot was in the
raw HTML; anything that arrives later was emitted by a Milo block or
feature.
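
A rough sketch of how such a snapshot could work. The function names and the window.miloJsonLd namespace come from this commit; the bodies, the explicit win parameter (added here so the sketch runs outside a browser), and the htmlJsonLd property name on the namespace are assumptions.

```js
// Hypothetical: return the shared namespace object, creating it on first use.
function jsonLdNs(win) {
  win.miloJsonLd ??= {};
  return win.miloJsonLd;
}

// Record every JSON-LD script currently under root. Idempotent: once a
// snapshot exists, later calls return it unchanged.
function snapshotHtmlJsonLd(win, root = win.document.head) {
  const ns = jsonLdNs(win);
  if (ns.htmlJsonLd) return ns.htmlJsonLd;
  ns.htmlJsonLd = new WeakSet(root.querySelectorAll('script[type="application/ld+json"]'));
  return ns.htmlJsonLd;
}

// True only for scripts captured by the snapshot; false before any snapshot
// has been taken and for scripts added afterward.
function isHtmlJsonLd(win, script) {
  return jsonLdNs(win).htmlJsonLd?.has(script) ?? false;
}
```

A WeakSet holds the script elements without keeping them alive, so a script that is later removed from the DOM can still be garbage-collected.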

Implementation:
- New libs/utils/jsonld-ns.js holds the small read-side API for the
  shared 'window.miloJsonLd' namespace: jsonLdNs(), snapshotHtmlJsonLd(),
  isHtmlJsonLd(). Idempotent snapshot; repeated calls are no-ops.
- libs/utils/utils.js: inline three-line snapshot at the top of loadArea
  (the file deliberately avoids static imports; dynamic-import would
  introduce a microtask gap during which producers could sneak scripts
  in, so the snapshot is inlined).
- libs/features/jsonld-graph-manager/jsonld-graph-manager.js: rename
  the singleton handle 'window.miloJsonLdGraphManager' to
  'window.miloJsonLd.manager' so the namespace holds both authored
  scripts and the manager instance under one key instead of two.
- Test reset helper updated for the new namespace.

No behavior change yet — the manager doesn't consume isHtmlJsonLd()
anywhere. This commit just establishes the signal so subsequent work
(e.g., a 'don't manage HTML-authored scripts' policy) has reliable data.

Tests: 117 passing (111 manager + 6 new for jsonld-ns covering
snapshot capture, runtime exclusion, idempotency, explicit-root, and
pre-snapshot no-op).

Lint: clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Contributor

This PR has not been updated recently and will be closed in 7 days if no action is taken. Please ensure all checks are passing, https://github.com/orgs/adobecom/discussions/997 provides instructions. If the PR is ready to be merged, please mark it with the "Ready for Stage" label.

@github-actions github-actions Bot added the Stale label May 24, 2026
@aem-code-sync aem-code-sync Bot temporarily deployed to hgpa/jsonld-graph-manager May 29, 2026 22:16 Inactive
hparra added 2 commits May 29, 2026 15:23
Lets local agent skill folders (now linked under .agents/skills/) stay
out of git status without tracking the symlinks.
…TML JSON-LD

Initialize JsonLdGraphManager near the top of loadArea(document) — after
canonical URL finalization, before any block/feature decoration — instead
of in documentPostSectionLoading. The manager's boot scan already captures
the JSON-LD present in the DOM as `bootDom` and its MutationObserver
captures everything emitted afterward as `runtime`, so it is its own
HTML-vs-runtime signal.

This makes the separate snapshot mechanism redundant: remove
libs/utils/jsonld-ns.js (jsonLdNs/snapshotHtmlJsonLd/isHtmlJsonLd), the
inline WeakSet snapshot in loadArea, and the htmlJsonLd reset in the
manager test helper. No consumer read isHtmlJsonLd(), so dropping the
always-on global state and unused signal is a net simplification.

Tradeoff: init now runs after checkForPageMods (canonical URL must be
final for page-scoped @id derivation), so MEP-injected JSON-LD lands in
the boot scan as `bootDom` rather than `runtime`. The signal only feeds
merge priority today, which is rarely contested for these producers.

Tests: 111 manager + 161 utils passing. Lint clean.
@github-actions github-actions Bot removed the Stale label May 30, 2026