24 changes: 24 additions & 0 deletions .pre-commit-config.yaml
@@ -232,6 +232,30 @@ repos:
        files: ^risk-map/yaml/personas\.yaml$
        pass_filenames: true

  # ---------------------------------------------------------------------------
  # Prose authoring linters. Both consume the shared _prose_tokens.py tokenizer
  # and ship warn-only (default; block-mode flip is C2 / sweep-close).
  # validate-yaml-prose-subset enforces ADR-017 D4 grammar rejections;
  # validate-prose-references enforces ADR-016 D6 sentinel ID resolution +
  # bare-camelCase rejection. Diagnostics emit to stderr in the format
  # `<hook-id>: <file>:<entry-id>:<field>[<index>]: <reason> at <token-snippet>`.
  # ---------------------------------------------------------------------------
  - repo: local
    hooks:
      - id: validate-yaml-prose-subset
        name: 'validate: YAML prose authoring subset'
        language: system
        entry: python3 scripts/hooks/precommit/validate_yaml_prose_subset.py
        files: ^risk-map/yaml/(components|controls|risks|personas)\.yaml$
        pass_filenames: true

      - id: validate-prose-references
        name: 'validate: YAML prose references (sentinels + IDs)'
        language: system
        entry: python3 scripts/hooks/precommit/validate_prose_references.py
        files: ^risk-map/yaml/(components|controls|risks|personas)\.yaml$
        pass_filenames: true

  # ---------------------------------------------------------------------------
  # Issue template regeneration. Runs before the validator so freshly
  # regenerated templates are validated in the same commit. Triggers on
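The diagnostic line format named in the hook comment above can be sketched as a small formatter. This is an illustration only: `Finding` and `format_finding` are hypothetical names, not the PR's types, and dropping `[<index>]` for bare-scalar fields is an assumption the diff does not confirm.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    """Hypothetical stand-in for one lint finding (not the PR's Diagnostic)."""

    hook_id: str
    file_path: Path
    entry_id: str
    field_name: str
    index: Optional[int]
    reason: str
    snippet: str


def format_finding(f: Finding) -> str:
    """Render `<hook-id>: <file>:<entry-id>:<field>[<index>]: <reason> at <token-snippet>`."""
    # Assumption: a bare-scalar field has no array position, so the index part is omitted.
    idx = "" if f.index is None else f"[{f.index}]"
    return (
        f"{f.hook_id}: {f.file_path.as_posix()}:{f.entry_id}:"
        f"{f.field_name}{idx}: {f.reason} at {f.snippet}"
    )


if __name__ == "__main__":
    import sys

    finding = Finding(
        "validate-yaml-prose-subset",
        Path("risk-map/yaml/risks.yaml"),
        "riskExample",  # hypothetical entry ID
        "shortDescription",
        0,
        "inline URL",
        "https://example.com",
    )
    # The hooks are described as emitting diagnostics to stderr.
    print(format_finding(finding), file=sys.stderr)
```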
6 changes: 3 additions & 3 deletions docs/adr/016-reference-strategy.md
@@ -46,7 +46,7 @@ Bare identifiers, anchored HTML, and inline URLs are all retired for prose menti

The sentinel carries an identifier only. Display text (the entity's `title` for the intra-document form; the `externalReferences[].title` for the reference form) is resolved at generation or render time from the referenced entry; authors do not hand-write titles inside sentinels, which eliminates a rename-drift class.

The grammar is unambiguously machine-parseable: a sentinel matches `\{\{(ref:)?[A-Za-z0-9_-]+\}\}`, and the namespace prefix is either present (`ref:`) or absent. The two forms cannot collide because `ref:` is a reserved prefix that no entity identifier can take (entity identifiers are camelCase, never carry colons).
The grammar is unambiguously machine-parseable: a sentinel matches `\{\{(ref:)?[A-Za-z0-9_.\-]+\}\}`, and the namespace prefix is either present (`ref:`) or absent. The two forms cannot collide because `ref:` is a reserved prefix that no entity identifier can take (entity identifiers are camelCase, never carry colons). The dot is included in the identifier character class to support canonical-form sub-technique IDs (e.g., `AML.T0040.001` from MITRE ATLAS, `T1059.003` from MITRE ATT&CK) when used as `externalReferences[].id` values; the entity-prefix forms (`riskXxx`, `controlXxx`, etc.) are camelCase and do not use the dot.

Chosen over bare `{{cwe-89}}` (no namespace prefix, relying on uniqueness of external IDs against entity enums) because external IDs are author-chosen at the per-entry scope. Two entries can each define an `externalReferences` entry with `id: cwe-89`; the sentinel needs to know it is resolving against the *entry's* external list, not against a global enum. The `ref:` prefix makes the resolution scope explicit and keeps the linter's job a local lookup rather than a cross-entry uniqueness assertion.
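
A minimal sketch of the sentinel grammar, using the two regexes quoted in this hunk. The helper name `find_sentinels` and the sample identifier `riskPromptInjection` are illustrative assumptions; the project's tokenizer in `_prose_tokens.py` may be structured differently.

```python
import re

# Pattern before this change: no dot in the identifier character class.
OLD_SENTINEL_RE = re.compile(r"\{\{(ref:)?[A-Za-z0-9_-]+\}\}")
# Pattern after this change: the dot is allowed, e.g. for AML.T0040.001.
SENTINEL_RE = re.compile(r"\{\{(ref:)?[A-Za-z0-9_.\-]+\}\}")


def find_sentinels(text: str) -> list[tuple[bool, str]]:
    """Return (is_external_ref, identifier) for every sentinel in text."""
    found = []
    for match in SENTINEL_RE.finditer(text):
        body = match.group(0)[2:-2]          # strip the {{ and }} delimiters
        is_external = match.group(1) is not None
        identifier = body[len("ref:"):] if is_external else body
        found.append((is_external, identifier))
    return found


if __name__ == "__main__":
    sample = "See {{riskPromptInjection}} and {{ref:AML.T0040.001}}."
    print(find_sentinels(sample))
    # The old pattern cannot match a dotted sub-technique ID.
    print(OLD_SENTINEL_RE.search("{{ref:AML.T0040.001}}"))
```

Because the colon is outside the identifier character class, a sentinel body can only contain `ref:` through the optional prefix group, which is what keeps the two forms from colliding.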

@@ -100,7 +100,7 @@ This mirrors CWE's `Related_Weaknesses` pattern: structured citation with a type

### D4. External references — sentinel-only prose, no inline URLs

Every outbound URL referenced in prose lives in the entry's `externalReferences` array. Prose references it via the sentinel form `{{ref:identifier}}` (D2), where `identifier` matches the structured entry's `id`. **No inline URLs of any form are permitted in prose** — not raw `https://` strings, not `<a href>` tags, not `[text](https://…)` markdown links. The `[text](url)` form is not part of [ADR-017](017-yaml-prose-authoring-subset.md)'s authoring subset.
Every outbound URL referenced in prose lives in the entry's `externalReferences` array. Prose references it via the sentinel form `{{ref:identifier}}` (D2), where `identifier` matches the structured entry's `id`. **Inline URL syntaxes of any scheme are not permitted in prose.** The rule is categorical: if a token in prose carries a URI scheme, the linter rejects it and the URL must move into `externalReferences`. [ADR-017](017-yaml-prose-authoring-subset.md) D4 rule 2 owns the exact tokenizer-level shape (RFC-3986 scheme-with-authority regex plus a named list for opaque-data schemes such as `mailto:`, `javascript:`, `data:`, `tel:` that lack `//` and would otherwise escape the primary regex), as the authoring-rules surface; the shared tokenizer at `scripts/hooks/precommit/_prose_tokens.py` is the implementation. Raw `<a href>` tags are blocked separately by ADR-017's HTML-tag rule. The categorical phrasing exists by design: a 3-form enumeration relies on a detection step the codebase does not have, so a non-enumerated scheme would slip through both linters as plain TEXT and ship to published artifacts. The threat profiles differ across schemes — `mailto:` and `tel:` are contact-exposure vectors, `javascript:` and `data:` are defense-in-depth XSS vectors, `ftp://` and `file://` are dead-scheme drift signals, `gs://`/`s3://`/`arn:` are legitimate cloud references that still belong in structured fields per [ADR-014](014-yaml-content-security-posture.md) P3 — but the architectural answer is the same in every case: a scheme in prose means a structured entry was missed.

The author flow is: add the structured entry first (with `type`, `id`, `title`, `url`), then reference it by sentinel in prose. Examples:

@@ -185,7 +185,7 @@ For `externalReferences.type`, the enum is declared in this ADR (D3). For per-ty
| D3 per-entry `id` uniqueness | `validate_prose_references.py` (block) or schema | Machine-enforced (new) |
| D3 per-type `id` regex pattern | schema regex (single source in `external-references.schema.json`) | Machine-enforced once conformance sweep authors patterns ([ADR-022](022-supporting-schemas.md) D5b) |
| D3 `url` is `https://` | schema regex (single source in `external-references.schema.json`) | Machine-enforced (new) |
| D4 no inline URL in prose (any form) | `validate_prose_references.py` (block) | Machine-enforced (new) |
| D4 no inline URL syntaxes of any scheme in prose (categorical rule in [ADR-017](017-yaml-prose-authoring-subset.md) D4 rule 2) | `validate_prose_references.py` (block) + `validate-yaml-prose-subset` (block) | Machine-enforced (new) |
| D5 sentinel expansion in generators | generator build failure on unresolved sentinel | Machine-enforced (new) |
| D6 no raw HTML tags in prose | `validate_prose_references.py` (block) + ADR-017 | Machine-enforced (new) |
| D6 no bare camelCase IDs in prose | `validate_prose_references.py` (block) | Machine-enforced (new) |
8 changes: 6 additions & 2 deletions docs/adr/017-yaml-prose-authoring-subset.md
@@ -81,7 +81,11 @@ A new local hook lands under `.pre-commit-config.yaml` per [ADR-013](013-site-pr
- **Rejection format:** stderr line per offending paragraph: `validate-yaml-prose-subset: <file>:<entry-id>:<field>[<index>]: <reason> at <token-snippet>`. Exit non-zero on any rejection.
- **Rule list (block-mode end state):**
1. Accept `**bold**` (one nesting level), `*italic*`, `_italic_`, and the sentinel forms decided in [ADR-016](016-reference-strategy.md).
2. **Reject any prose containing `http://`, `https://`, or `]` followed by `(`.** The first two catch raw URLs and autolink-style mentions; the `](` pair catches markdown link syntax. This is the unconditional inline-URL block.
2. **Reject any prose carrying a URI scheme inline.** The rule is categorical, not an enumeration. The tokenizer applies two patterns:
- **Primary — scheme-with-authority:** `\b[a-z][a-z0-9+.\-]*://\S+` (RFC-3986 form). Catches `http://`, `https://`, `ftp://`, `file://`, `gs://`, `s3://`, `ssh://`, and any future scheme that follows the authority-bearing shape. Plus the markdown-link suffix `]` followed by `(`, which catches `[text](url)` regardless of the URL's scheme.
- **Secondary — opaque-data schemes:** a named list for colon-only schemes that lack `//` authority and would escape the primary regex. At minimum: `mailto:`, `javascript:`, `data:`, `tel:`. RFC-3986 permits the `scheme:opaque-data` form for any scheme; the named list captures the schemes browsers actively render as clickable actions or execute (contact-exposure for `mailto:`/`tel:`; XSS-class defense-in-depth for `javascript:`/`data:`). Adding to the named list is a tokenizer change with a one-line ADR amendment, not an architectural revisit.

The categorical shape exists by design. A 3-form enumeration (`http://`, `https://`, `](`) relies on a detection step the codebase does not have: the wrapper linters add no URL handling beyond the tokenizer, schemas validate structured-field shape rather than free prose content, and generators read prose verbatim. A non-enumerated scheme (`mailto:`, `gs://`, `javascript:`, etc.) would pass both linters as plain TEXT and ship to published artifacts. The categorical regex closes that gap uniformly and removes the per-scheme adjudication the prior enumeration would force on contributors and reviewers.
3. Reject raw HTML tags (any `<` followed by an alphabetic character or `/`).
4. Reject markdown headings, list markers, code, images, blockquotes, and tables per the D2 table.
5. Reject bare camelCase identifiers outside sentinels (delegated to [ADR-016](016-reference-strategy.md)'s reference linter, which shares the tokenizer per D5).
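
Rule 2's two patterns plus the markdown-link pair can be sketched as follows. The primary regex and the named scheme list are taken from the rule text. The rest is assumed for illustration: the function name `inline_url_reasons`, the reason strings, and the way opaque schemes are matched (word boundary, then the scheme, then a non-space character). The shared tokenizer may detect these differently.

```python
import re

# Primary pattern from rule 2: scheme-with-authority (RFC-3986 form).
SCHEME_WITH_AUTHORITY = re.compile(r"\b[a-z][a-z0-9+.\-]*://\S+")

# Secondary: named list of opaque-data schemes from rule 2.
# Assumption: require a non-space character after the colon so that ordinary
# prose such as "the data: see below" is not flagged.
OPAQUE_SCHEME = re.compile(r"\b(?:mailto|javascript|data|tel):\S")

# Markdown link syntax: `]` immediately followed by `(`.
MARKDOWN_LINK = "]("


def inline_url_reasons(text: str) -> list[str]:
    """Return the reasons (possibly none) for rejecting text under rule 2."""
    reasons = []
    if SCHEME_WITH_AUTHORITY.search(text):
        reasons.append("scheme-with-authority")
    if OPAQUE_SCHEME.search(text):
        reasons.append("opaque-data scheme")
    if MARKDOWN_LINK in text:
        reasons.append("markdown link")
    return reasons


if __name__ == "__main__":
    for sample in (
        "See https://example.com for details",
        "Contact mailto:a@example.com",
        "[docs](gs://bucket/x)",
        "Plain **bold** text with {{ref:cwe-89}}",
    ):
        print(sample, "->", inline_url_reasons(sample))
```

The sentinel `{{ref:cwe-89}}` passes because `ref:` is not in the named list and the text carries no `://`.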
@@ -121,7 +125,7 @@ Both surfaces are pointers to the same rules; the ADR is the decision, the doc i
| D1 `*italic*` and `_italic_` allowed | `validate-yaml-prose-subset` (accept) | Machine-enforced (new) |
| D1 sentinel forms allowed (grammar) | `validate-yaml-prose-subset` (accept token shape) | Machine-enforced (new) |
| D1 sentinel ID resolves to enum or `externalReferences` | `validate_prose_references.py` ([ADR-016](016-reference-strategy.md)) | Machine-enforced (new, ADR-016) |
| D2 inline URLs blocked unconditionally (`http://`, `https://`, `](`) | `validate-yaml-prose-subset` (block) | Machine-enforced (new) |
| D2 inline URLs blocked unconditionally (categorical: any scheme-with-authority via `\b[a-z][a-z0-9+.\-]*://\S+`, opaque-data schemes via named list `mailto:`/`javascript:`/`data:`/`tel:`, plus `](` for markdown links) | `validate-yaml-prose-subset` (block) | Machine-enforced (new) |
| D2 raw HTML tags blocked | `validate-yaml-prose-subset` (block) | Machine-enforced (new) |
| D2 markdown headings / lists / code / images / blockquotes / tables blocked | `validate-yaml-prose-subset` (block) | Machine-enforced (new) |
| D2 bare camelCase identifiers blocked | `validate_prose_references.py` ([ADR-016](016-reference-strategy.md)) | Machine-enforced (new, ADR-016) |
93 changes: 93 additions & 0 deletions scripts/hooks/precommit/_linter_types.py
@@ -0,0 +1,93 @@
"""
Shared types for the ADR-017/ADR-016 prose wrapper linters.

Both validate_yaml_prose_subset and validate_prose_references import from here
to ensure the NamedTuple shapes are identical across the two linters.
"""

import sys
from pathlib import Path
from typing import NamedTuple

# Ensure the scripts/hooks directory is on sys.path so ``precommit.*`` imports
# work when this file is executed or imported directly without the package on path.
_HOOKS_DIR = Path(__file__).resolve().parent.parent
if str(_HOOKS_DIR) not in sys.path:
    sys.path.insert(0, str(_HOOKS_DIR))

from precommit._prose_tokens import Token  # noqa: E402


class ProseField(NamedTuple):
    """A single prose field value extracted from a YAML entry.

    Attributes:
        file_path: Path to the source YAML file.
        entry_id: Value of the entry's 'id' field.
        field_name: Schema property name (e.g. 'shortDescription').
        index: Position of this string within its containing array (0-based),
            or None if the field value is a bare scalar (not currently used
            by the real schemas, but kept for forward-compatibility).
        raw_text: The decoded string value as returned by PyYAML.
        tokens: Token stream produced by tokenize(raw_text).
        nested_index: When the prose field shape uses one level of nesting
            (``items: oneOf [string, array<string>]``) and this record
            came from an inner-list string, this is the inner index
            within that list; ``index`` then holds the outer index.
            ``None`` for flat-array entries and bare-scalar entries.
    """

    file_path: Path
    entry_id: str
    field_name: str
    index: int | None
    raw_text: str
    tokens: list[Token]
    nested_index: int | None = None


class Diagnostic(NamedTuple):
    """A single lint finding from a prose wrapper linter.

    Attributes:
        hook_id: Pre-commit hook identifier (e.g. 'validate-yaml-prose-subset').
        file_path: Path to the YAML file that contains the violation.
        entry_id: ID of the YAML entry where the violation was found.
        field_name: Schema property name of the violating prose field.
        index: Paragraph index within the array (0-based); None only if
            field is a bare scalar.
        reason: Human-readable description of the violation.
    """

    hook_id: str
    file_path: Path
    entry_id: str
    field_name: str
    index: int | None
    reason: str


class IdIndex(NamedTuple):
    """Index of all known entity IDs and per-entry externalReferences IDs.

    Built by validate_prose_references.build_id_index() from the YAML corpus.
    All entity sets are frozensets (immutable after construction). The ext_refs
    dict maps each entry's ID to the frozenset of its externalReferences[].id
    values (per-entry scope per ADR-016 D2).

    Attributes:
        risks: All known risk IDs.
        controls: All known control IDs.
        components: All known component IDs.
        personas: All known persona IDs.
        ext_refs: entry_id → frozenset of externalReferences[].id values.
    """

    risks: frozenset[str]
    controls: frozenset[str]
    components: frozenset[str]
    personas: frozenset[str]
    # `ext_refs` is a mutable dict by type but treated as read-only after `build_id_index()` returns.
    # Not wrapped in MappingProxyType to keep the type signature simple; if this becomes a bug
    # source, wrap at the build_id_index return boundary.
    ext_refs: dict[str, frozenset[str]]
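
A minimal sketch of how a sentinel might be resolved against an index of this shape, and of the `MappingProxyType` wrapping the comment on `ext_refs` mentions as a possible follow-up. The `IdIndex` below is a local copy for illustration, typed with `Mapping` so it accepts the proxy. `resolves` and the sample IDs are hypothetical, not the PR's `build_id_index()` or its linter logic.

```python
from types import MappingProxyType
from typing import Mapping, NamedTuple


class IdIndex(NamedTuple):
    """Illustrative copy of the index shape; ext_refs typed as a read-only Mapping."""

    risks: frozenset[str]
    controls: frozenset[str]
    components: frozenset[str]
    personas: frozenset[str]
    ext_refs: Mapping[str, frozenset[str]]


def resolves(index: IdIndex, entry_id: str, is_external: bool, identifier: str) -> bool:
    """Check one sentinel: `ref:` forms resolve per entry, bare forms against entity IDs."""
    if is_external:
        # Per-entry scope: only the referencing entry's own externalReferences count.
        return identifier in index.ext_refs.get(entry_id, frozenset())
    known = index.risks | index.controls | index.components | index.personas
    return identifier in known


if __name__ == "__main__":
    index = IdIndex(
        risks=frozenset({"riskAlpha", "riskBeta"}),  # hypothetical IDs
        controls=frozenset(),
        components=frozenset(),
        personas=frozenset(),
        # Wrapping at construction makes accidental mutation raise TypeError.
        ext_refs=MappingProxyType({"riskAlpha": frozenset({"cwe-89"})}),
    )
    print(resolves(index, "riskAlpha", True, "cwe-89"))   # defined on this entry
    print(resolves(index, "riskBeta", True, "cwe-89"))    # not defined on this entry
    print(resolves(index, "riskBeta", False, "riskAlpha"))
```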