
Conformance sweep A3: prose tokenizer + linters + ADR amendments#262

Open
davidlabianca wants to merge 8 commits into cosai-oasis:main from davidlabianca:feature/A3-linters-foundation

Conversation

Contributor

@davidlabianca davidlabianca commented May 1, 2026

Conformance sweep A3: prose tokenizer + linters + ADR amendments

Closes #247 (see also scope addendum comment)

Folds in the prose-shape coverage fix triggered by the #262 review (shrey-bagga, ITERATE) — investigation, decisions, and a verification harness confirmed the issue, and the fix is executed in this PR.

Summary

  • Authors a shared prose tokenizer (scripts/hooks/precommit/_prose_tokens.py) per ADR-017 D5 — single source of truth for the YAML prose grammar with a partition-of-input invariant locked across 42 fixture pairs + 122 unit tests.
  • Adds two consumer linters that ride the shared tokenizer:
    • validate_yaml_prose_subset.py per ADR-017 D4 — accepts the canonical authoring subset (**bold**, *italic*/_italic_, sentinels) and rejects out-of-subset productions (inline URLs, raw HTML tags, markdown lists/headings/code/images/blockquotes/tables, folded-bullet drift heuristic per ADR-020 D4)
    • validate_prose_references.py per ADR-016 D6 — resolves {{idXxx}} and {{ref:identifier}} sentinels against schema enums + externalReferences[].id; rejects bare-camelCase IDs in prose; rejects raw inline URLs
  • Both linters ship in .pre-commit-config.yaml as warn-only hooks; emit <reason> at <token-snippet> diagnostics; --block toggle deferred to the sweep-closing commit.
  • Folds in ADR-016 / ADR-017 amendments (originally plan-tracked at § 2.7.2 + § 2.7.3) — the divergences were surfaced by A3 implementation and the tokenizer file lives in this branch, so co-locating amendments + impl in one PR makes end-to-end coherence reviewable. § 2.7 collapsed back to just 2.7.1 (block-flip).
  • Closes the prose-field shape-coverage gap surfaced in the #262 review (shrey-bagga, ITERATE): both linters under-covered riskmap.schema.json#/definitions/utils/text (array<string | array<string>>).
    • Lifts find_prose_fields / _collect_entries / _infer_schema_name / ProseField into a new shared scripts/hooks/precommit/_prose_fields.py (mirrors the _prose_tokens.py pattern), adds nested-list traversal, and admits file-level wrapper description fields that live outside the entity array.
    • A new nested_index: int | None field on ProseField keeps the existing diagnostic format additive.
    • Verified by 27 behavioral tests in scripts/hooks/tests/test_prose_field_shape_coverage.py (TDD red→green) covering 5 corpus sites today (3 nested-list in risks.yaml, 2 file-level wrapper in components.yaml + risks.yaml); duplication across the two wrappers is reduced by ~400 lines.
  • 8 commits, 147 files, +7918 / -5; 1957 pass / 6 skip; ruff lint + format-check clean. Smoke test on the live corpus reports 177 subset + 170 references warn-only diagnostics (was 144 + 150 pre-shape-fix; +33 / +20 from the new traversal paths) — expected and intentional; the corpus migrates to canonical form in later content-migration sub-PRs.
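The partition-of-input invariant mentioned above can be sketched as follows. This is a minimal stand-in, not the real `_prose_tokens.py`: the `Token` shape, the toy grammar (bold, italic, sentinels only), and the `check_partition` helper name are all illustrative assumptions; the actual module's grammar and diagnostics are richer.

```python
# Sketch of the partition-of-input invariant the tokenizer fixtures lock in.
# `tokenize` is a toy stand-in for _prose_tokens.tokenize (illustrative only).
import re
from dataclasses import dataclass

@dataclass
class Token:
    kind: str   # e.g. "bold", "italic", "sentinel", "text"
    raw: str    # the exact source slice this token covers

_PATTERN = re.compile(
    r"(?P<bold>\*\*[^*]+\*\*)"
    r"|(?P<italic>\*[^*]+\*|_[^_]+_)"
    r"|(?P<sentinel>\{\{[^}]+\}\})"
)

def tokenize(text: str) -> list[Token]:
    tokens, pos = [], 0
    for m in _PATTERN.finditer(text):
        if m.start() > pos:
            tokens.append(Token("text", text[pos:m.start()]))
        tokens.append(Token(m.lastgroup, m.group(0)))
        pos = m.end()
    if pos < len(text):
        tokens.append(Token("text", text[pos:]))
    return tokens

def check_partition(text: str) -> None:
    # Partition of input: concatenating raw token slices reproduces
    # the input exactly — no character is dropped or double-covered.
    assert "".join(t.raw for t in tokenize(text)) == text

check_partition("See **bold** and {{ref:cwe-89}} in *prose*.")
```

Locking this invariant across every fixture pair means any future grammar change that silently drops or overlaps characters fails immediately.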

Commit-by-commit

| # | SHA | Subject | Role |
|---|---------|---------|------|
| 1 | e9f5301 | feat(precommit): add shared _prose_tokens.py tokenizer (ADR-017 D5) | SWE — tokenizer module |
| 2 | af882ec | feat(precommit): add prose-subset + prose-references warn-only linters | SWE — wrapper linters |
| 3 | 467d5f4 | feat(precommit): wire prose-subset and prose-references hooks (warn-only) | SWE — .pre-commit-config.yaml |
| 4 | 697c76f | chore(adr): ADR-016 D2 grammar + categorical inline-URL rule (016/017 D4) | architect — ADR amendments |
| 5 | 0c4d4fa | feat(precommit): extend tokenizer to enforce categorical inline-URL rule | SWE — tokenizer extension implementing commit 4's categorical rule |
| 6 | e1fe955 | test(precommit): add prose-field shape-coverage probe (TDD red) | testing — behavioral red harness for shape-coverage gap |
| 7 | 72ad4ae | feat(precommit): extract _prose_fields.py helper; admit nested + wrapper prose | SWE — TDD green; closes #262 review gap |
| 8 | f2abe12 | test(precommit): add over-deep nesting probe; structural linter-symmetry assert | testing — pin one-level invariant; replace symmetry probes with structural identity |

Architect-tagged vs. SWE-tagged commits

Commit 4 (chore(adr): prefix) is architect-authored ADR text edits to docs/adr/016-reference-strategy.md and docs/adr/017-yaml-prose-authoring-subset.md. Commits 1, 2, 3, and 5 (feat(precommit): prefix) are SWE-authored implementation. The two work-types are intentionally distinct commits to keep agent-attribution legible. The overall PR description (this body) leads with the expanded scope so reviewer expectations are set up-front.

Commits 6, 7, and 8 land the prose-shape coverage fix in TDD red→green→refactor order: commit 6 (testing) authors a behavioral test file that is intentionally RED against HEAD-of-commit-6 (no source change yet); commit 7 (SWE) lifts the duplicated find_prose_fields block into _prose_fields.py, adds the nested-list branch and the file-level wrapper branch, and re-imports the helper into both wrappers — the test file becomes green at HEAD-of-commit-7; commit 8 (testing) adds an over-deep nesting probe and replaces the parametrised linter-symmetry suite with a single structural identity assert (since both wrappers now share the helper, symmetry is structural, not behavioral). The split makes the gap measurable as a real before/after diff and keeps testing/SWE attribution legible across the fix.
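Commit 8's structural identity assertion can be sketched with throwaway stand-in modules (the real test imports the two linter modules, which both re-export the helper from `_prose_fields`; the stand-in construction here is illustrative):

```python
# Sketch of the structural linter-symmetry assertion from commit 8,
# using SimpleNamespace stand-ins for the two linter modules.
import types

# Stand-in for the shared helper (plays the role of _prose_fields.find_prose_fields).
def find_prose_fields(yaml_path, schema_dir):
    ...

# Both wrappers re-export the same function object.
subset_module = types.SimpleNamespace(find_prose_fields=find_prose_fields)
references_module = types.SimpleNamespace(find_prose_fields=find_prose_fields)

# `is`, not `==`: identity of the function object makes symmetry structural,
# so any behavioral probe of one linter's traversal covers the other for free.
assert subset_module.find_prose_fields is references_module.find_prose_fields
```

This is why the parametrised count tests could be dropped without losing coverage signal: once the two names are the same object, divergence is impossible by construction.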

Scope boundaries

In scope (this PR):

  • New: scripts/hooks/precommit/_prose_tokens.py + _linter_types.py (shared tokenizer + types)
  • New: scripts/hooks/precommit/validate_yaml_prose_subset.py + validate_prose_references.py (warn-only linters)
  • Edit: .pre-commit-config.yaml (2 new hook slots)
  • Edit: docs/adr/016-reference-strategy.md D2 + D4 (sentinel grammar; categorical URL rule reframing)
  • Edit: docs/adr/017-yaml-prose-authoring-subset.md D4 rule 2 (3-form enumeration → categorical regex + opaque named list)
  • New tests: scripts/hooks/tests/test_prose_tokens.py, test_validate_yaml_prose_subset.py, test_validate_prose_references.py + fixtures/prose_subset/ (42 pairs) + fixtures/wrapper_linters/ (23 pairs)
  • New: scripts/hooks/precommit/_prose_fields.py — shared helper hosting ProseField, find_prose_fields, _collect_entries, _infer_schema_name, _iter_prose_strings, _find_wrapper_prose_field_names_in_schema
  • Edit: validate_yaml_prose_subset.py + validate_prose_references.py — replace local find_prose_fields blocks with from ._prose_fields import … (eliminates ~400 lines of duplicated iteration code from the two wrappers)
  • New tests: scripts/hooks/tests/test_prose_field_shape_coverage.py — behavioral coverage probe across all five utils/text shapes plus over-deep nesting and structural-symmetry checks (27 tests total)

Out of scope (deferred):

  • Block-mode flip (warn → block) → the sweep-closing sub-PR (plan § 2.7.1; was § 2.7 with 2.7.2/2.7.3 before this PR folded those in)
  • Identification-questions lint → A5 (already merged-or-merging as PR #244, Conformance sweep A5: validate-identification-questions lint)
  • YAML content migration → the content-migration sub-PRs (B1, B2)
  • Schema changes → A2's territory
  • Validator extensions (lifecycle-stage uniqueness, controls↔components mirror, etc.) → A4
  • Builder updates (sentinel expansion in generators) → A7

Test plan

  • pytest scripts/hooks/tests/test_prose_tokens.py — 122 tokenizer tests + 50 categorical-rule tests = 172 total, all green
  • pytest scripts/hooks/tests/test_validate_yaml_prose_subset.py — wrapper tests pass
  • pytest scripts/hooks/tests/test_validate_prose_references.py — wrapper tests pass
  • pytest scripts/hooks/tests/test_prose_field_shape_coverage.py -v — 27 instances green (post-flip behavioral assertions across 6 shape classes including new TestShapeWrapperDescription and TestShapeOverDeepNesting; structural symmetry replacement)
  • Full repo regression: pytest reports 1957 pass / 6 skip (post-cascade baseline; A1+A3 + shape-coverage)
  • ruff check . && ruff format --check . — clean
  • pre-commit run --all-files — runs warn-only against current corpus; emits ~177 subset diagnostics + ~170 references diagnostics (was ~144 + ~150 pre-shape-fix; +33 / +20 from the new nested-list traversal — warn-only is intentional during the sweep; corpus migration follows in B1/B2)
  • Confirm both linters import find_prose_fields from _prose_fields (no local copies remain in either wrapper); subset_module.find_prose_fields is references_module.find_prose_fields is enforced structurally by TestLinterSymmetry
  • Spot-check that the 5 corpus sites now flow through both linters: risk-map/yaml/risks.yaml:706, :981, :1245 (nested-list <strong> HTML) emit diagnostics; risk-map/yaml/components.yaml:19, risk-map/yaml/risks.yaml:24 (file-level wrapper description) walk under wrapper entry_id = <yaml-stem> and tokenize cleanly (clean prose, no diagnostics)
  • Spot-check the categorical URL rule: tokenizer rejects ftp:// / gopher:// / mailto: / javascript: / data: / tel: (catches authority-bearing schemes via the scheme-with-authority regex, opaque schemes via the named list); accepts **bold** / *italic* / {{idRiskPromptInjection}} / {{ref:cwe-89}}
  • Spot-check sentinel precedence: a URL adjacent to {{ref:cwe-89}} does not swallow the sentinel into the URL token's value
  • Spot-check ADR amendments: ADR-016 D2 prose grammar [A-Za-z0-9_.\-]+ matches the schema regex set; ADR-017 D4 rule 2 prose describes the categorical pair (regex + named list); ADR-016 D4 reframed to defer the tokenizer-level shape to ADR-017 D4 rule 2
  • Confirm bare-camelCase regex distinguishes "id appears as YAML field value" (legit) from "id appears in prose string" (block) — fixture cases prove it; this is the highest-risk false-positive class
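The categorical URL rule spot-checked above can be sketched directly from the two regexes quoted in this PR (the `inline_urls` helper name is illustrative; the real enforcement lives inside the tokenizer, not a standalone function):

```python
# Sketch of ADR-017 D4 rule 2's categorical pair: a scheme-with-authority
# regex plus an opaque-scheme named list, both case-insensitive.
import re

SCHEME_WITH_AUTHORITY = re.compile(r"\b[a-z][a-z0-9+.\-]*://[^\s{]+", re.IGNORECASE)
OPAQUE_SCHEMES = re.compile(r"\b(?:mailto|javascript|data|tel):[^\s{]+", re.IGNORECASE)

def inline_urls(text: str) -> list[str]:
    """Return every out-of-subset inline URL found in a prose string."""
    hits = [m.group(0) for m in SCHEME_WITH_AUTHORITY.finditer(text)]
    hits += [m.group(0) for m in OPAQUE_SCHEMES.finditer(text)]
    return hits

# Authority-bearing schemes are caught categorically, not by enumeration:
assert inline_urls("see gopher://old.example/1") == ["gopher://old.example/1"]
# The [^\s{] stop-at-brace variant preserves sentinel precedence:
assert inline_urls("https://x.test/{{ref:cwe-89}}") == ["https://x.test/"]
# Canonical subset productions carry no URL token:
assert inline_urls("**bold** and {{idRiskPromptInjection}}") == []
```

The second assertion shows why the stop-at-brace character class matters: the URL token ends at the `{` boundary instead of swallowing the adjacent sentinel.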

Co-Authored-By: AI Assistant <ai-assistant@coalitionforsecureai.org>

davidlabianca and others added 5 commits April 30, 2026 21:34
Single source of truth for the YAML prose grammar; consumed by the
forthcoming validate_yaml_prose_subset.py and validate_prose_references.py
linters. 106 tests + 42 fixture pairs lock the contract; partition-of-input
invariant verified across every fixture.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
Both wrappers (ADR-017 D4 / ADR-016 D6) consume the shared
_prose_tokens.py tokenizer; emit `<reason> at <token-snippet>`
diagnostics; --block toggle defers to C2. Schema-driven prose-field
discovery walks nested objects (tourContent). 154 tests; suite 1584/6.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
feat(precommit): wire prose-subset and prose-references hooks (warn-only)

Adds two .pre-commit-config.yaml slots invoking the wrappers from
be1313b on every commit touching the four content YAMLs. Both ship
warn-only (no --block); diagnostics emit to stderr; exit 0 unless C2
flips them to block.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
chore(adr): ADR-016 D2 grammar + categorical inline-URL rule (016/017 D4)

Amend ADR-016 D2 sentinel grammar to [A-Za-z0-9_.\-]+ (the dot supports
canonical sub-technique IDs like AML.T0040.001 used as externalReferences[].id;
entity-prefix forms remain camelCase). Reframe ADR-016 D4 to state the
no-inline-URL rule categorically and defer the tokenizer-level shape to
ADR-017 D4 rule 2. Reshape ADR-017 D4 rule 2 from a 3-form enumeration to a
categorical pair: a RFC-3986 scheme-with-authority regex
(\b[a-z][a-z0-9+.\-]*://\S+) plus an opaque-data named list for colon-only
schemes that lack // (mailto:, javascript:, data:, tel:); the markdown-link
suffix rule is preserved.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
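The amended sentinel grammar in this commit can be sketched as follows. The identifier class `[A-Za-z0-9_.\-]+` is quoted from the ADR text; the composite sentinel regex wrapping it is an illustrative assumption based on the `{{idXxx}}` / `{{ref:identifier}}` forms described in the PR body.

```python
# Sketch of the amended ADR-016 D2 sentinel grammar.
import re

SENTINEL = re.compile(r"\{\{(?:ref:)?(?P<identifier>[A-Za-z0-9_.\-]+)\}\}")

# The added dot admits canonical sub-technique IDs used as externalReferences[].id:
assert SENTINEL.fullmatch("{{ref:AML.T0040.001}}")["identifier"] == "AML.T0040.001"
# Entity-prefix forms remain camelCase and still match:
assert SENTINEL.fullmatch("{{idRiskPromptInjection}}")["identifier"] == "idRiskPromptInjection"
# Whitespace is outside the class, so malformed sentinels are rejected:
assert SENTINEL.fullmatch("{{bad id}}") is None
```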
Extend _prose_tokens.py to enforce the categorical inline-URL rule per
ADR-017 D4 rule 2: a primary RFC-3986 scheme-with-authority regex
\b[a-z][a-z0-9+.\-]*://[^\s{]+ catching every authority-bearing scheme
(http, https, ftp, file, gs, s3, ssh, git+https, etc.) plus an opaque-data
named list (mailto:/javascript:/data:/tel:). Both regexes are case-insensitive
(re.IGNORECASE). The [^\s{]+ stop-at-brace variant preserves sentinel
precedence — URLs adjacent to {{ref:...}} stop at the brace boundary
instead of swallowing the sentinel into the URL token's value.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
@davidlabianca davidlabianca self-assigned this May 1, 2026
@davidlabianca davidlabianca added enhancement New feature or request infrastructure labels May 1, 2026
@davidlabianca davidlabianca marked this pull request as ready for review May 1, 2026 20:27
@shrey-bagga
Contributor

Status: ITERATE

Summary:
The new prose linters should cover the full riskmap.schema.json#/definitions/utils/text shape: array<string | array<string>>. Today both wrappers inspect only top-level strings, so valid nested prose strings already present in the corpus are skipped.

Required change:

  • major: Apply the full nested-array traversal fix below. It adds test-first coverage for both wrappers, then updates both find_prose_fields() implementations to yield strings from valid nested prose arrays.

Complete proposed patch:

diff --git a/scripts/hooks/tests/test_validate_yaml_prose_subset.py b/scripts/hooks/tests/test_validate_yaml_prose_subset.py
--- a/scripts/hooks/tests/test_validate_yaml_prose_subset.py
+++ b/scripts/hooks/tests/test_validate_yaml_prose_subset.py
@@
     def test_prose_field_index_reflects_paragraph_position(self, tmp_path):
         r"""
         ProseField.index correctly identifies the paragraph index within the array.
@@
         fields = [f for f in find_prose_fields(yaml_path, schema_dir) if f.field_name == "shortDescription"]
         indices = {f.index for f in fields}
         assert {0, 1, 2} <= indices
 
+    def test_nested_array_prose_items_are_linted(self, tmp_path):
+        r"""
+        Nested strings inside a utils/text prose array are yielded and checked.
+
+        Given: A prose field containing a nested array item with raw HTML
+        When: find_prose_fields and check_prose_field run
+        Then: The nested string produces a subset diagnostic
+        """
+        schema_dir = tmp_path / "schemas"
+        schema_dir.mkdir()
+        prose_ref = {"$ref": "riskmap.schema.json#/definitions/utils/text"}
+        _write_mock_schema(schema_dir, "risk", ["riskAlpha"], extra_props={"shortDescription": prose_ref})
+        yaml_path = _write_yaml(
+            tmp_path,
+            "risks.yaml",
+            {
+                "risks": [
+                    {
+                        "id": "riskAlpha",
+                        "title": "Alpha",
+                        "shortDescription": ["Top prose.", ["Nested <strong>HTML</strong>."]],
+                    }
+                ]
+            },
+        )
+
+        diagnostics = []
+        for field in find_prose_fields(yaml_path, schema_dir):
+            diagnostics.extend(check_prose_field(field))
+
+        assert any("<strong>" in diag.reason for diag in diagnostics)
+
     def test_new_prose_field_in_schema_automatically_discovered(self, tmp_path):
         r"""
         A new prose field added to the schema is discovered without code changes.
diff --git a/scripts/hooks/tests/test_validate_prose_references.py b/scripts/hooks/tests/test_validate_prose_references.py
--- a/scripts/hooks/tests/test_validate_prose_references.py
+++ b/scripts/hooks/tests/test_validate_prose_references.py
@@
     def test_paragraph_index_matches_array_position(self, tmp_path):
         r"""
         ProseField.index matches the paragraph's position in the prose array.
@@
         fields = [f for f in find_prose_fields(yaml_path, schema_dir) if f.field_name == "description"]
         assert {0, 1} <= {f.index for f in fields}
 
+    def test_nested_array_prose_items_are_checked_for_references(self, tmp_path):
+        r"""
+        Nested strings inside a utils/text prose array are yielded and reference-checked.
+
+        Given: A prose field containing a nested array item with a bare risk ID
+        When: find_prose_fields and check_references run
+        Then: The nested string produces a reference diagnostic
+        """
+        schema_dir = tmp_path / "schemas"
+        schema_dir.mkdir()
+        _write_mock_schema(schema_dir, "risk", ["riskAlpha"])
+        yaml_path = _write_yaml(
+            tmp_path,
+            "risks.yaml",
+            {
+                "risks": [
+                    {
+                        "id": "riskAlpha",
+                        "title": "Alpha",
+                        "description": ["Top prose.", ["Nested bare riskBeta."]],
+                    }
+                ]
+            },
+        )
+
+        diagnostics = []
+        for field in find_prose_fields(yaml_path, schema_dir):
+            diagnostics.extend(check_references(field, _make_index(risks=["riskAlpha"])))
+
+        assert any("riskBeta" in diag.reason for diag in diagnostics)
+
     def test_nested_object_prose_fields_discovered(self, tmp_path):
         r"""
         Nested prose fields in object properties (e.g. tourContent.introduced) are
diff --git a/scripts/hooks/precommit/validate_yaml_prose_subset.py b/scripts/hooks/precommit/validate_yaml_prose_subset.py
--- a/scripts/hooks/precommit/validate_yaml_prose_subset.py
+++ b/scripts/hooks/precommit/validate_yaml_prose_subset.py
@@
 def _collect_entries(data: dict, schema: dict) -> Iterator[tuple[str, dict]]:
     """Yield (array_key, entry_dict) pairs from a YAML document.
@@
                 if isinstance(entry, dict):
                     yield key, entry
 
 
+def _iter_prose_strings(field_value: object) -> Iterator[tuple[int, str]]:
+    """Yield strings from the schema's utils/text shape.
+
+    The shared prose schema permits one nesting level:
+    array<string | array<string>>.
+    """
+    if isinstance(field_value, str):
+        yield 0, field_value
+        return
+
+    if not isinstance(field_value, list):
+        return
+
+    for idx, item in enumerate(field_value):
+        if isinstance(item, str):
+            yield idx, item
+        elif isinstance(item, list):
+            for nested_item in item:
+                if isinstance(nested_item, str):
+                    yield idx, nested_item
+
+
 def find_prose_fields(yaml_path: Path, schema_dir: Path) -> Iterator[ProseField]:
     """Yield ProseField objects for every prose array element in a YAML file.
@@
             if field_value is None:
                 continue
-            if isinstance(field_value, list):
-                for idx, item in enumerate(field_value):
-                    if isinstance(item, str):
-                        yield ProseField(
-                            file_path=yaml_path,
-                            entry_id=entry_id,
-                            field_name=field_name,
-                            index=idx,
-                            raw_text=item,
-                            tokens=tokenize(item),
-                        )
-            elif isinstance(field_value, str):
+            for idx, raw_text in _iter_prose_strings(field_value):
                 yield ProseField(
                     file_path=yaml_path,
                     entry_id=entry_id,
                     field_name=field_name,
-                    index=0,
-                    raw_text=field_value,
-                    tokens=tokenize(field_value),
+                    index=idx,
+                    raw_text=raw_text,
+                    tokens=tokenize(raw_text),
                 )
diff --git a/scripts/hooks/precommit/validate_prose_references.py b/scripts/hooks/precommit/validate_prose_references.py
--- a/scripts/hooks/precommit/validate_prose_references.py
+++ b/scripts/hooks/precommit/validate_prose_references.py
@@
 def _collect_entries(data: dict, schema: dict) -> Iterator[tuple[str, dict]]:
     """Yield (array_key, entry_dict) pairs from a YAML document.
@@
                 if isinstance(entry, dict):
                     yield key, entry
 
 
+def _iter_prose_strings(field_value: object) -> Iterator[tuple[int, str]]:
+    """Yield strings from the schema's utils/text shape.
+
+    The shared prose schema permits one nesting level:
+    array<string | array<string>>.
+    """
+    if isinstance(field_value, str):
+        yield 0, field_value
+        return
+
+    if not isinstance(field_value, list):
+        return
+
+    for idx, item in enumerate(field_value):
+        if isinstance(item, str):
+            yield idx, item
+        elif isinstance(item, list):
+            for nested_item in item:
+                if isinstance(nested_item, str):
+                    yield idx, nested_item
+
+
 def find_prose_fields(yaml_path: Path, schema_dir: Path) -> Iterator[ProseField]:
     """Yield ProseField objects for every prose array element in a YAML file.
@@
             if field_value is None:
                 continue
-            if isinstance(field_value, list):
-                for idx, item in enumerate(field_value):
-                    if isinstance(item, str):
-                        yield ProseField(
-                            file_path=yaml_path,
-                            entry_id=entry_id,
-                            field_name=field_name,
-                            index=idx,
-                            raw_text=item,
-                            tokens=tokenize(item),
-                        )
-            elif isinstance(field_value, str):
+            for idx, raw_text in _iter_prose_strings(field_value):
                 yield ProseField(
                     file_path=yaml_path,
                     entry_id=entry_id,
                     field_name=field_name,
-                    index=0,
-                    raw_text=field_value,
-                    tokens=tokenize(field_value),
+                    index=idx,
+                    raw_text=raw_text,
+                    tokens=tokenize(raw_text),
                 )

This keeps the current diagnostic format compatible by using the outer prose index for nested items. A more precise nested location format such as [3][0] would also work, but that would need a coordinated update to the shared ProseField / Diagnostic type annotations and diagnostic-format tests.

Verification I ran on the current PR branch:

  • test_prose_tokens.py, test_validate_yaml_prose_subset.py, and test_validate_prose_references.py: 326 passed

Next steps:

  • testing: land the two regression tests above.
  • swe: land the traversal helper changes above.
  • code-reviewer: rerun the focused linter tests and re-review.


@shrey-bagga shrey-bagga left a comment


proposed some changes

@davidlabianca
Contributor Author

Thanks @shrey-bagga - great catch... working on an iteration and will update the PR when complete.

davidlabianca and others added 3 commits May 2, 2026 20:35
This commit adds 30 behavioral tests in scripts/hooks/tests/test_prose_field_shape_coverage.py characterizing the full set of YAML shapes that riskmap.schema.json#/definitions/utils/text permits (bare string, flat array, pure nested array, mixed, file-level wrapper) and asserting both linters yield ProseField records for inner-list strings and wrapper-level prose. Tests are intentionally red against this commit; the next commit closes the gap.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
feat(precommit): extract _prose_fields.py helper; admit nested + wrapper prose

This commit lifts find_prose_fields, _collect_entries, _infer_schema_name, and ProseField construction from both prose linters into scripts/hooks/precommit/_prose_fields.py (mirrors the _prose_tokens.py pattern); adds a nested-list traversal branch that emits one ProseField per inner string with index=outer_idx and the new nested_index=inner_idx; admits file-level wrapper prose fields whose schema $ref resolves to utils/text using yaml_path.stem as the synthetic entry_id; turns Phase 1 red harness green (30/30) and removes ~400 lines of duplicated iteration code from the wrappers.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
test(precommit): add over-deep nesting probe; structural linter-symmetry assert

This commit adds TestShapeOverDeepNesting documenting the schema's one-level nesting limit as a tested invariant ([[["x"]]] yields zero ProseFields, no recursion past depth 1) and replaces TestLinterSymmetry's six parametrised count tests with a single structural identity assertion (subset_module.find_prose_fields is references_module.find_prose_fields) since both wrappers now re-export the same shared helper from _prose_fields. Drops the unused _SHAPE_FACTORIES list. Net -3 test instances, same coverage signal.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
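The one-level nesting invariant this commit pins down can be sketched by restating the traversal helper from the review patch (names mirror the proposed `_iter_prose_strings`; this is a sketch, not the merged source):

```python
# Sketch of the one-level nesting invariant TestShapeOverDeepNesting documents:
# utils/text permits array<string | array<string>>, so depth-2 lists yield nothing.
from collections.abc import Iterator

def iter_prose_strings(field_value: object) -> Iterator[tuple[int, str]]:
    if isinstance(field_value, str):
        yield 0, field_value
        return
    if not isinstance(field_value, list):
        return
    for idx, item in enumerate(field_value):
        if isinstance(item, str):
            yield idx, item
        elif isinstance(item, list):
            for nested_item in item:          # one level only — no recursion
                if isinstance(nested_item, str):
                    yield idx, nested_item

assert list(iter_prose_strings(["top", ["nested"]])) == [(0, "top"), (1, "nested")]
assert list(iter_prose_strings([[["x"]]])) == []  # over-deep input is ignored
```

Nested items reuse the outer index, which is what keeps the diagnostic format additive until `nested_index` is consulted.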
@davidlabianca
Contributor Author

Addressed the iteration request... @shrey-bagga ptal

@davidlabianca davidlabianca requested a review from shrey-bagga May 2, 2026 20:46
