
Conformance sweep A3: prose tokenizer + linters + ADR amendments#262

Open
davidlabianca wants to merge 8 commits into cosai-oasis:main from davidlabianca:feature/A3-linters-foundation

Conversation

Contributor

@davidlabianca davidlabianca commented May 1, 2026

Conformance sweep A3: prose tokenizer + linters + ADR amendments

Closes #247 (see also scope addendum comment)

Folds in the prose-shape coverage fix triggered by the #262 review (shrey-bagga, ITERATE) — investigation, decisions, and a verification harness confirmed the issue, and the fix is executed in this PR.

Summary

  • Authors a shared prose tokenizer (scripts/hooks/precommit/_prose_tokens.py) per ADR-017 D5 — single source of truth for the YAML prose grammar with a partition-of-input invariant locked across 42 fixture pairs + 122 unit tests.
  • Adds two consumer linters that ride the shared tokenizer:
    • validate_yaml_prose_subset.py per ADR-017 D4 — accepts the canonical authoring subset (**bold**, *italic*/_italic_, sentinels) and rejects out-of-subset productions (inline URLs, raw HTML tags, markdown lists/headings/code/images/blockquotes/tables, folded-bullet drift heuristic per ADR-020 D4)
    • validate_prose_references.py per ADR-016 D6 — resolves {{idXxx}} and {{ref:identifier}} sentinels against schema enums + externalReferences[].id; rejects bare-camelCase IDs in prose; rejects raw inline URLs
  • Both linters ship in .pre-commit-config.yaml as warn-only hooks; emit <reason> at <token-snippet> diagnostics; --block toggle deferred to the sweep-closing commit.
  • Folds in ADR-016 / ADR-017 amendments (originally plan-tracked at § 2.7.2 + § 2.7.3) — the divergences were surfaced by A3 implementation and the tokenizer file lives in this branch, so co-locating amendments + impl in one PR makes end-to-end coherence reviewable. § 2.7 collapsed back to just 2.7.1 (block-flip).
  • Closes the prose-field shape-coverage gap surfaced in the #262 review (shrey-bagga, ITERATE): both linters under-covered riskmap.schema.json#/definitions/utils/text (array<string | array<string>>).
    • Lifts find_prose_fields / _collect_entries / _infer_schema_name / ProseField into a new shared scripts/hooks/precommit/_prose_fields.py (mirrors the _prose_tokens.py pattern), adds nested-list traversal, and admits file-level wrapper description fields that live outside the entity array.
    • A new nested_index: int | None field on ProseField keeps the existing diagnostic format additive.
    • Verified by 27 behavioral tests in scripts/hooks/tests/test_prose_field_shape_coverage.py (TDD red→green) covering 5 corpus sites today (3 nested-list in risks.yaml, 2 file-level wrapper in components.yaml + risks.yaml); duplication across the two wrappers is reduced by ~400 lines.
  • 8 commits, 147 files, +7918 / -5; 1957 pass / 6 skip; ruff lint + format-check clean. Smoke test on the live corpus reports 177 subset + 170 references warn-only diagnostics (was 144 + 150 pre-shape-fix; +33 / +20 from the new traversal paths) — expected and intentional; the corpus migrates to canonical form in later content-migration sub-PRs.
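The partition-of-input invariant mentioned above can be sketched as follows. This is a minimal stand-in, not the real `_prose_tokens.py`: the `Token` shape, the toy grammar (bold, italic, sentinels only), and the `check_partition` helper name are all illustrative assumptions; the actual module's grammar and diagnostics are richer.

```python
# Sketch of the partition-of-input invariant the tokenizer fixtures lock in.
# `tokenize` is a toy stand-in for _prose_tokens.tokenize (illustrative only).
import re
from dataclasses import dataclass

@dataclass
class Token:
    kind: str   # e.g. "bold", "italic", "sentinel", "text"
    raw: str    # the exact source slice this token covers

_PATTERN = re.compile(
    r"(?P<bold>\*\*[^*]+\*\*)"
    r"|(?P<italic>\*[^*]+\*|_[^_]+_)"
    r"|(?P<sentinel>\{\{[^}]+\}\})"
)

def tokenize(text: str) -> list[Token]:
    tokens, pos = [], 0
    for m in _PATTERN.finditer(text):
        if m.start() > pos:
            tokens.append(Token("text", text[pos:m.start()]))
        tokens.append(Token(m.lastgroup, m.group(0)))
        pos = m.end()
    if pos < len(text):
        tokens.append(Token("text", text[pos:]))
    return tokens

def check_partition(text: str) -> None:
    # Partition of input: concatenating raw token slices reproduces
    # the input exactly — no character is dropped or double-covered.
    assert "".join(t.raw for t in tokenize(text)) == text

check_partition("See **bold** and {{ref:cwe-89}} in *prose*.")
```

Locking this invariant across every fixture pair means any future grammar change that silently drops or overlaps characters fails immediately.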

Commit-by-commit

| # | SHA | Subject | Role |
|---|---------|---------|------|
| 1 | e9f5301 | feat(precommit): add shared _prose_tokens.py tokenizer (ADR-017 D5) | SWE — tokenizer module |
| 2 | af882ec | feat(precommit): add prose-subset + prose-references warn-only linters | SWE — wrapper linters |
| 3 | 467d5f4 | feat(precommit): wire prose-subset and prose-references hooks (warn-only) | SWE — .pre-commit-config.yaml |
| 4 | 697c76f | chore(adr): ADR-016 D2 grammar + categorical inline-URL rule (016/017 D4) | architect — ADR amendments |
| 5 | 0c4d4fa | feat(precommit): extend tokenizer to enforce categorical inline-URL rule | SWE — tokenizer extension implementing commit 4's categorical rule |
| 6 | e1fe955 | test(precommit): add prose-field shape-coverage probe (TDD red) | testing — behavioral red harness for shape-coverage gap |
| 7 | 72ad4ae | feat(precommit): extract _prose_fields.py helper; admit nested + wrapper prose | SWE — TDD green; closes #262 review gap |
| 8 | f2abe12 | test(precommit): add over-deep nesting probe; structural linter-symmetry assert | testing — pin one-level invariant; replace symmetry probes with structural identity |

Architect-tagged vs. SWE-tagged commits

Commit 4 (chore(adr): prefix) is architect-authored ADR text edits to docs/adr/016-reference-strategy.md and docs/adr/017-yaml-prose-authoring-subset.md. Commits 1, 2, 3, and 5 (feat(precommit): prefix) are SWE-authored implementation. The two work-types are intentionally distinct commits to keep agent-attribution legible. The overall PR description (this body) leads with the expanded scope so reviewer expectations are set up-front.

Commits 6, 7, and 8 land the prose-shape coverage fix in TDD red→green→refactor order: commit 6 (testing) authors a behavioral test file that is intentionally RED against HEAD-of-commit-6 (no source change yet); commit 7 (SWE) lifts the duplicated find_prose_fields block into _prose_fields.py, adds the nested-list branch and the file-level wrapper branch, and re-imports the helper into both wrappers — the test file becomes green at HEAD-of-commit-7; commit 8 (testing) adds an over-deep nesting probe and replaces the parametrised linter-symmetry suite with a single structural identity assert (since both wrappers now share the helper, symmetry is structural, not behavioral). The split makes the gap measurable as a real before/after diff and keeps testing/SWE attribution legible across the fix.
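Commit 8's structural identity assertion can be sketched with throwaway stand-in modules (the real test imports the two linter modules, which both re-export the helper from `_prose_fields`; the stand-in construction here is illustrative):

```python
# Sketch of the structural linter-symmetry assertion from commit 8,
# using SimpleNamespace stand-ins for the two linter modules.
import types

# Stand-in for the shared helper (plays the role of _prose_fields.find_prose_fields).
def find_prose_fields(yaml_path, schema_dir):
    ...

# Both wrappers re-export the same function object.
subset_module = types.SimpleNamespace(find_prose_fields=find_prose_fields)
references_module = types.SimpleNamespace(find_prose_fields=find_prose_fields)

# `is`, not `==`: identity of the function object makes symmetry structural,
# so any behavioral probe of one linter's traversal covers the other for free.
assert subset_module.find_prose_fields is references_module.find_prose_fields
```

This is why the parametrised count tests could be dropped without losing coverage signal: once the two names are the same object, divergence is impossible by construction.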

Scope boundaries

In scope (this PR):

  • New: scripts/hooks/precommit/_prose_tokens.py + _linter_types.py (shared tokenizer + types)
  • New: scripts/hooks/precommit/validate_yaml_prose_subset.py + validate_prose_references.py (warn-only linters)
  • Edit: .pre-commit-config.yaml (2 new hook slots)
  • Edit: docs/adr/016-reference-strategy.md D2 + D4 (sentinel grammar; categorical URL rule reframing)
  • Edit: docs/adr/017-yaml-prose-authoring-subset.md D4 rule 2 (3-form enumeration → categorical regex + opaque named list)
  • New tests: scripts/hooks/tests/test_prose_tokens.py, test_validate_yaml_prose_subset.py, test_validate_prose_references.py + fixtures/prose_subset/ (42 pairs) + fixtures/wrapper_linters/ (23 pairs)
  • New: scripts/hooks/precommit/_prose_fields.py — shared helper hosting ProseField, find_prose_fields, _collect_entries, _infer_schema_name, _iter_prose_strings, _find_wrapper_prose_field_names_in_schema
  • Edit: validate_yaml_prose_subset.py + validate_prose_references.py — replace local find_prose_fields blocks with from ._prose_fields import … (eliminates ~400 lines of duplicated iteration code from the two wrappers)
  • New tests: scripts/hooks/tests/test_prose_field_shape_coverage.py — behavioral coverage probe across all five utils/text shapes plus over-deep nesting and structural-symmetry checks (27 tests total)

Out of scope (deferred):

  • Block-mode flip (warn → block) → the sweep-closing sub-PR (plan § 2.7.1; was § 2.7 with 2.7.2/2.7.3 before this PR folded those in)
  • Identification-questions lint → A5 (already merged-or-merging as PR #244, Conformance sweep A5: validate-identification-questions lint)
  • YAML content migration → the content-migration sub-PRs (B1, B2)
  • Schema changes → A2's territory
  • Validator extensions (lifecycle-stage uniqueness, controls↔components mirror, etc.) → A4
  • Builder updates (sentinel expansion in generators) → A7

Test plan

  • pytest scripts/hooks/tests/test_prose_tokens.py — 122 tokenizer tests + 50 categorical-rule tests = 172 total, all green
  • pytest scripts/hooks/tests/test_validate_yaml_prose_subset.py — wrapper tests pass
  • pytest scripts/hooks/tests/test_validate_prose_references.py — wrapper tests pass
  • pytest scripts/hooks/tests/test_prose_field_shape_coverage.py -v — 27 instances green (post-flip behavioral assertions across 6 shape classes including new TestShapeWrapperDescription and TestShapeOverDeepNesting; structural symmetry replacement)
  • Full repo regression: pytest reports 1957 pass / 6 skip (post-cascade baseline; A1+A3 + shape-coverage)
  • ruff check . && ruff format --check . — clean
  • pre-commit run --all-files — runs warn-only against current corpus; emits ~177 subset diagnostics + ~170 references diagnostics (was ~144 + ~150 pre-shape-fix; +33 / +20 from the new nested-list traversal — warn-only is intentional during the sweep; corpus migration follows in B1/B2)
  • Confirm both linters import find_prose_fields from _prose_fields (no local copies remain in either wrapper); subset_module.find_prose_fields is references_module.find_prose_fields is enforced structurally by TestLinterSymmetry
  • Spot-check that the 5 corpus sites now flow through both linters: risk-map/yaml/risks.yaml:706, :981, :1245 (nested-list <strong> HTML) emit diagnostics; risk-map/yaml/components.yaml:19, risk-map/yaml/risks.yaml:24 (file-level wrapper description) walk under wrapper entry_id = <yaml-stem> and tokenize cleanly (clean prose, no diagnostics)
  • Spot-check the categorical URL rule: tokenizer rejects ftp:// / gopher:// / mailto: / javascript: / data: / tel: (catches authority-bearing schemes via the scheme-with-authority regex, opaque schemes via the named list); accepts **bold** / *italic* / {{idRiskPromptInjection}} / {{ref:cwe-89}}
  • Spot-check sentinel precedence: a URL adjacent to {{ref:cwe-89}} does not swallow the sentinel into the URL token's value
  • Spot-check ADR amendments: ADR-016 D2 prose grammar [A-Za-z0-9_.\-]+ matches the schema regex set; ADR-017 D4 rule 2 prose describes the categorical pair (regex + named list); ADR-016 D4 reframed to defer the tokenizer-level shape to ADR-017 D4 rule 2
  • Confirm bare-camelCase regex distinguishes "id appears as YAML field value" (legit) from "id appears in prose string" (block) — fixture cases prove it; this is the highest-risk false-positive class
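The categorical URL rule spot-checked above can be sketched directly from the two regexes quoted in this PR (the `inline_urls` helper name is illustrative; the real enforcement lives inside the tokenizer, not a standalone function):

```python
# Sketch of ADR-017 D4 rule 2's categorical pair: a scheme-with-authority
# regex plus an opaque-scheme named list, both case-insensitive.
import re

SCHEME_WITH_AUTHORITY = re.compile(r"\b[a-z][a-z0-9+.\-]*://[^\s{]+", re.IGNORECASE)
OPAQUE_SCHEMES = re.compile(r"\b(?:mailto|javascript|data|tel):[^\s{]+", re.IGNORECASE)

def inline_urls(text: str) -> list[str]:
    """Return every out-of-subset inline URL found in a prose string."""
    hits = [m.group(0) for m in SCHEME_WITH_AUTHORITY.finditer(text)]
    hits += [m.group(0) for m in OPAQUE_SCHEMES.finditer(text)]
    return hits

# Authority-bearing schemes are caught categorically, not by enumeration:
assert inline_urls("see gopher://old.example/1") == ["gopher://old.example/1"]
# The [^\s{] stop-at-brace variant preserves sentinel precedence:
assert inline_urls("https://x.test/{{ref:cwe-89}}") == ["https://x.test/"]
# Canonical subset productions carry no URL token:
assert inline_urls("**bold** and {{idRiskPromptInjection}}") == []
```

The second assertion shows why the stop-at-brace character class matters: the URL token ends at the `{` boundary instead of swallowing the adjacent sentinel.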

Co-Authored-By: AI Assistant <ai-assistant@coalitionforsecureai.org>

davidlabianca and others added 5 commits April 30, 2026 21:34
Single source of truth for the YAML prose grammar; consumed by the
forthcoming validate_yaml_prose_subset.py and validate_prose_references.py
linters. 106 tests + 42 fixture pairs lock the contract; partition-of-input
invariant verified across every fixture.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
Both wrappers (ADR-017 D4 / ADR-016 D6) consume the shared
_prose_tokens.py tokenizer; emit `<reason> at <token-snippet>`
diagnostics; --block toggle defers to C2. Schema-driven prose-field
discovery walks nested objects (tourContent). 154 tests; suite 1584/6.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
feat(precommit): wire prose-subset and prose-references hooks (warn-only)

Adds two .pre-commit-config.yaml slots invoking the wrappers from
be1313b on every commit touching the four content YAMLs. Both ship
warn-only (no --block); diagnostics emit to stderr; exit 0 unless C2
flips them to block.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
chore(adr): ADR-016 D2 grammar + categorical inline-URL rule (016/017 D4)

Amend ADR-016 D2 sentinel grammar to [A-Za-z0-9_.\-]+ (the dot supports
canonical sub-technique IDs like AML.T0040.001 used as externalReferences[].id;
entity-prefix forms remain camelCase). Reframe ADR-016 D4 to state the
no-inline-URL rule categorically and defer the tokenizer-level shape to
ADR-017 D4 rule 2. Reshape ADR-017 D4 rule 2 from a 3-form enumeration to a
categorical pair: a RFC-3986 scheme-with-authority regex
(\b[a-z][a-z0-9+.\-]*://\S+) plus an opaque-data named list for colon-only
schemes that lack // (mailto:, javascript:, data:, tel:); the markdown-link
suffix rule is preserved.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
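The amended sentinel grammar in this commit can be sketched as follows. The identifier class `[A-Za-z0-9_.\-]+` is quoted from the ADR text; the composite sentinel regex wrapping it is an illustrative assumption based on the `{{idXxx}}` / `{{ref:identifier}}` forms described in the PR body.

```python
# Sketch of the amended ADR-016 D2 sentinel grammar.
import re

SENTINEL = re.compile(r"\{\{(?:ref:)?(?P<identifier>[A-Za-z0-9_.\-]+)\}\}")

# The added dot admits canonical sub-technique IDs used as externalReferences[].id:
assert SENTINEL.fullmatch("{{ref:AML.T0040.001}}")["identifier"] == "AML.T0040.001"
# Entity-prefix forms remain camelCase and still match:
assert SENTINEL.fullmatch("{{idRiskPromptInjection}}")["identifier"] == "idRiskPromptInjection"
# Whitespace is outside the class, so malformed sentinels are rejected:
assert SENTINEL.fullmatch("{{bad id}}") is None
```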
Extend _prose_tokens.py to enforce the categorical inline-URL rule per
ADR-017 D4 rule 2: a primary RFC-3986 scheme-with-authority regex
\b[a-z][a-z0-9+.\-]*://[^\s{]+ catching every authority-bearing scheme
(http, https, ftp, file, gs, s3, ssh, git+https, etc.) plus an opaque-data
named list (mailto:/javascript:/data:/tel:). Both regexes are case-insensitive
(re.IGNORECASE). The [^\s{]+ stop-at-brace variant preserves sentinel
precedence — URLs adjacent to {{ref:...}} stop at the brace boundary
instead of swallowing the sentinel into the URL token's value.

Refs cosai-oasis#247.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
@davidlabianca davidlabianca self-assigned this May 1, 2026
@davidlabianca davidlabianca added enhancement New feature or request infrastructure labels May 1, 2026
@davidlabianca davidlabianca marked this pull request as ready for review May 1, 2026 20:27
@shrey-bagga
Contributor

Status: ITERATE

Summary:
The new prose linters should cover the full riskmap.schema.json#/definitions/utils/text shape: array<string | array<string>>. Today both wrappers inspect only top-level strings, so valid nested prose strings already present in the corpus are skipped.

Required change:

  • major: Apply the full nested-array traversal fix below. It adds test-first coverage for both wrappers, then updates both find_prose_fields() implementations to yield strings from valid nested prose arrays.

Complete proposed patch:

diff --git a/scripts/hooks/tests/test_validate_yaml_prose_subset.py b/scripts/hooks/tests/test_validate_yaml_prose_subset.py
--- a/scripts/hooks/tests/test_validate_yaml_prose_subset.py
+++ b/scripts/hooks/tests/test_validate_yaml_prose_subset.py
@@
     def test_prose_field_index_reflects_paragraph_position(self, tmp_path):
         r"""
         ProseField.index correctly identifies the paragraph index within the array.
@@
         fields = [f for f in find_prose_fields(yaml_path, schema_dir) if f.field_name == "shortDescription"]
         indices = {f.index for f in fields}
         assert {0, 1, 2} <= indices
 
+    def test_nested_array_prose_items_are_linted(self, tmp_path):
+        r"""
+        Nested strings inside a utils/text prose array are yielded and checked.
+
+        Given: A prose field containing a nested array item with raw HTML
+        When: find_prose_fields and check_prose_field run
+        Then: The nested string produces a subset diagnostic
+        """
+        schema_dir = tmp_path / "schemas"
+        schema_dir.mkdir()
+        prose_ref = {"$ref": "riskmap.schema.json#/definitions/utils/text"}
+        _write_mock_schema(schema_dir, "risk", ["riskAlpha"], extra_props={"shortDescription": prose_ref})
+        yaml_path = _write_yaml(
+            tmp_path,
+            "risks.yaml",
+            {
+                "risks": [
+                    {
+                        "id": "riskAlpha",
+                        "title": "Alpha",
+                        "shortDescription": ["Top prose.", ["Nested <strong>HTML</strong>."]],
+                    }
+                ]
+            },
+        )
+
+        diagnostics = []
+        for field in find_prose_fields(yaml_path, schema_dir):
+            diagnostics.extend(check_prose_field(field))
+
+        assert any("<strong>" in diag.reason for diag in diagnostics)
+
     def test_new_prose_field_in_schema_automatically_discovered(self, tmp_path):
         r"""
         A new prose field added to the schema is discovered without code changes.
diff --git a/scripts/hooks/tests/test_validate_prose_references.py b/scripts/hooks/tests/test_validate_prose_references.py
--- a/scripts/hooks/tests/test_validate_prose_references.py
+++ b/scripts/hooks/tests/test_validate_prose_references.py
@@
     def test_paragraph_index_matches_array_position(self, tmp_path):
         r"""
         ProseField.index matches the paragraph's position in the prose array.
@@
         fields = [f for f in find_prose_fields(yaml_path, schema_dir) if f.field_name == "description"]
         assert {0, 1} <= {f.index for f in fields}
 
+    def test_nested_array_prose_items_are_checked_for_references(self, tmp_path):
+        r"""
+        Nested strings inside a utils/text prose array are yielded and reference-checked.
+
+        Given: A prose field containing a nested array item with a bare risk ID
+        When: find_prose_fields and check_references run
+        Then: The nested string produces a reference diagnostic
+        """
+        schema_dir = tmp_path / "schemas"
+        schema_dir.mkdir()
+        _write_mock_schema(schema_dir, "risk", ["riskAlpha"])
+        yaml_path = _write_yaml(
+            tmp_path,
+            "risks.yaml",
+            {
+                "risks": [
+                    {
+                        "id": "riskAlpha",
+                        "title": "Alpha",
+                        "description": ["Top prose.", ["Nested bare riskBeta."]],
+                    }
+                ]
+            },
+        )
+
+        diagnostics = []
+        for field in find_prose_fields(yaml_path, schema_dir):
+            diagnostics.extend(check_references(field, _make_index(risks=["riskAlpha"])))
+
+        assert any("riskBeta" in diag.reason for diag in diagnostics)
+
     def test_nested_object_prose_fields_discovered(self, tmp_path):
         r"""
         Nested prose fields in object properties (e.g. tourContent.introduced) are
diff --git a/scripts/hooks/precommit/validate_yaml_prose_subset.py b/scripts/hooks/precommit/validate_yaml_prose_subset.py
--- a/scripts/hooks/precommit/validate_yaml_prose_subset.py
+++ b/scripts/hooks/precommit/validate_yaml_prose_subset.py
@@
 def _collect_entries(data: dict, schema: dict) -> Iterator[tuple[str, dict]]:
     """Yield (array_key, entry_dict) pairs from a YAML document.
@@
                 if isinstance(entry, dict):
                     yield key, entry
 
 
+def _iter_prose_strings(field_value: object) -> Iterator[tuple[int, str]]:
+    """Yield strings from the schema's utils/text shape.
+
+    The shared prose schema permits one nesting level:
+    array<string | array<string>>.
+    """
+    if isinstance(field_value, str):
+        yield 0, field_value
+        return
+
+    if not isinstance(field_value, list):
+        return
+
+    for idx, item in enumerate(field_value):
+        if isinstance(item, str):
+            yield idx, item
+        elif isinstance(item, list):
+            for nested_item in item:
+                if isinstance(nested_item, str):
+                    yield idx, nested_item
+
+
 def find_prose_fields(yaml_path: Path, schema_dir: Path) -> Iterator[ProseField]:
     """Yield ProseField objects for every prose array element in a YAML file.
@@
             if field_value is None:
                 continue
-            if isinstance(field_value, list):
-                for idx, item in enumerate(field_value):
-                    if isinstance(item, str):
-                        yield ProseField(
-                            file_path=yaml_path,
-                            entry_id=entry_id,
-                            field_name=field_name,
-                            index=idx,
-                            raw_text=item,
-                            tokens=tokenize(item),
-                        )
-            elif isinstance(field_value, str):
+            for idx, raw_text in _iter_prose_strings(field_value):
                 yield ProseField(
                     file_path=yaml_path,
                     entry_id=entry_id,
                     field_name=field_name,
-                    index=0,
-                    raw_text=field_value,
-                    tokens=tokenize(field_value),
+                    index=idx,
+                    raw_text=raw_text,
+                    tokens=tokenize(raw_text),
                 )
diff --git a/scripts/hooks/precommit/validate_prose_references.py b/scripts/hooks/precommit/validate_prose_references.py
--- a/scripts/hooks/precommit/validate_prose_references.py
+++ b/scripts/hooks/precommit/validate_prose_references.py
@@
 def _collect_entries(data: dict, schema: dict) -> Iterator[tuple[str, dict]]:
     """Yield (array_key, entry_dict) pairs from a YAML document.
@@
                 if isinstance(entry, dict):
                     yield key, entry
 
 
+def _iter_prose_strings(field_value: object) -> Iterator[tuple[int, str]]:
+    """Yield strings from the schema's utils/text shape.
+
+    The shared prose schema permits one nesting level:
+    array<string | array<string>>.
+    """
+    if isinstance(field_value, str):
+        yield 0, field_value
+        return
+
+    if not isinstance(field_value, list):
+        return
+
+    for idx, item in enumerate(field_value):
+        if isinstance(item, str):
+            yield idx, item
+        elif isinstance(item, list):
+            for nested_item in item:
+                if isinstance(nested_item, str):
+                    yield idx, nested_item
+
+
 def find_prose_fields(yaml_path: Path, schema_dir: Path) -> Iterator[ProseField]:
     """Yield ProseField objects for every prose array element in a YAML file.
@@
             if field_value is None:
                 continue
-            if isinstance(field_value, list):
-                for idx, item in enumerate(field_value):
-                    if isinstance(item, str):
-                        yield ProseField(
-                            file_path=yaml_path,
-                            entry_id=entry_id,
-                            field_name=field_name,
-                            index=idx,
-                            raw_text=item,
-                            tokens=tokenize(item),
-                        )
-            elif isinstance(field_value, str):
+            for idx, raw_text in _iter_prose_strings(field_value):
                 yield ProseField(
                     file_path=yaml_path,
                     entry_id=entry_id,
                     field_name=field_name,
-                    index=0,
-                    raw_text=field_value,
-                    tokens=tokenize(field_value),
+                    index=idx,
+                    raw_text=raw_text,
+                    tokens=tokenize(raw_text),
                 )

This keeps the current diagnostic format compatible by using the outer prose index for nested items. A more precise nested location format such as [3][0] would also work, but that would need a coordinated update to the shared ProseField / Diagnostic type annotations and diagnostic-format tests.

Verification I ran on the current PR branch:

  • test_prose_tokens.py, test_validate_yaml_prose_subset.py, and test_validate_prose_references.py: 326 passed

Next steps:

  • testing: land the two regression tests above.
  • swe: land the traversal helper changes above.
  • code-reviewer: rerun the focused linter tests and re-review.


@shrey-bagga shrey-bagga left a comment


proposed some changes

@davidlabianca
Contributor Author

Thanks @shrey-bagga - great catch... working on an iteration and will update the PR when complete.

davidlabianca and others added 3 commits May 2, 2026 20:35
This commit adds 30 behavioral tests in scripts/hooks/tests/test_prose_field_shape_coverage.py characterizing the full set of YAML shapes that riskmap.schema.json#/definitions/utils/text permits (bare string, flat array, pure nested array, mixed, file-level wrapper) and asserting both linters yield ProseField records for inner-list strings and wrapper-level prose. Tests are intentionally red against this commit; the next commit closes the gap.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
feat(precommit): extract _prose_fields.py helper; admit nested + wrapper prose

This commit lifts find_prose_fields, _collect_entries, _infer_schema_name, and ProseField construction from both prose linters into scripts/hooks/precommit/_prose_fields.py (mirrors the _prose_tokens.py pattern); adds a nested-list traversal branch that emits one ProseField per inner string with index=outer_idx and the new nested_index=inner_idx; admits file-level wrapper prose fields whose schema $ref resolves to utils/text using yaml_path.stem as the synthetic entry_id; turns Phase 1 red harness green (30/30) and removes ~400 lines of duplicated iteration code from the wrappers.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
test(precommit): add over-deep nesting probe; structural linter-symmetry assert

This commit adds TestShapeOverDeepNesting documenting the schema's one-level nesting limit as a tested invariant ([[["x"]]] yields zero ProseFields, no recursion past depth 1) and replaces TestLinterSymmetry's six parametrised count tests with a single structural identity assertion (subset_module.find_prose_fields is references_module.find_prose_fields) since both wrappers now re-export the same shared helper from _prose_fields. Drops the unused _SHAPE_FACTORIES list. Net -3 test instances, same coverage signal.

Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
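The one-level nesting invariant this commit pins down can be sketched by restating the traversal helper from the review patch (names mirror the proposed `_iter_prose_strings`; this is a sketch, not the merged source):

```python
# Sketch of the one-level nesting invariant TestShapeOverDeepNesting documents:
# utils/text permits array<string | array<string>>, so depth-2 lists yield nothing.
from collections.abc import Iterator

def iter_prose_strings(field_value: object) -> Iterator[tuple[int, str]]:
    if isinstance(field_value, str):
        yield 0, field_value
        return
    if not isinstance(field_value, list):
        return
    for idx, item in enumerate(field_value):
        if isinstance(item, str):
            yield idx, item
        elif isinstance(item, list):
            for nested_item in item:          # one level only — no recursion
                if isinstance(nested_item, str):
                    yield idx, nested_item

assert list(iter_prose_strings(["top", ["nested"]])) == [(0, "top"), (1, "nested")]
assert list(iter_prose_strings([[["x"]]])) == []  # over-deep input is ignored
```

Nested items reuse the outer index, which is what keeps the diagnostic format additive until `nested_index` is consulted.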
@davidlabianca
Contributor Author

Addressed the iteration request... @shrey-bagga ptal

@davidlabianca davidlabianca requested a review from shrey-bagga May 2, 2026 20:46
