Conformance sweep A3: prose tokenizer + linters + ADR amendments#262
davidlabianca wants to merge 8 commits into cosai-oasis:main from
Conversation
Single source of truth for the YAML prose grammar; consumed by the forthcoming validate_yaml_prose_subset.py and validate_prose_references.py linters. 106 tests + 42 fixture pairs lock the contract; partition-of-input invariant verified across every fixture. Refs cosai-oasis#247. Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
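The partition-of-input invariant named above can be illustrated with a toy two-class tokenizer; this is a sketch, not the real `_prose_tokens.py` grammar, and the pattern and helper name here are illustrative only:

```python
import re

# Toy grammar: {{...}} sentinel tokens vs. runs of plain text (lone "{" kept as text).
TOKEN = re.compile(r"\{\{[A-Za-z0-9_.\-:]+\}\}|[^{]+|\{")

def tokenize(text: str) -> list[str]:
    """Return the token values of `text` in order."""
    return TOKEN.findall(text)

text = "See {{idRiskPromptInjection}} and {{ref:cwe-89}}."
tokens = tokenize(text)
# Partition-of-input: token values concatenate back to the exact input,
# with no character dropped, duplicated, or reordered.
assert "".join(tokens) == text
```

A fixture-pair harness can then assert this round-trip property over every fixture, which is what locks the contract across the 42 pairs mentioned above.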
Both wrappers (ADR-017 D4 / ADR-016 D6) consume the shared _prose_tokens.py tokenizer and emit `<reason> at <token-snippet>` diagnostics; the --block toggle defers to C2. Schema-driven prose-field discovery walks nested objects (tourContent). 154 tests; full suite 1584 pass / 6 skip. Refs cosai-oasis#247. Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…nly) Adds two .pre-commit-config.yaml slots invoking the wrappers from be1313b on every commit touching the four content YAMLs. Both ship warn-only (no --block); diagnostics emit to stderr; exit 0 unless C2 flips them to block. Refs cosai-oasis#247. Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
… D4) Amend ADR-016 D2 sentinel grammar to [A-Za-z0-9_.\-]+ (the dot supports canonical sub-technique IDs like AML.T0040.001 used as externalReferences[].id; entity-prefix forms remain camelCase). Reframe ADR-016 D4 to state the no-inline-URL rule categorically and defer the tokenizer-level shape to ADR-017 D4 rule 2. Reshape ADR-017 D4 rule 2 from a 3-form enumeration to a categorical pair: an RFC-3986 scheme-with-authority regex (\b[a-z][a-z0-9+.\-]*://\S+) plus an opaque-data named list for colon-only schemes that lack // (mailto:, javascript:, data:, tel:); the markdown-link suffix rule is preserved. Refs cosai-oasis#247. Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
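A minimal sketch of the amended sentinel grammar and its effect; the body regex is quoted from the amendment above, while the `{{...}}` framing and the resolution helper are illustrative, not the linter's actual API:

```python
import re

# Amended ADR-016 D2 sentinel-body grammar: the dot admits sub-technique IDs.
SENTINEL_BODY = r"[A-Za-z0-9_.\-]+"
SENTINEL = re.compile(r"\{\{(?:ref:)?(" + SENTINEL_BODY + r")\}\}")

assert re.fullmatch(SENTINEL_BODY, "AML.T0040.001")          # canonical sub-technique ID
assert re.fullmatch(SENTINEL_BODY, "idRiskPromptInjection")  # entity-prefix camelCase form
assert not re.fullmatch(SENTINEL_BODY, "bad id")             # whitespace stays rejected

def unresolved(text: str, known_ids: set[str]) -> list[str]:
    """Illustrative: sentinel bodies that resolve against no known ID."""
    return [m.group(1) for m in SENTINEL.finditer(text) if m.group(1) not in known_ids]

known = {"idRiskPromptInjection", "AML.T0040.001"}
assert unresolved("{{idRiskPromptInjection}} vs {{ref:AML.T0040.001}}", known) == []
```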
Extend _prose_tokens.py to enforce the categorical inline-URL rule per
ADR-017 D4 rule 2: a primary RFC-3986 scheme-with-authority regex
\b[a-z][a-z0-9+.\-]*://[^\s{]+ catching every authority-bearing scheme
(http, https, ftp, file, gs, s3, ssh, git+https, etc.) plus an opaque-data
named list (mailto:/javascript:/data:/tel:). Both regexes are case-insensitive
(re.IGNORECASE). The [^\s{]+ stop-at-brace variant preserves sentinel
precedence — URLs adjacent to {{ref:...}} stop at the brace boundary
instead of swallowing the sentinel into the URL token's value.
Refs cosai-oasis#247.
Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
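The stop-at-brace interaction can be sketched directly from the patterns quoted in this commit message; only the two regexes come from the commit, and the opaque-list tail shape plus the sample URLs are illustrative:

```python
import re

# Primary authority-bearing pattern from the commit, compiled case-insensitively.
AUTHORITY_URL = re.compile(r"\b[a-z][a-z0-9+.\-]*://[^\s{]+", re.IGNORECASE)
# Opaque-data named list (colon-only schemes that lack //); tail shape assumed.
OPAQUE_URL = re.compile(r"\b(?:mailto|javascript|data|tel):[^\s{]+", re.IGNORECASE)

# [^\s{]+ stops at the brace, so an adjacent sentinel is not swallowed
# into the URL token's value.
m = AUTHORITY_URL.search("see https://example.test/page{{ref:cwe-89}} here")
assert m is not None and m.group(0) == "https://example.test/page"

# The scheme set is open-ended: any authority-bearing scheme matches.
assert AUTHORITY_URL.search("git+https://example.test/repo.git")
assert OPAQUE_URL.search("mailto:sec@example.test")
```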
Status: ITERATE
Summary:
Required change:
Complete proposed patch:

diff --git a/scripts/hooks/tests/test_validate_yaml_prose_subset.py b/scripts/hooks/tests/test_validate_yaml_prose_subset.py
--- a/scripts/hooks/tests/test_validate_yaml_prose_subset.py
+++ b/scripts/hooks/tests/test_validate_yaml_prose_subset.py
@@
def test_prose_field_index_reflects_paragraph_position(self, tmp_path):
r"""
ProseField.index correctly identifies the paragraph index within the array.
@@
fields = [f for f in find_prose_fields(yaml_path, schema_dir) if f.field_name == "shortDescription"]
indices = {f.index for f in fields}
assert {0, 1, 2} <= indices
+ def test_nested_array_prose_items_are_linted(self, tmp_path):
+ r"""
+ Nested strings inside a utils/text prose array are yielded and checked.
+
+ Given: A prose field containing a nested array item with raw HTML
+ When: find_prose_fields and check_prose_field run
+ Then: The nested string produces a subset diagnostic
+ """
+ schema_dir = tmp_path / "schemas"
+ schema_dir.mkdir()
+ prose_ref = {"$ref": "riskmap.schema.json#/definitions/utils/text"}
+ _write_mock_schema(schema_dir, "risk", ["riskAlpha"], extra_props={"shortDescription": prose_ref})
+ yaml_path = _write_yaml(
+ tmp_path,
+ "risks.yaml",
+ {
+ "risks": [
+ {
+ "id": "riskAlpha",
+ "title": "Alpha",
+ "shortDescription": ["Top prose.", ["Nested <strong>HTML</strong>."]],
+ }
+ ]
+ },
+ )
+
+ diagnostics = []
+ for field in find_prose_fields(yaml_path, schema_dir):
+ diagnostics.extend(check_prose_field(field))
+
+ assert any("<strong>" in diag.reason for diag in diagnostics)
+
def test_new_prose_field_in_schema_automatically_discovered(self, tmp_path):
r"""
A new prose field added to the schema is discovered without code changes.
diff --git a/scripts/hooks/tests/test_validate_prose_references.py b/scripts/hooks/tests/test_validate_prose_references.py
--- a/scripts/hooks/tests/test_validate_prose_references.py
+++ b/scripts/hooks/tests/test_validate_prose_references.py
@@
def test_paragraph_index_matches_array_position(self, tmp_path):
r"""
ProseField.index matches the paragraph's position in the prose array.
@@
fields = [f for f in find_prose_fields(yaml_path, schema_dir) if f.field_name == "description"]
assert {0, 1} <= {f.index for f in fields}
+ def test_nested_array_prose_items_are_checked_for_references(self, tmp_path):
+ r"""
+ Nested strings inside a utils/text prose array are yielded and reference-checked.
+
+ Given: A prose field containing a nested array item with a bare risk ID
+ When: find_prose_fields and check_references run
+ Then: The nested string produces a reference diagnostic
+ """
+ schema_dir = tmp_path / "schemas"
+ schema_dir.mkdir()
+ _write_mock_schema(schema_dir, "risk", ["riskAlpha"])
+ yaml_path = _write_yaml(
+ tmp_path,
+ "risks.yaml",
+ {
+ "risks": [
+ {
+ "id": "riskAlpha",
+ "title": "Alpha",
+ "description": ["Top prose.", ["Nested bare riskBeta."]],
+ }
+ ]
+ },
+ )
+
+ diagnostics = []
+ for field in find_prose_fields(yaml_path, schema_dir):
+ diagnostics.extend(check_references(field, _make_index(risks=["riskAlpha"])))
+
+ assert any("riskBeta" in diag.reason for diag in diagnostics)
+
def test_nested_object_prose_fields_discovered(self, tmp_path):
r"""
Nested prose fields in object properties (e.g. tourContent.introduced) are
diff --git a/scripts/hooks/precommit/validate_yaml_prose_subset.py b/scripts/hooks/precommit/validate_yaml_prose_subset.py
--- a/scripts/hooks/precommit/validate_yaml_prose_subset.py
+++ b/scripts/hooks/precommit/validate_yaml_prose_subset.py
@@
def _collect_entries(data: dict, schema: dict) -> Iterator[tuple[str, dict]]:
"""Yield (array_key, entry_dict) pairs from a YAML document.
@@
if isinstance(entry, dict):
yield key, entry
+def _iter_prose_strings(field_value: object) -> Iterator[tuple[int, str]]:
+ """Yield strings from the schema's utils/text shape.
+
+ The shared prose schema permits one nesting level:
+ array<string | array<string>>.
+ """
+ if isinstance(field_value, str):
+ yield 0, field_value
+ return
+
+ if not isinstance(field_value, list):
+ return
+
+ for idx, item in enumerate(field_value):
+ if isinstance(item, str):
+ yield idx, item
+ elif isinstance(item, list):
+ for nested_item in item:
+ if isinstance(nested_item, str):
+ yield idx, nested_item
+
+
def find_prose_fields(yaml_path: Path, schema_dir: Path) -> Iterator[ProseField]:
"""Yield ProseField objects for every prose array element in a YAML file.
@@
if field_value is None:
continue
- if isinstance(field_value, list):
- for idx, item in enumerate(field_value):
- if isinstance(item, str):
- yield ProseField(
- file_path=yaml_path,
- entry_id=entry_id,
- field_name=field_name,
- index=idx,
- raw_text=item,
- tokens=tokenize(item),
- )
- elif isinstance(field_value, str):
+ for idx, raw_text in _iter_prose_strings(field_value):
yield ProseField(
file_path=yaml_path,
entry_id=entry_id,
field_name=field_name,
- index=0,
- raw_text=field_value,
- tokens=tokenize(field_value),
+ index=idx,
+ raw_text=raw_text,
+ tokens=tokenize(raw_text),
)
diff --git a/scripts/hooks/precommit/validate_prose_references.py b/scripts/hooks/precommit/validate_prose_references.py
--- a/scripts/hooks/precommit/validate_prose_references.py
+++ b/scripts/hooks/precommit/validate_prose_references.py
@@
def _collect_entries(data: dict, schema: dict) -> Iterator[tuple[str, dict]]:
"""Yield (array_key, entry_dict) pairs from a YAML document.
@@
if isinstance(entry, dict):
yield key, entry
+def _iter_prose_strings(field_value: object) -> Iterator[tuple[int, str]]:
+ """Yield strings from the schema's utils/text shape.
+
+ The shared prose schema permits one nesting level:
+ array<string | array<string>>.
+ """
+ if isinstance(field_value, str):
+ yield 0, field_value
+ return
+
+ if not isinstance(field_value, list):
+ return
+
+ for idx, item in enumerate(field_value):
+ if isinstance(item, str):
+ yield idx, item
+ elif isinstance(item, list):
+ for nested_item in item:
+ if isinstance(nested_item, str):
+ yield idx, nested_item
+
+
def find_prose_fields(yaml_path: Path, schema_dir: Path) -> Iterator[ProseField]:
"""Yield ProseField objects for every prose array element in a YAML file.
@@
if field_value is None:
continue
- if isinstance(field_value, list):
- for idx, item in enumerate(field_value):
- if isinstance(item, str):
- yield ProseField(
- file_path=yaml_path,
- entry_id=entry_id,
- field_name=field_name,
- index=idx,
- raw_text=item,
- tokens=tokenize(item),
- )
- elif isinstance(field_value, str):
+ for idx, raw_text in _iter_prose_strings(field_value):
yield ProseField(
file_path=yaml_path,
entry_id=entry_id,
field_name=field_name,
- index=0,
- raw_text=field_value,
- tokens=tokenize(field_value),
+ index=idx,
+ raw_text=raw_text,
+ tokens=tokenize(raw_text),
)

This keeps the current diagnostic format compatible by using the outer prose index for nested items. A more precise nested location format such as …

Verification I ran on the current PR branch:
Next steps:
shrey-bagga
left a comment
proposed some changes
Thanks @shrey-bagga - great catch... working on an iteration and will update the PR when complete.
This commit adds 30 behavioral tests in scripts/hooks/tests/test_prose_field_shape_coverage.py characterizing the full set of YAML shapes that riskmap.schema.json#/definitions/utils/text permits (bare string, flat array, pure nested array, mixed, file-level wrapper) and asserting both linters yield ProseField records for inner-list strings and wrapper-level prose. Tests are intentionally red against this commit; the next commit closes the gap. Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…per prose This commit lifts find_prose_fields, _collect_entries, _infer_schema_name, and ProseField construction from both prose linters into scripts/hooks/precommit/_prose_fields.py (mirrors the _prose_tokens.py pattern); adds a nested-list traversal branch that emits one ProseField per inner string with index=outer_idx and the new nested_index=inner_idx; admits file-level wrapper prose fields whose schema $ref resolves to utils/text using yaml_path.stem as the synthetic entry_id; turns Phase 1 red harness green (30/30) and removes ~400 lines of duplicated iteration code from the wrappers. Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
…try assert This commit adds TestShapeOverDeepNesting documenting the schema's one-level nesting limit as a tested invariant ([[["x"]]] yields zero ProseFields, no recursion past depth 1) and replaces TestLinterSymmetry's six parametrised count tests with a single structural identity assertion (subset_module.find_prose_fields is references_module.find_prose_fields) since both wrappers now re-export the same shared helper from _prose_fields. Drops the unused _SHAPE_FACTORIES list. Net -3 test instances, same coverage signal. Co-authored-by: AI Assistant <ai-assistant@coalitionforsecureai.org>
Addressed the iteration request... @shrey-bagga ptal
Conformance sweep A3: prose tokenizer + linters + ADR amendments
Closes #247 (see also scope addendum comment)
Folds in the prose-shape coverage fix triggered by the #262 review (shrey-bagga, ITERATE) — investigation, decisions, and verification harness confirmed the issue and executed the fix.
Summary
- Shared tokenizer (`scripts/hooks/precommit/_prose_tokens.py`) per ADR-017 D5 — single source of truth for the YAML prose grammar with a partition-of-input invariant locked across 42 fixture pairs + 122 unit tests.
- `validate_yaml_prose_subset.py` per ADR-017 D4 — accepts the canonical authoring subset (`**bold**`, `*italic*`/`_italic_`, sentinels) and rejects out-of-subset productions (inline URLs, raw HTML tags, markdown lists/headings/code/images/blockquotes/tables, folded-bullet drift heuristic per ADR-020 D4).
- `validate_prose_references.py` per ADR-016 D6 — resolves `{{idXxx}}` and `{{ref:identifier}}` sentinels against schema enums + `externalReferences[].id`; rejects bare-camelCase IDs in prose; rejects raw inline URLs.
- Both wired into `.pre-commit-config.yaml` as warn-only hooks; emit `<reason> at <token-snippet>` diagnostics; `--block` toggle deferred to the sweep-closing commit.
- Prose-shape coverage fix: both linters now walk every shape permitted by `riskmap.schema.json#/definitions/utils/text` (`array<string | array<string>>`). Lifts `find_prose_fields` / `_collect_entries` / `_infer_schema_name` / `ProseField` into a new shared `scripts/hooks/precommit/_prose_fields.py` (mirrors the `_prose_tokens.py` pattern), adds nested-list traversal, and admits file-level wrapper `description` fields that live outside the entity array. A new `nested_index: int | None` field on `ProseField` keeps the existing diagnostic format additive. Verified by 27 behavioral tests in `scripts/hooks/tests/test_prose_field_shape_coverage.py` (TDD red→green) covering 5 corpus sites today (3 nested-list in `risks.yaml`, 2 file-level wrapper in `components.yaml` + `risks.yaml`); duplication across the two wrappers reduced by ~400 lines.

Commit-by-commit
1. `e9f5301` feat(precommit): add shared `_prose_tokens.py` tokenizer (ADR-017 D5)
2. `af882ec` feat(precommit): add prose-subset + prose-references warn-only linters
3. `467d5f4` feat(precommit): wire prose-subset and prose-references hooks (warn-only) into `.pre-commit-config.yaml`
4. `697c76f` chore(adr): ADR-016 D2 grammar + categorical inline-URL rule (016/017 D4)
5. `0c4d4fa` feat(precommit): extend tokenizer to enforce categorical inline-URL rule
6. `e1fe955` test(precommit): add prose-field shape-coverage probe (TDD red)
7. `72ad4ae` feat(precommit): extract `_prose_fields.py` helper; admit nested + wrapper prose
8. `f2abe12` test(precommit): add over-deep nesting probe; structural linter-symmetry assert

Architect-tagged vs. SWE-tagged commits
Commit 4 (`chore(adr):` prefix) is architect-authored ADR text edits to `docs/adr/016-reference-strategy.md` and `docs/adr/017-yaml-prose-authoring-subset.md`. Commits 1, 2, 3, and 5 (`feat(precommit):` prefix) are SWE-authored implementation. The two work-types are intentionally distinct commits to keep agent attribution legible. The overall PR description (this body) leads with the expanded scope so reviewer expectations are set up front.

Commits 6, 7, and 8 land the prose-shape coverage fix in TDD red→green→refactor order: commit 6 (testing) authors a behavioral test file that is intentionally RED against HEAD-of-commit-6 (no source change yet); commit 7 (SWE) lifts the duplicated `find_prose_fields` block into `_prose_fields.py`, adds the nested-list branch and the file-level wrapper branch, and re-imports the helper into both wrappers — the test file becomes green at HEAD-of-commit-7; commit 8 (testing) adds an over-deep nesting probe and replaces the parametrised linter-symmetry suite with a single structural identity assert (since both wrappers now share the helper, symmetry is structural, not behavioral). The split makes the gap measurable as a real before/after diff and keeps testing/SWE attribution legible across the fix.

Scope boundaries
In scope (this PR):
- `scripts/hooks/precommit/_prose_tokens.py` + `_linter_types.py` (shared tokenizer + types)
- `scripts/hooks/precommit/validate_yaml_prose_subset.py` + `validate_prose_references.py` (warn-only linters)
- `.pre-commit-config.yaml` (2 new hook slots)
- `docs/adr/016-reference-strategy.md` D2 + D4 (sentinel grammar; categorical URL rule reframing)
- `docs/adr/017-yaml-prose-authoring-subset.md` D4 rule 2 (3-form enumeration → categorical regex + opaque named list)
- `scripts/hooks/tests/test_prose_tokens.py`, `test_validate_yaml_prose_subset.py`, `test_validate_prose_references.py` + `fixtures/prose_subset/` (42 pairs) + `fixtures/wrapper_linters/` (23 pairs)
- `scripts/hooks/precommit/_prose_fields.py` — shared helper hosting `ProseField`, `find_prose_fields`, `_collect_entries`, `_infer_schema_name`, `_iter_prose_strings`, `_find_wrapper_prose_field_names_in_schema`
- `validate_yaml_prose_subset.py` + `validate_prose_references.py` — replace local `find_prose_fields` blocks with `from ._prose_fields import …` (eliminates ~400 lines of duplicated iteration code from the two wrappers)
- `scripts/hooks/tests/test_prose_field_shape_coverage.py` — behavioral coverage probe across all five `utils/text` shapes plus over-deep nesting and structural-symmetry checks (27 tests total)

Out of scope (deferred):
Test plan
- `pytest scripts/hooks/tests/test_prose_tokens.py` — 122 tokenizer tests + 50 categorical-rule tests = 172 total, all green
- `pytest scripts/hooks/tests/test_validate_yaml_prose_subset.py` — wrapper tests pass
- `pytest scripts/hooks/tests/test_validate_prose_references.py` — wrapper tests pass
- `pytest scripts/hooks/tests/test_prose_field_shape_coverage.py -v` — 27 instances green (post-flip behavioral assertions across 6 shape classes including new `TestShapeWrapperDescription` and `TestShapeOverDeepNesting`; structural symmetry replacement)
- Full `pytest` reports 1957 pass / 6 skip (post-cascade baseline; A1+A3 + shape-coverage)
- `ruff check . && ruff format --check .` — clean
- `pre-commit run --all-files` — runs warn-only against current corpus; emits ~177 subset diagnostics + ~170 references diagnostics (was ~144 + ~150 pre-shape-fix; +33 / +20 from the new nested-list traversal — warn-only is intentional during the sweep; corpus migration follows in B1/B2)
- Both wrappers import `find_prose_fields` from `_prose_fields` (no local copies remain in either wrapper); `subset_module.find_prose_fields is references_module.find_prose_fields` is enforced structurally by `TestLinterSymmetry`
- `risk-map/yaml/risks.yaml:706, :981, :1245` (nested-list `<strong>` HTML) emit diagnostics; `risk-map/yaml/components.yaml:19`, `risk-map/yaml/risks.yaml:24` (file-level wrapper `description`) walk under wrapper `entry_id = <yaml-stem>` and tokenize cleanly (clean prose, no diagnostics)
- Tokenizer rejects `ftp://` / `gopher://` / `mailto:` / `javascript:` / `data:` / `tel:` (catches authority-bearing schemes via the scheme-with-authority regex, opaque schemes via the named list); accepts `**bold**` / `*italic*` / `{{idRiskPromptInjection}}` / `{{ref:cwe-89}}`
- A URL adjacent to `{{ref:cwe-89}}` does not swallow the sentinel into the URL token's `value`
- Sentinel grammar `[A-Za-z0-9_.\-]+` matches the schema regex set; ADR-017 D4 rule 2 prose describes the categorical pair (regex + named list); ADR-016 D4 reframed to defer the tokenizer-level shape to ADR-017 D4 rule 2

Co-Authored-By: AI Assistant <ai-assistant@coalitionforsecureai.org>