contracts(ccpa): v1.27.0 → v1.28.0 — register FALSIFY-CCPA-017 project_scale_parity_bound (PROPOSED)#1684

Merged: noahgift merged 32 commits into main from m190-ccpa017-v1.28.0 on May 16, 2026

Summary

Adds 1 new falsification gate to `claude-code-parity-apr-v1`: CCPA-017 (project-scale parity bound) at status: PROPOSED. Gate count: 16 → 17.

Why

Phase 4 of the companion repo (paiml/claude-code-parity-apr M180-M188) operationalizes the M159 ProgramBench prior-art (arXiv:2605.03546) into project-scale parity testing. ProgramBench reports 0% fully resolved out of 200 across Claude Opus/Sonnet/Haiku, GPT, and Gemini, so a CCPA-016-style "both pass" assertion is implausible at project scale. CCPA-017 inverts the question: it asserts partial-progress agreement rather than all-or-nothing success.

DUAL-threshold design

  • `partial_agreement >= 0.3` (≥30% of fixtures see both systems pass)
  • `files_jaccard_corpus >= 0.3` (mean per-fixture files-touched Jaccard ≥ 0.3)

The two channels are orthogonal, and both must show agreement for the gate to pass. The 0.3/0.3 values are tentative POC-tier floors; recalibration awaits the first operator-dispatched measurement against the 5-fixture M182 corpus, drawn from real GitHub issues across paiml/decy, paiml/bashrs, and paiml/depyler.
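
To make the two channels concrete, here is a minimal Python sketch of the dual-threshold check. It is not the companion repo's scorer (that lives in `project_scale_diff.rs`); the `FixtureResult` record, its field names, and the choice to treat two empty file sets as identical are assumptions made for illustration. Only the two metric names and the 0.3 floors come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixtureResult:
    """One fixture's outcome for both systems (hypothetical record shape)."""
    a_passed: bool
    b_passed: bool
    a_files: frozenset
    b_files: frozenset


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical (assumption)."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def ccpa_017_check(results, agreement_floor=0.3, jaccard_floor=0.3):
    """Return (passed, partial_agreement, files_jaccard_corpus) for a corpus."""
    if not results:
        raise ValueError("corpus must contain at least one fixture")
    n = len(results)
    # Channel 1: fraction of fixtures where both systems pass.
    partial_agreement = sum(1 for r in results if r.a_passed and r.b_passed) / n
    # Channel 2: mean per-fixture Jaccard over the files each system touched.
    files_jaccard_corpus = sum(jaccard(r.a_files, r.b_files) for r in results) / n
    passed = partial_agreement >= agreement_floor and files_jaccard_corpus >= jaccard_floor
    return passed, partial_agreement, files_jaccard_corpus
```

A corpus where both systems pass every fixture but touch disjoint files still fails under this sketch, which is the point of requiring both channels.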

Why PROPOSED not ACTIVE_RUNTIME

No operator-dispatched bench has produced `evidence/phase-4/project-scale-scores.json` yet. The companion-repo M188 gate test ships 7 active synthetic-fixture assertions (bidirectional sensitivity verified) + 1 `#[ignore]`'d live-evidence test that fires only after operator dispatch. Once the operator runs `bash scripts/phase-4-bench.sh` and the gate passes against real data, a v1.29.0 bump will flip PROPOSED → ACTIVE_RUNTIME.
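
The evidence-gated behaviour can be sketched in Python as follows. This is an illustrative analogue of the `#[ignore]`'d Rust test, not its implementation: the JSON keys `partial_agreement` and `files_jaccard_corpus` are assumed to mirror the threshold names, and the actual schema of `project-scale-scores.json` is not shown in this PR.

```python
import json
from pathlib import Path

# Path taken from the PR description; the file's schema is an assumption.
EVIDENCE = Path("evidence/phase-4/project-scale-scores.json")


def check_live_evidence(path=EVIDENCE, floor=0.3):
    """Return 'skipped' when no evidence exists yet, else 'pass' or 'fail'."""
    path = Path(path)
    if not path.exists():
        # No operator-dispatched bench has run: nothing to assert against.
        return "skipped"
    scores = json.loads(path.read_text())
    ok = (
        scores["partial_agreement"] >= floor
        and scores["files_jaccard_corpus"] >= floor
    )
    return "pass" if ok else "fail"
```

Under this reading, PROPOSED corresponds to the `skipped` state, and the flip to ACTIVE_RUNTIME would follow the first `pass` against real data.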

Companion-repo Phase 4 sequence

Test plan

  • `pv validate contracts/claude-code-parity-apr-v1.yaml` clean (0 errors / 0 warnings)
  • Gate count 16 → 17 with CCPA-017 added to invariants[] summary + full falsification_conditions block
  • New status_history entry recording M180-M188 Phase 4 sequence
  • Companion-side test harness exists at `crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs` (companion-repo M188)
  • CI green

🤖 Generated with Claude Code

…t_scale_parity_bound (PROPOSED)

Adds 1 new falsification gate to the registry: CCPA-017 (project-scale
parity bound). Gate count: 16 → 17.

Phase 4 closes the function-scale → project-scale extrapolation gap
the M159 ProgramBench prior-art (arXiv:2605.03546) flagged. 0%/200
fully-resolved across Claude Opus/Sonnet/Haiku + GPT + Gemini at
project-scale means a CCPA-016-style "both pass" assertion is
implausible. CCPA-017 inverts the question: partial-progress
agreement, not all-or-nothing.

DUAL-threshold design:
- partial_agreement >= 0.3
- files_jaccard_corpus >= 0.3

Both orthogonal channels must show agreement. Tentative 0.3/0.3
POC-tier floors; recalibration awaits first operator-dispatched
measurement.

CCPA-017 enters at status: PROPOSED because no operator-dispatched
measurement has produced evidence/phase-4/project-scale-scores.json
yet. The live-evidence test is #[ignore]'d until that file exists.
Once the operator runs `bash scripts/phase-4-bench.sh` and the gate
passes against real data, a v1.29.0 bump will flip PROPOSED →
ACTIVE_RUNTIME.

Companion-repo Phase 4 sequence:
- M180 (PR #167 squash c7107b9) — phase-4-project-scale-plan.md
- M182 (PR #169 squash b36ceb6) — P4.1 corpus (5 real GitHub issues)
- M184 (PR #171 squash 0f8c451) — P4.2 runner (phase-4-bench.sh)
- M186 (PR #173 squash c115966) — P4.3 scoring (project_scale_diff.rs)
- M188 (PR #175 squash a574655) — P4.4 gate test scaffold

Gate-level statuses post-v1.28.0: 4 ACTIVE_RUNTIME (CCPA-013/014/015/
016) + 1 PROPOSED (CCPA-017) + rest at PLANNED_M*/IN_REVIEW per
their lifecycle phase. No OPEN residue.

Pure additive bump: new gate + new status_history entry. No schema
bump in aprender-contracts/src/schema/. pv validate clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 15, 2026 06:16
noahgift and others added 6 commits May 15, 2026 09:34
Five-whys for the recurring workspace-test "no such file or directory:
'...rcgu.o'" linker failure that has caused 2+ runs to fail on
aprender#1684:

1. Why does workspace-test fail? Linker can't find yoke_derive-*.rcgu.o
   intermediate compile artifacts.
2. Why are they missing? Cargo was killed mid-compile (no .rcgu.o
   written yet) but .rmeta files already existed.
3. Why was cargo killed? GitHub Actions concurrency.cancel-in-progress
   killed the previous run on the same PR when a new commit arrived.
4. Why does this persist? The target dir is per-PR-persistent at
   /mnt/nvme-raid0/targets/aprender-ci/<PR>/, so partial-compile state
   survives across runs.
5. Root cause: cargo's incremental state is not atomic-on-kill.
   cancel-in-progress + persistent per-PR target dir = inconsistent
   rmeta/rcgu.o pairs.

Fix: wrap `cargo test` with detect-and-recover logic:
- On link-failure pattern (rcgu.o file-not-found), `cargo clean`
  and retry once.
- On any other failure, propagate exit code unchanged.

Adds latency only on the damage-recovery path; warm-cache happy
path is unchanged.

Affects only the "Workspace lib tests (25,300+)" step. Other steps
(Compute tests, Integration tests, Build.rs check) reuse the
cleaned target dir from the recovery path if it fires.
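
The detect-and-recover control flow can be sketched in Python as below. The real wrapper lives in the CI workflow and is not shown here; `run_tests` and `clean` are hypothetical callables standing in for `cargo test` and `cargo clean`, injected so the logic can be exercised without cargo.

```python
import re

# Link-failure signature named in the fix above.
RCGU_MISSING = re.compile(r"no such file or directory.*\.rcgu\.o")


def run_with_recovery(run_tests, clean):
    """Run tests; on the rcgu.o link-failure pattern, clean and retry once.

    run_tests() must return (exit_code, combined_output).
    Any other failure is propagated unchanged, with no retry.
    """
    code, output = run_tests()
    if code != 0 and RCGU_MISSING.search(output):
        clean()
        code, _ = run_tests()  # single retry; a second failure is final
    return code
```

The happy path calls `run_tests` exactly once, so no latency is added unless the damage pattern is seen.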

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The initial workspace-test recovery (commit f651296) only matched
the rcgu.o linker-failure pattern. Aprender#1684 run 25907945016
exposed two additional cancel-damage patterns the original regex
missed:

  Pattern 2: cc-rs missing build subdirs
    cargo:warning=Fatal error: can't create
      /workspace/target/debug/build/zstd-sys-X/out/Y.o:
      No such file or directory

  Pattern 3: rustc orphan rmeta
    error: extern location for libloading does not exist:
      /workspace/target/debug/deps/liblibloading-X.rmeta

Both patterns indicate the same root cause (partial-compile state
from a SIGKILL'd previous run) but manifest in different cargo
subsystems depending on how far the cancelled build got.

Updated regex matches ANY of the three known damage patterns:
  no such file or directory.*\.rcgu\.o
  Fatal error: can.t create.*\.o: No such file
  extern location for .* does not exist.*\.rmeta

On match: cargo clean + retry once. Adds latency only on the
damage-recovery path.
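
As an illustration, the three patterns can be combined into one matcher. This Python sketch uses the regexes exactly as listed above; the function name and the line-by-line scan (mimicking `grep`) are assumptions, and the workflow's actual matching code is not shown in this commit message.

```python
import re

# The three damage signatures, copied from the list above.
DAMAGE_PATTERNS = [
    r"no such file or directory.*\.rcgu\.o",           # linker: missing object file
    r"Fatal error: can.t create.*\.o: No such file",   # cc-rs: missing build subdir
    r"extern location for .* does not exist.*\.rmeta", # rustc: orphan rmeta
]
DAMAGE_RE = re.compile("|".join(f"(?:{p})" for p in DAMAGE_PATTERNS))


def is_cancel_damage(log_text):
    """True if any single log line matches one of the known damage patterns."""
    return any(DAMAGE_RE.search(line) for line in log_text.splitlines())
```

Scanning per line keeps `.*` from bridging unrelated messages; an ordinary compile error or test failure matches none of the alternatives and is left alone.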

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift disabled auto-merge May 15, 2026 09:59
noahgift and others added 7 commits May 15, 2026 12:21
…l-damage

cargo clean (commits f651296 + df1c2c767) is insufficient. Observed
in aprender#1684 runs 25907945016 + 25912792473: after recovery
cargo clean, the retry fails within ~52s with:

  error: couldn't create a temp dir: No such file or directory
    (os error 2) at path "/workspace/target/debug/deps/rmetaDAW518"

Root cause: sccache (mounted at /home/noah/data/sccache, shared
across all PRs) caches metadata that references compile artifact
paths the cargo clean just removed. The retry asks sccache for a
cache hit; sccache returns a metadata blob pointing at a path that
no longer exists; rustc's mkdir+write fails.

Two-prong fix:
1. OS-level rm -rf of target dir contents (more thorough than cargo
   clean; doesn't depend on cargo lockfiles being sane), plus
   pre-create target/debug/deps/ to avoid mkdir races on parallel
   rustc invocations.
2. Disable sccache during retry (no RUSTC_WRAPPER, no SCCACHE_DIR,
   no /sccache mount). Pays 30-40min cold-compile cost once to
   guarantee correctness over a fast-but-broken cached state.

Cost: only on the damage-recovery path. Warm-cache happy path
unchanged.
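
A rough Python sketch of the two prongs follows. The function names are hypothetical and the workflow itself presumably does this in shell; the sketch only shows the intent: empty the target dir without removing the directory itself, pre-create `debug/deps`, and drop the sccache variables from the retry environment.

```python
import shutil
from pathlib import Path


def hard_reset_target(target_dir):
    """Delete everything inside target_dir, then pre-create debug/deps."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for entry in target.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    # Pre-create so parallel rustc invocations do not race on mkdir.
    (target / "debug" / "deps").mkdir(parents=True, exist_ok=True)


def retry_env(base_env):
    """Copy of base_env with the sccache-related variables removed."""
    env = dict(base_env)
    for key in ("RUSTC_WRAPPER", "SCCACHE_DIR"):
        env.pop(key, None)
    return env
```

Removing the contents rather than the directory keeps a bind-mounted target path valid, which is one plausible reason the fix is phrased as "target dir contents".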

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…etry-on-failure)

Replaces the retry-on-failure recovery (commits 6a165a6 + 6fcd975)
with a prevention-based pre-flight check that runs ONCE at the start
of workspace-test.

Five-whys (root cause):
1. Why does workspace-test fail with "no such file .rcgu.o" / extern
   location missing / cc-rs can't create .o? Cargo's incremental
   state on the per-PR target dir is inconsistent.
2. Why inconsistent? Prior run was SIGKILL'd mid-compile.
3. Why SIGKILL'd? concurrency.cancel-in-progress (line 22) cancels
   the previous run when a new commit lands (every "Update branch" /
   strict-up-to-date trigger when main moves).
4. Why does this persist? Target dir is bind-mounted from per-PR
   persistent path; partial-compile state survives across runs.
5. Root cause: cargo's incremental state is not atomic-on-kill, so
   persistent shared target dir + cancel-in-progress = damage.

Prevention: at job start, check the previous workflow run on this
branch via gh api. If conclusion was "cancelled", rm -rf the target
dir BEFORE invoking cargo. This addresses the root cause by ensuring
no damaged state enters the cargo invocation in the first place.

This is NOT retry-on-failure (which the operator rejects under the
"flake is not allowed" directive). It is one-time prevention based
on a verifiable signal (prior run conclusion = cancelled).

Cost: ~30-40min cold rebuild ONLY on runs following a cancellation.
Warm-cache happy path (no prior cancellation): zero added latency.
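
The pre-flight decision can be sketched in Python as a pure function over already-fetched run records. This is an assumption-laden illustration, not the workflow's code: it presumes the records come from a single workflow and carry the `head_branch`, `run_number`, and `conclusion` fields of GitHub's workflow-run objects, and it leaves the `gh api` call and the `rm -rf` to the caller.

```python
def previous_run_cancelled(runs, branch, current_run_number):
    """True if the most recent earlier run on `branch` concluded 'cancelled'.

    runs: iterable of dicts shaped like GitHub workflow-run objects.
    """
    earlier = [
        r for r in runs
        if r.get("head_branch") == branch and r["run_number"] < current_run_number
    ]
    if not earlier:
        return False  # first run on this branch: nothing could be damaged
    latest = max(earlier, key=lambda r: r["run_number"])
    return latest.get("conclusion") == "cancelled"
```

Because the decision depends only on the prior run's recorded conclusion, it is made once before cargo starts and never in response to a test failure.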

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit 2df0cd4 into main May 16, 2026
18 of 20 checks passed
noahgift deleted the m190-ccpa017-v1.28.0 branch May 16, 2026 06:05
noahgift added a commit that referenced this pull request May 17, 2026
…te CCPA-008 (#1735)

THREE changes bundled. v1.29.0 is SKIPPED — aprender#1705 (the original
v1.29.0 PR) auto-CLOSED when its base #1684 (v1.28.0) squash-merged
and deleted its feature branch. Companion-repo has been at the v1.29.0
contract YAML since M208 (pin.lock pointed at #1705's feature-branch
HEAD); this v1.30.0 upstream-flip realigns aprender main with companion's
contract content AND adds the M224/M230 deltas the operator-dispatched
Phase 5 Arena bench produced.

CHANGE (1): FALSIFY-CCPA-018 (arena_recovery_rate_bound) added to gate
registry at status: PROPOSED. Gate count: 17 → 18. Asserts
recovery_rate >= 0.5 AND oracle_passed_rate >= 0.3 on the M182 5-fixture
project-scale corpus driven via the live multi-turn Arena harness
(crates/ccpa-arena/, companion-repo M196-M210). Measures AGENT QUALITY
(does the agent recover when bash fails?), distinct from CCPA-016/017
which measure FUNCTIONAL OUTCOME. The asymmetric give-up-fast synthetic
fixture (100% pass BUT zero recovery → FAILS recovery floor) is the
canonical R3 distinguishing test.
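
A minimal Python sketch of the two-floor check, assuming both rates are already computed (how `recovery_rate` is derived from Arena turns is not specified in this message, so the sketch takes it as an input). The function name is hypothetical; the 0.5 and 0.3 floors are the ones stated above.

```python
def ccpa_018_violations(recovery_rate, oracle_passed_rate,
                        recovery_floor=0.5, oracle_floor=0.3):
    """Return the names of the floors that are violated (empty list = gate passes)."""
    violations = []
    if recovery_rate < recovery_floor:
        violations.append("recovery_rate")
    if oracle_passed_rate < oracle_floor:
        violations.append("oracle_passed_rate")
    return violations


# Give-up-fast synthetic fixture: every task passes, but the agent never recovers.
give_up_fast = ccpa_018_violations(recovery_rate=0.0, oracle_passed_rate=1.0)
```

Here `give_up_fast` contains only `"recovery_rate"`: a perfect functional outcome cannot mask the absence of recovery, which is what separates this gate from CCPA-016/017.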

CHANGE (2): FALSIFY-CCPA-008 (parity_score_bound) soft-deprecated to
status: ADVISORY in its summary. Gate STILL enforces aggregate >= 0.95,
per-fixture >= 0.80 on the 30 AUTHORED canonical fixtures — only the
INTERPRETATION flipped. Reframed from SYSTEM-LEVEL parity validation
("apr code matches claude on real engineering tasks") → METER VALIDATION
("the differ + scorer + per-tool equivalence rules correctly recognize
equivalent traces"). The system-level interpretation was empirically
FALSIFIED by the M224 Arena bench (see CHANGE (3)). Foreground parity
claims move to CCPA-016 (function-scale) + CCPA-017 (project-scale,
PROPOSED) + CCPA-018 (Arena recovery-rate, PROPOSED).
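
For reference, the enforcement that CCPA-008 retains can be sketched in Python. The function name is hypothetical, and taking the aggregate as the unweighted mean of per-fixture scores is an assumption; only the 0.95 and 0.80 floors come from the text above.

```python
def ccpa_008_passes(fixture_scores, aggregate_floor=0.95, fixture_floor=0.80):
    """fixture_scores: mapping of fixture name -> parity score in [0, 1]."""
    if not fixture_scores:
        raise ValueError("no fixture scores supplied")
    values = list(fixture_scores.values())
    aggregate = sum(values) / len(values)  # assumed: unweighted mean
    return aggregate >= aggregate_floor and min(values) >= fixture_floor
```

The per-fixture floor matters because a single weak fixture can hide behind a high mean: 29 fixtures at 1.0 and one at 0.75 average above 0.95 yet fail this check.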

CHANGE (3): Records M224 first-operator-dispatched Phase 5 Arena bench
result in status_history. Operator ran scripts/phase-5-arena-bench.sh
against the M182 5-fixture project-scale corpus three times:
  - Run 1 (180s/turn) was noisy (6/10 timeout-killed)
  - Run 2 (600s/turn, 2400s wall) — clean teacher, 4/5 student
    apr-serve errors
  - Run 3 (post-aprender#1712 workaround M228) — 2/5 student cleanly
    completed 20 turns
All three: oracle_passed_rate = 0.0000 for BOTH systems. recovery_rate
= 0 for both. Verdict: evaluate_static_vs_arena(1.0, 0.0, ...) →
FalsifierOutcome::StaticFalsified.

Important nuance preserved in the status_history reason field: 0/5 for
BOTH systems means neither solves these specific tasks under this
harness — Axis 2 closure CEILING, not teacher-vs-student gap. The
Popperian comparator is deterministic; if a cleaner re-run (post
aprender#1712 fix) lifts recovery_rate or oracle_passed_rate, the
verdict revises automatically.

Cross-references in this PR:
- companion-repo M194-M210 = Phase 5 P5.1-P5.5 + coverage closure
- companion-repo M208 (PR #195) = the now-obsolete v1.29.0 mirror
- companion-repo M224 (PR #211) = evidence + headline revision
- companion-repo M226 (PR #213) = aprender#1712 + pkill workaround
- companion-repo M230 (PR #215) = soft-deprecation spec rewrite +
  new docs/specifications/static-fixture-deprecation.md (~140 lines)
- aprender#1712 = apr serve subprocess leak (root cause of the 3
  remaining student driver_errors in Run 3)

Tentative threshold values (CCPA-017: 0.3/0.3; CCPA-018: 0.5/0.3) WILL
be recalibrated after a cleaner re-run post-aprender#1712 upstream fix.

`pv validate contracts/claude-code-parity-apr-v1.yaml` → 0 errors, 0
warnings. Pure additive bump (CCPA-018) + interpretation amendment
(CCPA-008) + history record (M224). No schema change, no existing gate
behavior touched.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…P0-K root cause discovered (#1738)

* contracts(ccpa): v1.28.0 → v1.30.0 — register CCPA-018 + soft-deprecate CCPA-008

* docs(p2c): SPEC §84 P2-C live findings — audit hypothesis falsified, P0-K root cause discovered (PMAT-681 → PMAT-690)

P2-C 50K-step training dispatched on lambda-vector cuda:0 (2026-05-17):

  Multi-source corpus assembly:  17 min (49.6B tokens, 18.3M docs, 22.5K docs/s)
  Pull the-stack-dedup:           ~6 min  (28.6 GB / 144 parquet)
  Pull codeparrot-clean:          ~6 min  (12.8 GB / 54 .json.gz)
  Decompress gz → jsonl:          ~25 s
  Tokenize merge → qwen-v3:       17 min  (5,000 .bin shards, 4× compute-optimal)
  Training (50K steps requested): EARLY_STOP at 27 epochs / 2700 steps

Best val_loss: 4.91 @ epoch 20 — IDENTICAL termination shape to §82
(which had 1.24B-token single-source corpus). The audit's
Chinchilla-data-starvation hypothesis is FALSIFIED.

Corpus comparison:

  §82 qwen-v2:  1.24B tokens, 1 source, val_loss best 4.71
  P2-C qwen-v3: 49.6B tokens, 2 sources, val_loss best 4.91

Roughly 40× more corpus tokens (49.6B vs 1.24B) produced a val_loss
0.2 worse (likely a held-out val set distribution effect, not a real
regression).

Root cause discovered (NEW P0-K, PMAT-690):

  apr convert (HF-safetensors → APR import path) does NOT stamp
  hf_architecture, embedded tokenizer.vocabulary, or tokenizer.merges
  into the imported APR. The §81-§83 5-PR Class 3 cascade
  (P0-D/E/F/G/H/J) wired downstream propagation correctly, but had
  nothing to propagate because the upstream producer was incomplete.
  Live P2-C trained checkpoint re-exhibits all 5 prior failures:
    - apr qa  → "APR missing embedded tokenizer"
    - apr bench → PASS (315.6 tok/s — C-03 arch dims are stamped
      by training even when init didn't have them)
    - apr export → 72 qkv biases leak as passthrough (arch stays Llama)
    - llama-cli → "cannot find tokenizer merges in model file"

Methodology lesson #33 (NEW): upstream metadata defects masquerade as
downstream packaging defects. When a 5th Class 3 fix lands in the same
area, pause and check the upstream producer. Spending ~30 min on an
inventory of the producer is cheaper than a 6th, 7th, or 8th consumer
fix.

Ship %: stays at 79.

Next:
- PMAT-690 P0-K (NEW critical): apr convert stamps hf_architecture +
  tokenizer.vocabulary + tokenizer.merges. Scope ~100 LOC.
- After P0-K: re-import qwen2.5-coder-0.5b → re-train → re-export →
  llama-cli should work end-to-end, transitively closing PMAT-679 P0-H.

Files:
  evidence/p2c-2026-05-17/findings.md
  evidence/p2c-2026-05-17/loss-trajectory.tsv  (27-epoch trace)
  evidence/p2c-2026-05-17/bench-epoch-020.json  (315.6 tok/s)
  evidence/p2c-2026-05-17/epoch-020.metadata.json
  docs/roadmaps/roadmap.yaml  (PMAT-681 → completed, PMAT-690 added)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>