docs(spec): SHIP-TWO-001 §62 — §61.8 Branch A fully closed across 3 PRs; 80% pass@1 on 10-problem HumanEval sample by noahgift · Pull Request #1618 · paiml/aprender

noahgift · 2026-05-11T08:21:52Z

Summary

Records the closure of §61.8 Branch A (APR + ChatML "\ns\ns" degenerate output bug) across three same-class PRs, plus the LIVE 10-problem HumanEval empirical signal for SHIP-005.

Branch A Closure (3 PRs, same defect class, 3 call sites)

| PR | Surface | Fix | Evidence |
|---|---|---|---|
| #1615 | `output_verification.rs::golden_output_apr` | Reroute through `realizar::run_inference + with_input_tokens` | SHIP-006 LIVE-discharged (12/12 gates) |
| #1616 | `eval/inference.rs::run_humaneval_inference` | Same reroute pattern | HumanEval/0 → canonical solution emitted |
| #1617 | `eval/inference.rs::align_continuation_indent` (NEW) | Dedent over-indented body; stop at 0-indent post-amble | HumanEval/0 1/1 PASS post-fix |
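
The post-processing step in #1617 is described only in prose here, so the following is a minimal Python sketch of the idea rather than the Rust implementation. It assumes that the excess indent is measured from the first non-empty line against a 4-space function body, and that "stop" means lines from the first 0-indent non-empty line onward are left untouched. The function name and the `body_indent` parameter are hypothetical.

```python
def align_continuation_indent_sketch(completion: str, body_indent: int = 4) -> str:
    """Dedent an over-indented body; leave the 0-indent post-amble untouched.

    Illustrative only: the excess-indent rule below is an assumption,
    not the logic of the Rust function named in the PR.
    """
    lines = completion.split("\n")

    # Assumption: the first non-empty line sets the body's actual indent.
    first = next((ln for ln in lines if ln.strip()), "")
    actual = len(first) - len(first.lstrip(" "))
    excess = max(actual - body_indent, 0)

    out = []
    in_body = True
    for ln in lines:
        # The first non-empty line at column 0 starts the post-amble.
        if in_body and ln.strip() and not ln.startswith(" "):
            in_body = False
        if in_body and ln.strip():
            # Strip at most `excess` leading spaces, never other characters.
            leading = len(ln) - len(ln.lstrip(" "))
            out.append(ln[min(excess, leading):])
        else:
            out.append(ln)
    return "\n".join(out)


raw = "        x = 1\n        if x:\n            return x\n\nprint('done')"
print(align_continuation_indent_sketch(raw))
```

With an 8-space body, the sketch removes 4 spaces from each body line and keeps the trailing `print('done')` line as is. A completion that is already at 4 spaces passes through unchanged.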

LIVE Evidence (2026-05-11, lambda-vector RTX 4090)

10-problem HumanEval sample on canonical 7B APR teacher (sha256 a394dd28…):

Result: 8/10 = 80% pass@1
Per-problem: 0/1/3/4/5/7/8/9 PASS; 2/6 FAIL
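95% binomial CI: [44%, 97%]. The 86% nominal floor lies inside this interval, so the 10-problem sample is statistically consistent with the floor but too small to confirm it.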
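
The PR does not say which interval method produced these bounds. Assuming an exact (Clopper-Pearson) interval, the quoted figures can be reproduced with the standard library alone; the function names below are illustrative.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target: float, iters: int = 100) -> float:
    """Bisect on [0, 1] for f(p) == target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X > k) equals 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


lo, hi = clopper_pearson(8, 10)
print(f"8/10 -> [{lo:.3f}, {hi:.3f}]")  # roughly [0.444, 0.975]
```

The interval spans more than 50 percentage points at N=10, which is why the sample alone cannot separate 80% from an 86% floor.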

Full 164-Problem Run

Dispatched in background 2026-05-11 (~5h CPU wall; pre-authorized per feedback_compute_pre_authorized.md 48h ceiling). When complete:

pass@1 ≥ 84.80% → SHIP-005 LIVE-discharged; MODEL-1 ship % moves from 94% to 95%
pass@1 < 84.80% → teacher quality regression hypothesis surfaces
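
As a worked check on the gate, the sketch below converts the 84.80% threshold into a pass count for 164 problems. It assumes pass@1 is simply passed / total (one sample per problem, as in the 10-problem run) and uses integer basis points to avoid floating-point edge cases. The function names and return strings are illustrative.

```python
def min_passes(n: int = 164, floor_bp: int = 8480) -> int:
    """Smallest k such that k / n >= floor, with the floor in basis points."""
    return -(-n * floor_bp // 10_000)  # integer ceiling division


def gate(passed: int, n: int = 164, floor_bp: int = 8480) -> str:
    """Apply the two-branch decision rule from the PR description."""
    if passed * 10_000 >= floor_bp * n:
        return "SHIP-005 LIVE-discharged"
    return "teacher quality regression hypothesis"


print(min_passes())  # 140
print(gate(140))     # SHIP-005 LIVE-discharged
print(gate(139))     # teacher quality regression hypothesis
```

Under this reading, 140 of 164 (85.37%) is the smallest count that clears the floor, while 139 of 164 (84.76%) falls just short. The 86.59% reported in the closing comment would correspond to 142 of 164 if it was computed over all 164 problems.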

Methodology Lesson #10 (NEW)

Branch closure is a multi-PR cascade across distinct call sites. Prior lessons #6-#9 covered single-bug cascades; #10 generalises: a "single bug class" may require multi-PR surgical fixes across multiple call sites that share the same root cause.

Changes

- docs/specifications/aprender-train/ship-two-models-spec.md:
  - Atomic next action banner: v3.06.0 → v3.08.0
  - New §62 sub-section ABOVE §61 (newest-first), with 7 sub-sub-sections
- evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
  - humaneval-10-result.json (raw apr eval --json output)
  - findings.json (structured 3-PR cascade record)

Ship-% Movement

MODEL-1 ship %: stays at 94% pending full 164-problem run
MODEL-2 ship %: unchanged at 57%

🤖 Generated with Claude Code

…1 on 10-problem HumanEval sample (PMAT-CODE-SHIP-TWO-SECTION-62)

Records the closure of §61.8 Branch A (APR + ChatML "\ns\ns" degenerate output bug) across THREE same-class PRs, plus the LIVE 10-problem HumanEval empirical signal for SHIP-005.

Branch A closure pattern (3 PRs, same defect class, 3 call sites):
- PR #1615: apr-cli/src/commands/output_verification.rs::golden_output_apr
  Reroute through realizar::run_inference + with_input_tokens. Discharge: SHIP-006 LIVE (apr qa 12/12 gates).
- PR #1616: apr-cli/src/commands/eval/inference.rs::run_humaneval_inference
  Reroute through same path. Model emits canonical solution structure but Python test FAILs on whitespace artifact.
- PR #1617: apr-cli/src/commands/eval/inference.rs::align_continuation_indent
  NEW post-processing fn: dedent over-indented body by N spaces; stop at first 0-indent non-empty line (preserve post-amble). Discharge: HumanEval/0 1/1 PASS post-fix.

LIVE 10-problem HumanEval sample (2026-05-11, lambda-vector RTX 4090):
- apr eval <canonical 7B APR teacher> --task humaneval --data <10> --samples 1 --temperature 0.0
- Result: passed = 8/10 = 80% pass@1
- Per-problem: HumanEval/0/1/3/4/5/7/8/9 PASS; /2 /6 FAIL
- 95% binomial CI on 8/10: [44%, 97%], within statistical noise of 86% nominal SHIP-005 floor
- Full 164-problem run dispatched in background (`/tmp/he-164-result.json`, ~5h CPU wall, pre-authorized per feedback_compute_pre_authorized.md 48h ceiling)

Five-Whys for the §62 amendment:
1. Why §62 now and not wait for 164 result? The 3-PR closure is a substantial cascade record that deserves spec-level permanence; the 164 result is a separate "ship-%-flip" event that gets its own follow-up amendment when it lands.
2. Why 3 PRs for one bug class? The legacy AprTransformer path was wired in 3 distinct call sites (golden_output, humaneval, indent-residual post-processing). Each needs its own surgical reroute / post-process; fixing one doesn't fix the others.
3. Why is methodology lesson #10 worth recording? Prior methodology lessons (#6-#9) covered single-bug cascades. #10 generalises: a "single bug class" may need multi-PR surgical fixes when it manifests across multiple call sites.
4. Why is a 95% binomial CI enough confidence to dispatch the full 164? The contract floor (84.80% effective) lies well within the [44%, 97%] CI of the 10-problem sample's 80%. The full dispatch raises N=10 to N=164, which gives a much tighter CI.
5. Why bump spec v3.07.0 → v3.08.0 now? §62 is a substantive record of 3-PR cascade closure + new empirical evidence; it warrants a minor version bump.

Changes (1 spec file + 1 evidence directory):
- docs/specifications/aprender-train/ship-two-models-spec.md:
  - Atomic next action banner: v3.06.0 → v3.08.0 (skips v3.07.0, which was claimed by PR #1611 in queue; once that lands, rebase to renumber if needed)
  - New §62 sub-section ABOVE §61 (newest-first ordering), with 7 sub-sub-sections: 62.1 3-PR cascade table, 62.2 10-problem LIVE evidence, 62.3 sample-vs-floor analysis, 62.4 164-run dispatch, 62.5 methodology lesson #10, 62.6 ship-% movement, 62.7 what §62 is NOT
- evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
  - humaneval-10-result.json (raw apr eval --json output)
  - findings.json (structured 3-PR cascade record + per-problem pass results + dispatch metadata)

Validation:
- Section format consistent with §61 (newest-first, dated, sub-sections numbered §62.X)
- All 3 cascade PRs referenced explicitly
- Empirical evidence reproducible via captured JSON

Spec movement:
- v3.06.0 → v3.08.0
- MODEL-1 ship %: stays at 94% pending 164-run completion
- MODEL-2 ship %: unchanged at 57%

Refs:
- evidence/section-62-branch-a-closure-2026-05-11/findings.json (LIVE evidence)
- PR #1615 (SHIP-006 fix + LIVE discharge: golden_output_apr)
- PR #1616 (HumanEval inference path fix)
- PR #1617 (HumanEval indent residual fix: align_continuation_indent)
- SPEC-SHIP-TWO-001 §61.8 (Branch A vs Branch B taxonomy)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- feedback_compute_pre_authorized.md (lambda-labs 48h ceiling)

Closes task #35 PMAT-CODE-SHIP-TWO-SECTION-62.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-12T15:30:51Z

Closing as superseded — the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.

noahgift enabled auto-merge (squash) May 11, 2026 08:21

noahgift force-pushed the docs/ship-two-spec-section-62-branch-a-closure branch from 7cfb225 to 062b720 May 11, 2026 14:38

Merge branch 'main' into docs/ship-two-spec-section-62-branch-a-closure

d2e0dbe

noahgift closed this May 12, 2026

auto-merge was automatically disabled May 12, 2026 15:30
Pull request was closed

noahgift wants to merge 2 commits into main from docs/ship-two-spec-section-62-branch-a-closure