docs(spec): SHIP-TWO-001 §62 — §61.8 Branch A fully closed across 3 PRs; 80% pass@1 on 10-problem HumanEval sample#1618

Closed
noahgift wants to merge 2 commits into main from docs/ship-two-spec-section-62-branch-a-closure

@noahgift

Summary

Records the closure of §61.8 Branch A (APR + ChatML "\ns\ns" degenerate output bug) across three same-class PRs, plus the LIVE 10-problem HumanEval empirical signal for SHIP-005.

Branch A Closure (3 PRs, same defect class, 3 call sites)

| PR | Surface | Fix | Evidence |
| --- | --- | --- | --- |
| #1615 | output_verification.rs::golden_output_apr | Reroute through realizar::run_inference + with_input_tokens | SHIP-006 LIVE-discharged (12/12 gates) |
| #1616 | eval/inference.rs::run_humaneval_inference | Same reroute pattern | HumanEval/0 → canonical solution emitted |
| #1617 | eval/inference.rs::align_continuation_indent (NEW) | Dedent over-indented body; stop at 0-indent post-amble | HumanEval/0 1/1 PASS post-fix |

LIVE Evidence (2026-05-11, lambda-vector RTX 4090)

10-problem HumanEval sample on canonical 7B APR teacher (sha256 a394dd28…):

  • Result: 8/10 = 80% pass@1
  • Per-problem: 0/1/3/4/5/7/8/9 PASS; 2/6 FAIL
  • 95% binomial CI [44%, 97%]: the 86% nominal floor lies inside this interval, so a 10-problem sample can neither confirm nor rule out meeting it
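
The interval quoted above can be reproduced with a short sketch. The PR does not say which CI method was used; this assumes the exact (Clopper-Pearson) interval, which matches the quoted bounds for 8/10. The function names are illustrative and not part of `apr`.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided binomial CI for k successes out of n (assumed method)."""

    def solve(g) -> float:
        # g is increasing in p, negative at p=0 and positive at p=1:
        # plain bisection finds the root.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else solve(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: P(X <= k | p) = alpha / 2
    upper = 1.0 if k == n else solve(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


lower, upper = clopper_pearson(8, 10)
print(f"95% CI for 8/10: [{lower:.1%}, {upper:.1%}]")
print("86% floor inside interval:", lower < 0.86 < upper)
```

Under this assumption the bounds come out at roughly 44.4% and 97.5%, so both the 86% nominal floor and the 84.80% effective floor sit inside the interval.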

Full 164-Problem Run

Dispatched in background 2026-05-11 (~5h CPU wall; pre-authorized per feedback_compute_pre_authorized.md 48h ceiling). When complete:

  • pass@1 ≥ 84.80% → SHIP-005 LIVE-discharged → MODEL-1 ship % 94% → 95%
  • pass@1 < 84.80% → teacher quality regression hypothesis surfaces
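
As a worked check of the decision rule above, the sketch below converts the 84.80% threshold into a pass count for 164 problems. The helper names and the outcome strings are illustrative paraphrases of the two bullets, not identifiers from the spec or from `apr`.

```python
from fractions import Fraction
from math import ceil

FLOOR = Fraction("84.80") / 100  # effective SHIP-005 floor quoted above
N_PROBLEMS = 164                 # full HumanEval run


def min_passes(floor: Fraction, n: int) -> int:
    """Smallest number of passing problems with passed / n >= floor."""
    return ceil(floor * n)


def ship_005_outcome(passed: int, n: int = N_PROBLEMS, floor: Fraction = FLOOR) -> str:
    """Apply the two-branch rule; labels are paraphrases, not spec strings."""
    if Fraction(passed, n) >= floor:
        return "LIVE-discharged"
    return "teacher quality regression hypothesis"


needed = min_passes(FLOOR, N_PROBLEMS)
print(f"passes needed: {needed}/{N_PROBLEMS} = {needed / N_PROBLEMS:.2%}")
print("139 passes:", ship_005_outcome(139))
print("140 passes:", ship_005_outcome(140))
```

Exact rational arithmetic avoids floating-point edge cases at the threshold: 139/164 falls just short of 84.80%, so 140 passes is the smallest count that clears the floor.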

Methodology Lesson #10 (NEW)

Branch closure is a multi-PR cascade across distinct call sites. Prior lessons #6-#9 covered single-bug cascades; #10 generalises: a "single bug class" may require multi-PR surgical fixes across multiple call sites that share the same root cause.

Changes

  • docs/specifications/aprender-train/ship-two-models-spec.md:
    • Atomic next action banner: v3.06.0 → v3.08.0
    • New §62 sub-section ABOVE §61 (newest-first), with 7 sub-sub-sections
  • evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
    • humaneval-10-result.json (raw apr eval --json output)
    • findings.json (structured 3-PR cascade record)

Ship-% Movement

  • MODEL-1 ship %: stays at 94% pending full 164-problem run
  • MODEL-2 ship %: unchanged at 57%

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 11, 2026 08:21
…1 on 10-problem HumanEval sample (PMAT-CODE-SHIP-TWO-SECTION-62)

Records the closure of §61.8 Branch A (APR + ChatML "\ns\ns"
degenerate output bug) across THREE same-class PRs, plus the LIVE
10-problem HumanEval empirical signal for SHIP-005.

Branch A closure pattern (3 PRs, same defect class, 3 call sites):
- PR #1615 — apr-cli/src/commands/output_verification.rs::golden_output_apr
  Reroute through realizar::run_inference + with_input_tokens.
  Discharge: SHIP-006 LIVE (apr qa 12/12 gates).
- PR #1616 — apr-cli/src/commands/eval/inference.rs::run_humaneval_inference
  Reroute through same path. Model emits canonical solution
  structure but Python test FAILs on whitespace artifact.
- PR #1617 — apr-cli/src/commands/eval/inference.rs::align_continuation_indent
  NEW post-processing fn: dedent over-indented body by N spaces;
  stop at first 0-indent non-empty line (preserve post-amble).
  Discharge: HumanEval/0 1/1 PASS post-fix.
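
The post-processing step described for PR #1617 can be illustrated in Python. This is a hypothetical rendering of the described behaviour, not the Rust implementation in eval/inference.rs: the `target_indent` parameter, and the choice to derive N from the first non-empty line, are assumptions made for the sketch.

```python
def align_continuation_indent(continuation: str, target_indent: int = 4) -> str:
    """Dedent an over-indented body; leave the post-amble untouched.

    Assumption: N (the dedent amount) is the indent of the first non-empty
    line minus target_indent. The real function may compute N differently.
    """
    lines = continuation.split("\n")
    first = next((ln for ln in lines if ln.strip()), None)
    if first is None:
        return continuation  # nothing but blank lines
    excess = (len(first) - len(first.lstrip(" "))) - target_indent
    if excess <= 0:
        return continuation  # body is not over-indented

    out = []
    in_body = True
    for ln in lines:
        indent = len(ln) - len(ln.lstrip(" "))
        if in_body and ln.strip() and indent == 0:
            in_body = False  # first 0-indent non-empty line starts the post-amble
        if in_body and ln.strip():
            # Strip at most `excess` spaces, never more than the line has.
            out.append(ln[min(excess, indent):])
        else:
            out.append(ln)  # blank lines and post-amble pass through unchanged
    return "\n".join(out)


raw = (
    "        for x in numbers:\n"
    "            if x > threshold:\n"
    "                return True\n"
    "        return False\n"
    "\n"
    "print('post-amble')\n"
    "    untouched = 1"
)
print(align_continuation_indent(raw))
```

In this example the body moves from 8 spaces to 4 while relative nesting is kept, and everything from `print('post-amble')` onward is emitted as-is.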

LIVE 10-problem HumanEval sample (2026-05-11, lambda-vector RTX 4090):
- apr eval <canonical 7B APR teacher> --task humaneval --data <10> --samples 1 --temperature 0.0
- Result: passed = 8/10 = 80% pass@1
- Per-problem: HumanEval/0/1/3/4/5/7/8/9 PASS; /2 /6 FAIL
- 95% binomial CI on 8/10: [44%, 97%]; the 86% nominal SHIP-005
  floor lies inside this interval, so N=10 can neither confirm
  nor rule out meeting it
- Full 164-problem run dispatched in background
  (`/tmp/he-164-result.json`, ~5h CPU wall, pre-authorized per
  feedback_compute_pre_authorized.md 48h ceiling)

Five-Whys for the §62 amendment:
1. Why §62 now and not wait for 164 result? The 3-PR closure is
   a substantial cascade record that deserves spec-level
   permanence; 164-result is a separate "ship-%-flip" event that
   gets its own follow-up amendment when it lands.
2. Why 3 PRs for one bug class? The legacy AprTransformer path
   was wired in 3 distinct callsites (golden_output, humaneval,
   indent-residual post-processing). Each needs its own surgical
   reroute / post-process — fixing one doesn't fix the others.
3. Why is methodology lesson #10 worth recording? Prior
   methodology lessons (#6-#9) covered single-bug cascades. #10
   generalises: "single bug class" may need multi-PR surgical
   fixes when it manifests across multiple call sites.
4. Why is a 10-problem sample enough confidence to dispatch the full 164?
   The contract floor (84.80% effective) lies inside the sample's 95%
   binomial CI of [44%, 97%], so the observed 80% does not contradict
   it. The full 164 dispatch raises N from 10 to 164 and gives a much
   tighter CI.
5. Why bump spec v3.07.0 → v3.08.0 now? §62 is a substantive
   record of 3-PR cascade closure + new empirical evidence; it
   warrants a minor version bump.
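
To put a rough number on "much tighter CI" in item 4, the sketch below compares the normal-approximation standard error of a pass rate at N=10 and N=164. The pass rate of 0.8 is carried over from the 10-problem sample purely for illustration; the normal approximation is crude at N=10, so treat this as a scaling argument rather than an interval estimate.

```python
from math import sqrt


def standard_error(p: float, n: int) -> float:
    """Normal-approximation standard error of a binomial proportion."""
    return sqrt(p * (1 - p) / n)


p_hat = 0.8  # illustrative only: the observed 10-problem pass rate
se_10 = standard_error(p_hat, 10)
se_164 = standard_error(p_hat, 164)
shrink = se_10 / se_164  # equals sqrt(164 / 10) regardless of p_hat

print(f"SE at N=10:  {se_10:.3f}")
print(f"SE at N=164: {se_164:.3f}")
print(f"shrink factor: {shrink:.2f}x")
```

The shrink factor depends only on the ratio of sample sizes, so the interval narrows by about a factor of four whatever pass rate the full run produces.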

Changes (1 spec file + 1 evidence directory):
- docs/specifications/aprender-train/ship-two-models-spec.md:
  - Atomic next action banner: v3.06.0 → v3.08.0 (skips v3.07.0
    which was claimed by PR #1611 in queue — once that lands,
    rebase to renumber if needed)
  - New §62 sub-section ABOVE §61 (newest-first ordering), with
    7 sub-sub-sections: 62.1 3-PR cascade table, 62.2 10-problem
    LIVE evidence, 62.3 sample-vs-floor analysis, 62.4 164-run
    dispatch, 62.5 methodology lesson #10, 62.6 ship-% movement,
    62.7 what §62 is NOT
- evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
  - humaneval-10-result.json (raw apr eval --json output)
  - findings.json (structured 3-PR cascade record + per-problem
    pass results + dispatch metadata)

Validation:
- Section format consistent with §61 (newest-first, dated, sub-
  sections numbered §62.X)
- All 3 cascade PRs referenced explicitly
- Empirical evidence reproducible via captured JSON

Spec movement:
- v3.06.0 → v3.08.0
- MODEL-1 ship %: stays at 94% pending 164-run completion
- MODEL-2 ship %: unchanged at 57%

Refs:
- evidence/section-62-branch-a-closure-2026-05-11/findings.json (LIVE evidence)
- PR #1615 (SHIP-006 fix + LIVE discharge — golden_output_apr)
- PR #1616 (HumanEval inference path fix)
- PR #1617 (HumanEval indent residual fix — align_continuation_indent)
- SPEC-SHIP-TWO-001 §61.8 (Branch A vs Branch B taxonomy)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- feedback_compute_pre_authorized.md (lambda-labs 48h ceiling)

Closes task #35 PMAT-CODE-SHIP-TWO-SECTION-62.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/ship-two-spec-section-62-branch-a-closure branch from 7cfb225 to 062b720 Compare May 11, 2026 14:38
@noahgift

Closing as superseded — the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.

@noahgift noahgift closed this May 12, 2026
auto-merge was automatically disabled May 12, 2026 15:30

Pull request was closed
