fix(apr-cli): route HumanEval inference through run_inference (Branch A continuation) by noahgift · Pull Request #1616 · paiml/aprender

noahgift · 2026-05-11T07:35:24Z

Summary

Continuation of the §61.8 Branch A bug-class fix from PR #1615 (SHIP-006). The HumanEval evaluation harness run_humaneval_inference was using the same broken legacy AprTransformer + forward_with_cache + AprKVCache path. Reroute through realizar::run_inference + InferenceConfig::with_input_tokens (the working path used by SHIP-002/006/008 LIVE-discharges).

Why Not Claim SHIP-005 LIVE-Discharge?

Smoke test on HumanEval/0 shows:

Pre-fix: 0/1 pass — model produces broken output via legacy AprTransformer.
Post-fix: the model produces the canonical pairwise-comparison solution structure (for i in range(len(numbers)): for j in range(i+1, len(numbers)): if abs(numbers[i]-numbers[j]) < threshold: return True; return False), but the test still fails because of a leading-whitespace alignment artifact:
- The model emits the for i... line with a 5-space indent instead of the expected 4-space indent.
- Concatenated with the prompt, whose docstring close (""" followed by \n) sits at a 4-space indent, the 5-space for-loop line makes Python raise an IndentationError.

This is a separate residual issue in raw-continuation tokenization at the prompt-completion boundary. Manual apr run on the same model with auto-wrap produces clean 4-space output, so the model itself is healthy.
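The failure mode can be reproduced without running Python. The following is a minimal heuristic sketch, not code from this PR: the function name `first_unexpected_indent` is hypothetical, and the check deliberately ignores multi-line strings, parentheses, and line continuations. It flags the first line whose indent grows although the previous non-empty line did not open a block with a trailing colon, which is the situation created by a 4-space docstring close followed by a 5-space `for` line.

```rust
/// Heuristic pre-flight check (hypothetical helper, not part of apr-cli).
/// Returns the 1-based number of the first line whose indent increases even
/// though the previous non-empty line does not end with ':'.
/// Limitation: multi-line docstrings and bracketed continuations are not
/// modelled, so this can report false positives on real HumanEval prompts.
fn first_unexpected_indent(source: &str) -> Option<usize> {
    // (indent, opens_block) of the previous non-empty line.
    let mut prev: Option<(usize, bool)> = None;
    for (idx, line) in source.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        let indent = line.chars().take_while(|c| *c == ' ').count();
        if let Some((prev_indent, opens_block)) = prev {
            if indent > prev_indent && !opens_block {
                return Some(idx + 1);
            }
        }
        prev = Some((indent, line.trim_end().ends_with(':')));
    }
    None
}
```

Applied to a prompt ending in a 4-space docstring close plus a completion starting at 5 spaces, the sketch reports the `for` line; with a 4-space completion it reports nothing.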

Fix

crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference:

Replace load_humaneval_model + forward_with_cache + AprKVCache + manual sampling with realizar::run_inference per problem
Use InferenceConfig::with_input_tokens to pass pre-tokenized raw Python (bypasses ChatML auto-wrap — HumanEval is raw-continuation, not chat)
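To illustrate why the raw-continuation path matters, here is a small sketch of the two ways a prompt can reach the model. It is an assumption-laden illustration: `chatml_wrap`, `PromptMode`, and `build_model_input` are hypothetical names, not the `realizar` API, and the wrapper only mimics the general ChatML turn layout rather than the project's `prepare_tokens_apr`.

```rust
/// Wrap a user message in a ChatML-style turn.
/// Illustrative only; the real auto-wrap lives in the project's own code.
fn chatml_wrap(user_message: &str) -> String {
    format!(
        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
        user_message
    )
}

/// How a prompt is handed to the model (hypothetical enum).
enum PromptMode {
    /// Instruct-style: the prompt becomes a user turn.
    Chat,
    /// HumanEval-style: the model continues the text verbatim.
    RawContinuation,
}

/// Build the text that would be tokenized for the given mode.
fn build_model_input(prompt: &str, mode: PromptMode) -> String {
    match mode {
        PromptMode::Chat => chatml_wrap(prompt),
        PromptMode::RawContinuation => prompt.to_string(),
    }
}
```

In chat mode the Python signature and docstring end up inside a user turn, so the model answers as an assistant instead of continuing the function body; in raw-continuation mode the prompt is passed through unchanged, which is what HumanEval expects.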

Validation

cargo build -p apr-cli --release --features cuda — clean
Smoke test: model produces canonical solution structure (verified)
Test still fails on indentation alignment — documented as residual
Pattern matches the fix for golden_output_apr (SHIP-006) in PR #1615 (fix(apr-cli) + feat(contracts): SHIP-006 PARTIAL → DISCHARGED + Branch A bug fix)

Residual (Follow-up PR)

Leading-whitespace alignment in raw-continuation HumanEval outputs. Three possible fixes:

Post-process completion to normalize indentation
Prompt engineering nudge toward 4-space
Investigate tokenizer's space-prefix behavior at boundary

Ship-% Movement

MODEL-1 ship %: unchanged at 94% (infrastructure fix; LIVE discharge deferred)
MODEL-2 ship %: unchanged at 57%

🤖 Generated with Claude Code

…ODE-SHIP-005-FIX)

Same Branch A bug class as PR #1615 (SHIP-006 fix). The HumanEval evaluation harness `run_humaneval_inference` was using the legacy `AprTransformer::from_apr_file + forward_with_cache + AprKVCache` path that SHIP-002, SHIP-006, and SHIP-008 LIVE-discharges proved broken on the canonical 7B teacher. Reroute through `realizar::run_inference + InferenceConfig::with_input_tokens` (the working path used by all three prior LIVE-discharges).

Five-Whys:
1. Why HumanEval evaluation 0/3 pass on canonical 7B teacher? Same bug class as SHIP-006 golden_output_apr: the legacy AprTransformer path produces broken output.
2. Why is AprTransformer broken? Pre-§60 the APR forward path wasn't routed through Q4K+Q8K dispatch; M-FFN-GGUF-5 fix (#1550) updated `forward_traced` but not the standalone `forward_with_cache` path.
3. Why fix the call site? Routing through `run_inference` uses the path proven via SHIP-002/006/008; minimum-risk fix.
4. Why `with_input_tokens` not `with_prompt`? HumanEval prompts are raw Python code with docstrings; passing via `with_prompt` would trigger `prepare_tokens_apr`'s ChatML auto-wrap, which would wrap raw Python in `<|im_start|>user...` (off-spec for HumanEval, which is raw-continuation evaluation).
5. Why ship this WITHOUT claiming SHIP-005 LIVE discharge? Smoke test shows the model now produces semantically-correct solutions (canonical pairwise comparison for HumanEval/0) but with a leading-whitespace artifact (5-space indent vs expected 4-space). This is a separate residual issue in raw-continuation tokenization that needs its own investigation. The inference-path fix is independently valuable and unblocks the next step.

Fix (1 file changed):
- `crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference`:
  - Replace `load_humaneval_model` + `forward_with_cache` + `AprKVCache` + manual sampling loop with `realizar::run_inference` per problem
  - Use `InferenceConfig::with_input_tokens` to pass pre-tokenized raw-Python prompt (bypasses ChatML auto-wrap)
  - Slice completion from `result.text` by stripping the prompt prefix, with token-level fallback if text doesn't begin with prompt verbatim

LIVE Evidence (2026-05-11, noah-Lambda-Vector RTX 4090):
- `apr eval <canonical 7B APR teacher> --task humaneval --data <1-problem> --samples 1 --temperature 0.0 -v`:
  - Pre-fix: HumanEval/0 → 0/1 pass (broken legacy AprTransformer path)
  - Post-fix: HumanEval/0 → semantically-correct completion produced (canonical pairwise-comparison `for i in range(len(numbers)): for j in range(i+1, len(numbers)): if abs(numbers[i]-numbers[j]) < threshold: return True; return False`), but test still FAILs due to leading-whitespace alignment artifact (5-space vs expected 4-space).
- Manual `apr run --prompt <prompt>` on same model produces clean 4-space-indent output; this confirms the model is healthy and the bug is specific to raw-continuation tokenization.

Validation:
- cargo build -p apr-cli --release --features cuda ✓ (clean)
- Smoke test: model produces canonical solution structure (verified manually); execute_python_test fails on indentation only

Residual (NOT in this PR; separate follow-up):
- Leading-whitespace alignment in raw-continuation HumanEval outputs. Model emits `     for i...` (5-space indent) instead of `    for i...` (4-space indent) after the `    """\n` prompt suffix. Needs either: (a) post-process completion to normalize indentation, (b) prompt engineering to nudge model toward 4-space, (c) investigate tokenizer's space-prefix behavior at the prompt-completion boundary. This residual blocks SHIP-005 LIVE-discharge; will be addressed in a follow-up PR.

Spec movement:
- MODEL-1 ship %: unchanged at 94% (infrastructure fix; LIVE discharge of SHIP-005 deferred pending whitespace residual)
- MODEL-2 ship %: unchanged at 57%

Refs:
- crates/apr-cli/src/commands/output_verification.rs:492 (same fix pattern shipped in PR #1615 for golden_output_apr)
- contracts/qwen2-e2e-verification-v1.yaml FALSIFY-QW2E-SHIP-005
- SPEC-SHIP-TWO-001 §61.8 (Branch A bug class)

Closes the infrastructure portion of task #33 PMAT-CODE-SHIP-005-FIX-DISCHARGE. LIVE discharge of SHIP-005 remains a follow-up task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
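The completion-slicing step described in the commit message (strip the prompt prefix, fall back to tokens when the text does not begin with the prompt verbatim) might look roughly like the sketch below. Everything here is an assumption for illustration: `slice_completion`, its parameters, and the `decode` closure standing in for a tokenizer are hypothetical and do not reflect the actual `realizar` result type.

```rust
/// Sketch of prompt-prefix slicing with a token-level fallback.
/// `decode` is a stand-in for a tokenizer's decode function (assumption).
fn slice_completion<F>(
    prompt: &str,
    text: &str,
    tokens: &[u32],
    prompt_token_count: usize,
    decode: F,
) -> String
where
    F: Fn(&[u32]) -> String,
{
    // Fast path: the generated text begins with the prompt verbatim.
    if let Some(rest) = text.strip_prefix(prompt) {
        return rest.to_string();
    }
    // Fallback: drop the prompt tokens and decode only the generated tail.
    let start = prompt_token_count.min(tokens.len());
    decode(&tokens[start..])
}
```

The fallback exists because decoding can normalize whitespace or special tokens, so a byte-for-byte prefix match is not guaranteed even when the token sequence does start with the prompt.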

…05 whitespace residual (#1617)

* fix(apr-cli): route HumanEval inference through run_inference (PMAT-CODE-SHIP-005-FIX)

* fix(apr-cli): align HumanEval raw-continuation indent (PMAT-CODE-SHIP-005-WHITESPACE-RESIDUAL)

Closes the whitespace residual flagged by PR #1616. Model emits a 1-space over-indent at the prompt-completion boundary on raw-continuation HumanEval prompts (where the prompt ends with `    """\n` and the function body must be at 4-space indent). The BPE tokenizer encodes ` for` (1-leading-space) as a common starting token after a post-docstring `\n`, producing 5-space indent when concatenated.

Fix: `align_continuation_indent(prompt, completion)` post-processes the completion before Python execution:
1. Compute prompt's expected continuation indent (last non-empty line's leading-space count).
2. Compute completion's first non-empty line indent.
3. If completion is over-indented by N spaces, dedent every line inside the function body by N.
4. Stop dedenting at the first 0-indent non-empty line (top-level code like an `if __name__ == "__main__":` post-amble; preserve its scope).

Five-Whys:
1. Why HumanEval/0 FAIL post-PR-#1616? IndentationError on concatenated `    """\n     for i...` (5-space body indent).
2. Why does model emit 5-space? BPE token ` for` (1-leading-space) gets appended after the prompt's `\n`; effective indent is prompt's 4 + token's 1 = 5.
3. Why didn't `apr run` (auto-wrap path) show this? Auto-wrap passes through ChatML, which puts the model in assistant role, so the model writes fresh code with the canonical 4-space indent. Raw-continuation puts the model at the function-body boundary where the tokenizer adds the extra space.
4. Why post-process rather than fix tokenization? Post-processing is the conservative one-PR fix; tokenization changes have a much wider blast radius (would affect every raw-continuation call across the stack).
5. Why scope-track (`in_body` flag) instead of dedenting uniformly? Completions often include top-level post-amble like `if __name__ == "__main__":\n    pass`. The `    pass` is at the test-runner's indent level (4), not the function's; if we dedent uniformly, we corrupt the post-amble to `   pass` (3-space, broken Python). Stop dedenting at the first non-empty 0-indent line.

LIVE Evidence (2026-05-11, noah-Lambda-Vector RTX 4090):
- HumanEval/0 single-problem smoke (~115s):
  - Pre-fix: pass@1 = 0/1 (IndentationError on 5-space body)
  - Post-fix: pass@1 = **1/1 = 100%** (canonical pairwise comparison `for i in range(len(numbers)): for j in range(i+1, ...): ...` now Python-executes cleanly)
- 6 unit tests added (`align_indent_tests`):
  - `dedents_one_excess_space` ✓ (the SHIP-005 baseline case)
  - `passthrough_when_already_correct` ✓ (no-op safety)
  - `leaves_zero_indent_lines_untouched` ✓ (scope-track safety)
  - `dedents_multi_space_excess` ✓ (N-space generalisation)
  - `empty_completion` ✓ (degenerate input safety)
  - `no_indent_anywhere` ✓ (early-return guard)

Fix (1 file changed):
- `crates/apr-cli/src/commands/eval/inference.rs`:
  - + new fn `align_continuation_indent(prompt, completion) -> String` (6-section mutation survey)
  - Hook into `run_humaneval_inference` after `truncate_at_function_boundary` and before `execute_python_test`

Validation:
- cargo test -p apr-cli --release --features cuda commands::eval::inference → 6 passed, 0 failed
- cargo build -p apr-cli --release --features cuda ✓ (clean)
- LIVE HumanEval/0 1/1 PASS

Spec movement (DEFERRED, not in this PR):
- This is the LAST infrastructure blocker for SHIP-005 LIVE discharge.
- Full 164-problem run on canonical 7B teacher dispatched separately.
- Once SHIP-005 LIVE-discharges: MODEL-1 ship % 94% → 95%.

Refs:
- crates/apr-cli/src/commands/output_verification.rs:492 (PR #1615, sibling fix)
- crates/apr-cli/src/commands/eval/inference.rs (PR #1616, eval inference path fix)
- contracts/qwen2-e2e-verification-v1.yaml FALSIFY-QW2E-SHIP-005
- SPEC-SHIP-TWO-001 §61.8 (Branch A bug class)

Closes task #34 PMAT-CODE-SHIP-005-WHITESPACE-RESIDUAL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
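The four post-processing steps listed in this commit message can be sketched as follows. This is a reconstruction from the prose only, not the merged implementation: the name `align_indent_sketch` is hypothetical, and details such as how whitespace-only lines or tabs are treated are assumptions.

```rust
/// Count leading ASCII spaces on a line.
fn leading_spaces(line: &str) -> usize {
    line.chars().take_while(|c| *c == ' ').count()
}

/// Sketch of the described indent alignment (hypothetical reconstruction).
fn align_indent_sketch(prompt: &str, completion: &str) -> String {
    // Step 1: expected indent = indent of the prompt's last non-empty line.
    let expected = prompt
        .lines()
        .rev()
        .find(|l| !l.trim().is_empty())
        .map(leading_spaces);
    // Step 2: actual indent = indent of the completion's first non-empty line.
    let actual = completion
        .lines()
        .find(|l| !l.trim().is_empty())
        .map(leading_spaces);
    // Nothing to do unless the completion is over-indented.
    let excess = match (expected, actual) {
        (Some(e), Some(a)) if a > e => a - e,
        _ => return completion.to_string(),
    };
    let mut in_body = true;
    let mut out: Vec<String> = Vec::new();
    for line in completion.lines() {
        let indent = leading_spaces(line);
        // Step 4: a non-empty line at indent 0 ends the function body.
        if in_body && !line.trim().is_empty() && indent == 0 {
            in_body = false;
        }
        // Step 3: inside the body, remove the excess leading spaces.
        if in_body && indent >= excess {
            out.push(line[excess..].to_string());
        } else {
            out.push(line.to_string());
        }
    }
    let mut result = out.join("\n");
    if completion.ends_with('\n') {
        result.push('\n');
    }
    result
}
```

With a prompt ending at a 4-space indent and a completion starting at 5 spaces, every body line loses one space while a trailing top-level `if __name__ == "__main__":` block and its 4-space `pass` are left untouched.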

…1 on 10-problem HumanEval sample (PMAT-CODE-SHIP-TWO-SECTION-62)

Records the closure of §61.8 Branch A (APR + ChatML "\ns\ns" degenerate output bug) across THREE same-class PRs, plus the LIVE 10-problem HumanEval empirical signal for SHIP-005.

Branch A closure pattern (3 PRs, same defect class, 3 call sites):
- PR #1615: apr-cli/src/commands/output_verification.rs::golden_output_apr. Reroute through realizar::run_inference + with_input_tokens. Discharge: SHIP-006 LIVE (apr qa 12/12 gates).
- PR #1616: apr-cli/src/commands/eval/inference.rs::run_humaneval_inference. Reroute through same path. Model emits canonical solution structure but Python test FAILs on whitespace artifact.
- PR #1617: apr-cli/src/commands/eval/inference.rs::align_continuation_indent. NEW post-processing fn: dedent over-indented body by N spaces; stop at first 0-indent non-empty line (preserve post-amble). Discharge: HumanEval/0 1/1 PASS post-fix.

LIVE 10-problem HumanEval sample (2026-05-11, lambda-vector RTX 4090):
- apr eval <canonical 7B APR teacher> --task humaneval --data <10> --samples 1 --temperature 0.0
- Result: passed = 8/10 = 80% pass@1
- Per-problem: HumanEval/0/1/3/4/5/7/8/9 PASS; /2 /6 FAIL
- 95% binomial CI on 8/10: [44%, 97%], so the sample is within statistical noise of the 86% nominal SHIP-005 floor
- Full 164-problem run dispatched in background (`/tmp/he-164-result.json`, ~5h CPU wall, pre-authorized per feedback_compute_pre_authorized.md 48h ceiling)

Five-Whys for the §62 amendment:
1. Why §62 now and not wait for 164 result? The 3-PR closure is a substantial cascade record that deserves spec-level permanence; 164-result is a separate "ship-%-flip" event that gets its own follow-up amendment when it lands.
2. Why 3 PRs for one bug class? The legacy AprTransformer path was wired in 3 distinct callsites (golden_output, humaneval, indent-residual post-processing). Each needs its own surgical reroute / post-process; fixing one doesn't fix the others.
3. Why is methodology lesson #10 worth recording? Prior methodology lessons (#6-#9) covered single-bug cascades. #10 generalises: a "single bug class" may need multi-PR surgical fixes when it manifests across multiple call sites.
4. Why is a 95% binomial CI on N=10 enough confidence to dispatch the full 164? The contract floor (84.80% effective) lies well within the [44%, 97%] CI around the 10-problem sample's 80%. The full 164 dispatch raises N from 10 to 164 and gives a much tighter CI.
5. Why bump spec v3.07.0 → v3.08.0 now? §62 is a substantive record of 3-PR cascade closure + new empirical evidence; it warrants a minor version bump.

Changes (1 spec file + 1 evidence directory):
- docs/specifications/aprender-train/ship-two-models-spec.md:
  - Atomic next action banner: v3.06.0 → v3.08.0 (skips v3.07.0, which was claimed by PR #1611 in queue; once that lands, rebase to renumber if needed)
  - New §62 sub-section ABOVE §61 (newest-first ordering), with 7 sub-sub-sections: 62.1 3-PR cascade table, 62.2 10-problem LIVE evidence, 62.3 sample-vs-floor analysis, 62.4 164-run dispatch, 62.5 methodology lesson #10, 62.6 ship-% movement, 62.7 what §62 is NOT
- evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
  - humaneval-10-result.json (raw apr eval --json output)
  - findings.json (structured 3-PR cascade record + per-problem pass results + dispatch metadata)

Validation:
- Section format consistent with §61 (newest-first, dated, sub-sections numbered §62.X)
- All 3 cascade PRs referenced explicitly
- Empirical evidence reproducible via captured JSON

Spec movement:
- v3.06.0 → v3.08.0
- MODEL-1 ship %: stays at 94% pending 164-run completion
- MODEL-2 ship %: unchanged at 57%

Refs:
- evidence/section-62-branch-a-closure-2026-05-11/findings.json (LIVE evidence)
- PR #1615 (SHIP-006 fix + LIVE discharge: golden_output_apr)
- PR #1616 (HumanEval inference path fix)
- PR #1617 (HumanEval indent residual fix: align_continuation_indent)
- SPEC-SHIP-TWO-001 §61.8 (Branch A vs Branch B taxonomy)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- feedback_compute_pre_authorized.md (lambda-labs 48h ceiling)

Closes task #35 PMAT-CODE-SHIP-TWO-SECTION-62.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
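For readers who want to sanity-check the quoted interval on 8/10, an exact (Clopper-Pearson) binomial interval can be computed with plain bisection. Whether the commit used this particular method is an assumption; the function names below are hypothetical and the sketch is only meant for small N.

```rust
/// Binomial probability mass P(X = k) for X ~ Bin(n, p).
fn binom_pmf(n: u32, k: u32, p: f64) -> f64 {
    let mut coeff = 1.0_f64;
    for i in 0..k {
        coeff *= (n - i) as f64 / (i + 1) as f64;
    }
    coeff * p.powi(k as i32) * (1.0 - p).powi((n - k) as i32)
}

/// Binomial cumulative probability P(X <= k).
fn binom_cdf(n: u32, k: u32, p: f64) -> f64 {
    (0..=k).map(|i| binom_pmf(n, i, p)).sum()
}

/// Bisection on [0, 1] for an increasing function f with f(p) = target.
fn solve_increasing<F: Fn(f64) -> f64>(f: F, target: f64) -> f64 {
    let (mut lo, mut hi) = (0.0_f64, 1.0_f64);
    for _ in 0..100 {
        let mid = 0.5 * (lo + hi);
        if f(mid) < target {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    0.5 * (lo + hi)
}

/// Exact two-sided interval for k successes out of n at level 1 - alpha.
fn clopper_pearson(k: u32, n: u32, alpha: f64) -> (f64, f64) {
    // Lower bound: P(X >= k | p) = alpha / 2, increasing in p.
    let lower = if k == 0 {
        0.0
    } else {
        solve_increasing(|p| 1.0 - binom_cdf(n, k - 1, p), alpha / 2.0)
    };
    // Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2.
    let upper = if k == n {
        1.0
    } else {
        solve_increasing(|p| 1.0 - binom_cdf(n, k, p), 1.0 - alpha / 2.0)
    };
    (lower, upper)
}
```

Calling `clopper_pearson(8, 10, 0.05)` gives bounds of roughly 0.44 and 0.97, consistent with the [44%, 97%] figure above and wide enough to contain both the 86% nominal and 84.80% effective floors.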

…-CODE-MBPP-DIAG-001)

The §69 diagnostic surface (PR #1634) and §70 RC3 fix (PR #1635) closed the harness-bug class for HumanEval. MBPP's path (run_mbpp_inference + run_mbpp_inference_cuda) was not yet instrumented. This PR extends APR_EVAL_DEBUG to MBPP so future investigation of MBPP failures has ground-truth diagnostics on the same surface.

What changes:
- run_mbpp_inference (CPU path) now calls execute_python_test_with_diagnostics and emits /tmp/apr_eval_debug_MBPP_<task>.json when APR_EVAL_DEBUG=1 is set.
- run_mbpp_inference_cuda (CUDA path) gets the same treatment.

What does NOT change:
- run_mbpp_inference still uses the legacy AprTransformer::forward_with_cache + AprKVCache path. PMAT-CODE-SHIP-005-FIX (PR #1616) replaced this for HumanEval with realizar::run_inference + OwnedQuantizedModel::from_apr. MBPP needs the same routing fix, but that's a separate multi-PR cascade scope (also includes H4 ChatML wrap + R1+R2 extraction equivalents for MBPP). Out of scope for this PR.
- MBPP prompts are natural language (not Python signatures), so the §70 RC3 import-stripping bug does NOT apply to MBPP.

Why ship this now:
- Pure diagnostic: zero behaviour change for non-APR_EVAL_DEBUG callers
- Lets us run a 1-problem MBPP smoke under APR_EVAL_DEBUG=1 to verify the legacy path's failure mode (currently undiagnosed)
- Mirrors the pattern that successfully diagnosed §69 RC3 in 5 minutes on gx10

Test plan:
- [x] cargo check -p apr-cli --features inference → clean
- [x] cargo check -p apr-cli --features "inference,cuda,training" → clean
- [x] cargo fmt --all → clean
- [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice; will document MBPP failure mode in a §72-class amendment)

Refs:
- crates/apr-cli/src/commands/eval/inference.rs::write_apr_eval_debug
- contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0
- PR #1634 (HumanEval diagnostic surface)
- PR #1635 (HumanEval RC3 fix; cascade base for this branch)

Closes task #53 (MBPP harness diagnostic extension; renamed from "RC3 prompt-preamble fix" since RC3 does not apply to MBPP's NL prompts; that decision recorded in commit body).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
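An environment-gated diagnostic dump of this kind could be sketched as below. This is not the project's `write_apr_eval_debug`: the function names, the JSON fields (`suite`, `task`, `passed`, `stderr`), and the hand-rolled escaping are all assumptions chosen to keep the example dependency-free; only the `APR_EVAL_DEBUG=1` gate and the `apr_eval_debug_<suite>_<task>.json` file-name shape come from the commit message.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True when the flag value is exactly "1" (assumed semantics).
fn debug_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Read the gate from the process environment.
fn debug_enabled_from_env() -> bool {
    debug_enabled(std::env::var("APR_EVAL_DEBUG").ok().as_deref())
}

/// Minimal JSON string escaping for the illustrative record.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Write one diagnostic record when enabled; return the path if written.
/// Field names are hypothetical, not the real debug schema.
fn write_debug_record(
    dir: &Path,
    suite: &str,
    task: &str,
    passed: bool,
    stderr: &str,
    enabled: bool,
) -> io::Result<Option<PathBuf>> {
    if !enabled {
        return Ok(None); // zero behaviour change when the flag is off
    }
    let path = dir.join(format!("apr_eval_debug_{}_{}.json", suite, task));
    let body = format!(
        "{{\"suite\":\"{}\",\"task\":\"{}\",\"passed\":{},\"stderr\":\"{}\"}}",
        json_escape(suite),
        json_escape(task),
        passed,
        json_escape(stderr)
    );
    fs::write(&path, body)?;
    Ok(Some(path))
}
```

A caller would pass `debug_enabled_from_env()` as the `enabled` argument; keeping the gate as a plain boolean makes the writer testable without mutating process environment variables.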

…ode-block extraction (PMAT-CODE-MBPP-H4-FIX)

Mirrors the §70 HumanEval H4 + R1+R2 cascade (PRs #1616, #1628 squashed via #1634/#1635) for MBPP. The legacy `AprTransformer::forward_with_cache + AprKVCache` path was producing NL-prose continuations on MBPP prompts (see PR #1641 MBPP/11 smoke: SyntaxError on "Example:" prose, 0/1 pass).

Changes:
- Replace `AprTransformer::forward_with_cache + AprKVCache` loop with `realizar::run_inference + InferenceConfig::with_prompt` (ChatML auto-wrap for instruct models).
- Parse `\`\`\`python ... \`\`\`` markdown blocks from the response via `extract_python_code_block_targeted(&result.text, None)`. MBPP has no `entry_point` in the problem schema; first-non-empty-block fallback is appropriate.
- Raw-continuation fallback preserved: strip prompt prefix, truncate at next top-level def. Used when no markdown block is found.

Out of scope (vs HumanEval cascade):
- §70 RC3 prompt-preamble handling: MBPP prompts are NL ("Write a python function to..."), no Python imports to preserve. `extract_prompt_preamble` not applicable.
- §17.5 chain impact: MBPP is not in §17.5; this PR does not move ship %.
- Full 500-problem rerun: dispatch as a separate evidence slice.

Test plan:
- [x] cargo check -p apr-cli --features inference → clean
- [x] cargo fmt --all → clean
- [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice)
- [ ] gx10 sanitized-subset MBPP rerun for pass@1 measurement

Refs:
- crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference (mirror)
- PR #1641 (MBPP diagnostic surface, cascade base)
- evidence/section-71-ship-005-discharged-2026-05-12/ (HumanEval cascade pattern)
- project_2026_05_12_mbpp_legacy_path_finding.md (cascade scope)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
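The extraction order described here (first non-empty fenced Python block, otherwise the raw-continuation fallback) can be illustrated with a small sketch. It is not the project's `extract_python_code_block_targeted`: the names `first_python_block`, `truncate_at_next_top_level_def`, and `extract_code` are hypothetical, and the top-level-def detection is a simple string search assumed for illustration.

```rust
/// Return the body of the first non-empty fenced python block, if any.
fn first_python_block(response: &str) -> Option<String> {
    const OPEN: &str = "```python";
    let mut rest = response;
    while let Some(open) = rest.find(OPEN) {
        let after_tag = &rest[open + OPEN.len()..];
        // Skip the remainder of the opening fence line.
        let body_start = after_tag.find('\n').map(|i| i + 1)?;
        let body = &after_tag[body_start..];
        // An unterminated fence yields no block.
        let close = body.find("```")?;
        let code = &body[..close];
        if !code.trim().is_empty() {
            return Some(code.to_string());
        }
        rest = &body[close + 3..];
    }
    None
}

/// Keep text up to (not including) the next line that starts a top-level def.
fn truncate_at_next_top_level_def(code: &str) -> &str {
    match code.find("\ndef ") {
        Some(i) => &code[..i + 1],
        None => code,
    }
}

/// Fenced block first; otherwise strip the prompt prefix and truncate.
fn extract_code(prompt: &str, response: &str) -> String {
    if let Some(code) = first_python_block(response) {
        return code;
    }
    let tail = response.strip_prefix(prompt).unwrap_or(response);
    truncate_at_next_top_level_def(tail).to_string()
}
```

Preferring the fenced block suits chat-style responses, where prose such as "Example:" surrounds the code, while the fallback keeps raw continuations usable when the model emits no fence at all.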

noahgift enabled auto-merge (squash) May 11, 2026 07:35

noahgift merged commit 91387c7 into main May 11, 2026
11 checks passed

noahgift deleted the feat/ship-005-fix-discharge branch May 11, 2026 07:58

This was referenced May 11, 2026

fix(apr-cli): align HumanEval raw-continuation indent — closes SHIP-005 whitespace residual #1617

Merged

docs(spec): SHIP-TWO-001 §62 — §61.8 Branch A fully closed across 3 PRs; 80% pass@1 on 10-problem HumanEval sample #1618

Closed

This was referenced May 11, 2026

docs(spec): SHIP-TWO-001 §64 — mid-cascade status snapshot (15-PR cascade summary; gx10 164-run in flight) #1625

Closed

feat(apr-cli): extend APR_EVAL_DEBUG diagnostic to MBPP harness #1641

Merged

noahgift mentioned this pull request May 12, 2026

fix(apr-cli): route MBPP through realizar::run_inference + ChatML + code extraction #1645

Merged

4 tasks

noahgift merged 1 commit into main from feat/ship-005-fix-discharge
