fix(apr-cli): route HumanEval inference through run_inference (Branch A continuation) #1616
Merged
Conversation
fix(apr-cli): route HumanEval inference through run_inference (PMAT-CODE-SHIP-005-FIX)

Same Branch A bug class as PR #1615 (SHIP-006 fix). The HumanEval evaluation harness `run_humaneval_inference` was using the legacy `AprTransformer::from_apr_file + forward_with_cache + AprKVCache` path that the SHIP-002, SHIP-006, and SHIP-008 LIVE-discharges proved broken on the canonical 7B teacher. Reroute through `realizar::run_inference + InferenceConfig::with_input_tokens` (the working path used by all three prior LIVE-discharges).

Five-Whys:
1. Why does HumanEval evaluation pass 0/3 on the canonical 7B teacher? Same bug class as the SHIP-006 golden_output_apr failure — the legacy AprTransformer path produces broken output.
2. Why is AprTransformer broken? Pre-§60, the APR forward path wasn't routed through Q4K+Q8K dispatch; the M-FFN-GGUF-5 fix (#1550) updated `forward_traced` but not the standalone `forward_with_cache` path.
3. Why fix the call site? Routing through `run_inference` uses the path proven via SHIP-002/006/008 — the minimum-risk fix.
4. Why `with_input_tokens` and not `with_prompt`? HumanEval prompts are raw Python code with docstrings; passing them via `with_prompt` would trigger `prepare_tokens_apr`'s ChatML auto-wrap, which would wrap raw Python in `<|im_start|>user...` (off-spec for HumanEval, which is raw-continuation evaluation).
5. Why ship this WITHOUT claiming SHIP-005 LIVE discharge? The smoke test shows the model now produces semantically correct solutions (the canonical pairwise comparison for HumanEval/0) but with a leading-whitespace artifact (5-space indent vs the expected 4-space). That is a separate residual issue in raw-continuation tokenization that needs its own investigation. The inference-path fix is independently valuable and unblocks the next step.

Fix (1 file changed):
- `crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference`:
  - Replace `load_humaneval_model` + `forward_with_cache` + `AprKVCache` + the manual sampling loop with `realizar::run_inference` per problem
  - Use `InferenceConfig::with_input_tokens` to pass the pre-tokenized raw-Python prompt (bypasses the ChatML auto-wrap)
  - Slice the completion from `result.text` by stripping the prompt prefix, with a token-level fallback if the text doesn't begin with the prompt verbatim

LIVE Evidence (2026-05-11, noah-Lambda-Vector RTX 4090):
- `apr eval <canonical 7B APR teacher> --task humaneval --data <1-problem> --samples 1 --temperature 0.0 -v`:
  - Pre-fix: HumanEval/0 → 0/1 pass (broken legacy AprTransformer path)
  - Post-fix: HumanEval/0 → semantically correct completion produced (the canonical pairwise comparison `for i in range(len(numbers)): for j in range(i+1, len(numbers)): if abs(numbers[i]-numbers[j]) < threshold: return True; return False`), but the test still FAILs due to a leading-whitespace alignment artifact (5-space vs expected 4-space).
- Manual `apr run --prompt <prompt>` on the same model produces clean 4-space-indent output — confirming the model is healthy and the bug is specific to raw-continuation tokenization.

Validation:
- cargo build -p apr-cli --release --features cuda ✓ (clean)
- Smoke test: the model produces the canonical solution structure (verified manually); execute_python_test fails on indentation only

Residual (NOT in this PR — separate follow-up):
- Leading-whitespace alignment in raw-continuation HumanEval outputs. The model emits `for i...` at 5-space indent instead of the expected 4-space indent after the `"""\n` prompt suffix. Needs either: (a) post-processing the completion to normalize indentation, (b) prompt engineering to nudge the model toward 4-space, or (c) investigating the tokenizer's space-prefix behavior at the prompt-completion boundary.

This residual blocks SHIP-005 LIVE-discharge; it will be addressed in a follow-up PR.

Spec movement:
- MODEL-1 ship %: unchanged at 94% (infrastructure fix; LIVE discharge of SHIP-005 deferred pending the whitespace residual)
- MODEL-2 ship %: unchanged at 57%

Refs:
- crates/apr-cli/src/commands/output_verification.rs:492 (same fix pattern shipped in PR #1615 for golden_output_apr)
- contracts/qwen2-e2e-verification-v1.yaml FALSIFY-QW2E-SHIP-005
- SPEC-SHIP-TWO-001 §61.8 (Branch A bug class)

Closes the infrastructure portion of task #33 PMAT-CODE-SHIP-005-FIX-DISCHARGE. LIVE discharge of SHIP-005 remains a follow-up task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
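The prompt-prefix slice with token-level fallback is the subtle part of the reroute. Below is a minimal sketch of that step only; the parameter names and the shape of the inference result (`result_text`, `result_tokens`, a `detokenize` callback) are hypothetical stand-ins, not the real `realizar` API.

```rust
// Sketch of completion slicing after rerouting through run_inference.
// `result_text`/`result_tokens` stand in for fields of the real
// inference result (hypothetical names); `detokenize` for the
// tokenizer's decode function.
fn slice_completion(
    prompt: &str,
    result_text: &str,
    result_tokens: &[u32],
    prompt_token_count: usize,
    detokenize: impl Fn(&[u32]) -> String,
) -> String {
    // Fast path: the decoded text begins with the prompt verbatim.
    if let Some(completion) = result_text.strip_prefix(prompt) {
        return completion.to_string();
    }
    // Token-level fallback: decode only the generated tail, since
    // detokenization may not round-trip the prompt byte-for-byte
    // (exactly the whitespace-boundary territory flagged above).
    let start = prompt_token_count.min(result_tokens.len());
    detokenize(&result_tokens[start..])
}
```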
noahgift added a commit that referenced this pull request on May 11, 2026
…05 whitespace residual (#1617)

* fix(apr-cli): route HumanEval inference through run_inference (PMAT-CODE-SHIP-005-FIX)

* fix(apr-cli): align HumanEval raw-continuation indent (PMAT-CODE-SHIP-005-WHITESPACE-RESIDUAL)

Closes the whitespace residual flagged by PR #1616. The model emits a 1-space over-indent at the prompt-completion boundary on raw-continuation HumanEval prompts (where the prompt ends with `"""\n` and the function body must sit at 4-space indent). The BPE tokenizer encodes ` for` (one leading space) as a common starting token after a post-docstring `\n`, producing a 5-space indent when concatenated.

Fix: `align_continuation_indent(prompt, completion)` post-processes the completion before Python execution:
1. Compute the prompt's expected continuation indent (the last non-empty line's leading-space count).
2. Compute the completion's first non-empty line indent.
3. If the completion is over-indented by N spaces, dedent every line inside the function body by N.
4. Stop dedenting at the first 0-indent non-empty line (top-level code like the `if __name__ == "__main__":` post-amble — preserve its scope).

Five-Whys:
1. Why does HumanEval/0 FAIL post-PR-#1616? IndentationError on the concatenated `"""\n` + 5-space `for i...` body indent.
2. Why does the model emit 5 spaces? The BPE token ` for` (one leading space) gets appended after the prompt's `\n`; the effective indent is the prompt's 4 plus the token's 1 = 5.
3. Why didn't `apr run` (auto-wrap path) show this? Auto-wrap passes through ChatML, which puts the model in the assistant role — the model writes fresh code at the canonical 4-space indent. Raw-continuation puts the model at the function-body boundary, where the tokenizer adds the extra space.
4. Why post-process rather than fix tokenization? Post-processing is the conservative one-PR fix; tokenization changes have a much wider blast radius (they would affect every raw-continuation call across the stack).
5. Why scope-track (the `in_body` flag) instead of dedenting uniformly? Completions often include a top-level post-amble like `if __name__ == "__main__":` followed by a 4-space-indented `pass`. That `pass` is at the test-runner's indent level (4), not the function's; dedenting uniformly would corrupt the post-amble to a 3-space `pass` — broken Python. So dedenting stops at the first non-empty 0-indent line.

LIVE Evidence (2026-05-11, noah-Lambda-Vector RTX 4090):
- HumanEval/0 single-problem smoke (~115s):
  - Pre-fix: pass@1 = 0/1 (IndentationError on the 5-space body)
  - Post-fix: pass@1 = **1/1 = 100%** (the canonical pairwise comparison `for i in range(len(numbers)): for j in range(i+1, ...): ...` now executes cleanly as Python)
- 6 unit tests added (`align_indent_tests`):
  - `dedents_one_excess_space` ✓ (the SHIP-005 baseline case)
  - `passthrough_when_already_correct` ✓ (no-op safety)
  - `leaves_zero_indent_lines_untouched` ✓ (scope-track safety)
  - `dedents_multi_space_excess` ✓ (N-space generalisation)
  - `empty_completion` ✓ (degenerate input safety)
  - `no_indent_anywhere` ✓ (early-return guard)

Fix (1 file changed):
- `crates/apr-cli/src/commands/eval/inference.rs`:
  - New fn `align_continuation_indent(prompt, completion) -> String` (6-section mutation survey)
  - Hooked into `run_humaneval_inference` after `truncate_at_function_boundary` and before `execute_python_test`

Validation:
- cargo test -p apr-cli --release --features cuda commands::eval::inference → 6 passed, 0 failed
- cargo build -p apr-cli --release --features cuda ✓ (clean)
- LIVE HumanEval/0 1/1 PASS

Spec movement (DEFERRED, not in this PR):
- This is the LAST infrastructure blocker for SHIP-005 LIVE discharge.
- The full 164-problem run on the canonical 7B teacher is dispatched separately.
- Once SHIP-005 LIVE-discharges: MODEL-1 ship % 94% → 95%.

Refs:
- crates/apr-cli/src/commands/output_verification.rs:492 (PR #1615 — sibling fix)
- crates/apr-cli/src/commands/eval/inference.rs (PR #1616 — eval inference path fix)
- contracts/qwen2-e2e-verification-v1.yaml FALSIFY-QW2E-SHIP-005
- SPEC-SHIP-TWO-001 §61.8 (Branch A bug class)

Closes task #34 PMAT-CODE-SHIP-005-WHITESPACE-RESIDUAL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
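The four-rule dedent pass above is specified tightly enough to sketch. The following is a minimal stand-alone version under those rules — not the shipped `inference.rs` code, whose exact structure may differ:

```rust
/// Dedent an over-indented raw-continuation completion so its body
/// lines up with the indent the prompt expects. Sketch of the approach
/// described above; trailing-newline handling is elided.
fn align_continuation_indent(prompt: &str, completion: &str) -> String {
    let leading = |s: &str| s.len() - s.trim_start_matches(' ').len();
    // 1. Expected continuation indent: the prompt's last non-empty line
    //    (for HumanEval, the 4-space `"""` docstring close).
    let expected = prompt
        .lines()
        .rev()
        .find(|l| !l.trim().is_empty())
        .map(leading)
        .unwrap_or(0);
    // 2. Actual indent: the completion's first non-empty line.
    let Some(actual) = completion
        .lines()
        .find(|l| !l.trim().is_empty())
        .map(leading)
    else {
        return completion.to_string(); // empty completion: no-op
    };
    // 3. Only act on over-indentation.
    if actual <= expected {
        return completion.to_string();
    }
    let excess = actual - expected;
    // 4. Dedent body lines by `excess`; stop at the first non-empty
    //    0-indent line (top-level post-amble keeps its own scope).
    let mut in_body = true;
    let mut out = Vec::new();
    for line in completion.lines() {
        if in_body && !line.trim().is_empty() && leading(line) == 0 {
            in_body = false; // top-level post-amble begins here
        }
        if in_body && leading(line) >= excess {
            out.push(&line[excess..]);
        } else {
            out.push(line);
        }
    }
    out.join("\n")
}
```

On the SHIP-005 baseline case, `expected` = 4 (the docstring close), `actual` = 5, so every body line is dedented by 1 until the first top-level line.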
noahgift added a commit that referenced this pull request on May 11, 2026
…1 on 10-problem HumanEval sample (PMAT-CODE-SHIP-TWO-SECTION-62)

Records the closure of §61.8 Branch A (the APR + ChatML "\ns\ns" degenerate-output bug) across THREE same-class PRs, plus the LIVE 10-problem HumanEval empirical signal for SHIP-005.

Branch A closure pattern (3 PRs, same defect class, 3 call sites):
- PR #1615 — apr-cli/src/commands/output_verification.rs::golden_output_apr
  Reroute through realizar::run_inference + with_input_tokens.
  Discharge: SHIP-006 LIVE (apr qa 12/12 gates).
- PR #1616 — apr-cli/src/commands/eval/inference.rs::run_humaneval_inference
  Reroute through the same path. The model emits the canonical solution structure but the Python test FAILs on a whitespace artifact.
- PR #1617 — apr-cli/src/commands/eval/inference.rs::align_continuation_indent
  NEW post-processing fn: dedent the over-indented body by N spaces; stop at the first 0-indent non-empty line (preserving the post-amble).
  Discharge: HumanEval/0 1/1 PASS post-fix.

LIVE 10-problem HumanEval sample (2026-05-11, lambda-vector RTX 4090):
- apr eval <canonical 7B APR teacher> --task humaneval --data <10> --samples 1 --temperature 0.0
- Result: passed = 8/10 = 80% pass@1
- Per-problem: HumanEval/0/1/3/4/5/7/8/9 PASS; /2 and /6 FAIL
- 95% binomial CI on 8/10: [44%, 97%] — within statistical noise of the 86% nominal SHIP-005 floor
- Full 164-problem run dispatched in the background (`/tmp/he-164-result.json`, ~5h CPU wall, pre-authorized per feedback_compute_pre_authorized.md 48h ceiling)

Five-Whys for the §62 amendment:
1. Why §62 now rather than waiting for the 164-problem result? The 3-PR closure is a substantial cascade record that deserves spec-level permanence; the 164 result is a separate "ship-%-flip" event that gets its own follow-up amendment when it lands.
2. Why 3 PRs for one bug class? The legacy AprTransformer path was wired into 3 distinct call sites (golden_output, humaneval, indent-residual post-processing). Each needs its own surgical reroute or post-process — fixing one doesn't fix the others.
3. Why is methodology lesson #10 worth recording? Prior methodology lessons (#6-#9) covered single-bug cascades. #10 generalises: a "single bug class" may need multi-PR surgical fixes when it manifests across multiple call sites.
4. Why is the 95% binomial CI enough confidence to dispatch the full 164? The contract floor (84.80% effective) lies well within the sample's [44%, 97%] CI. The full dispatch moves N=10 → N=164, giving a much tighter CI.
5. Why bump the spec v3.07.0 → v3.08.0 now? §62 is a substantive record of the 3-PR cascade closure plus new empirical evidence; it warrants a minor version bump.

Changes (1 spec file + 1 evidence directory):
- docs/specifications/aprender-train/ship-two-models-spec.md:
  - Atomic next-action banner: v3.06.0 → v3.08.0 (skips v3.07.0, which was claimed by PR #1611 in the queue — once that lands, rebase to renumber if needed)
  - New §62 sub-section ABOVE §61 (newest-first ordering), with 7 sub-sub-sections: 62.1 3-PR cascade table, 62.2 10-problem LIVE evidence, 62.3 sample-vs-floor analysis, 62.4 164-run dispatch, 62.5 methodology lesson #10, 62.6 ship-% movement, 62.7 what §62 is NOT
- evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
  - humaneval-10-result.json (raw apr eval --json output)
  - findings.json (structured 3-PR cascade record + per-problem pass results + dispatch metadata)

Validation:
- Section format consistent with §61 (newest-first, dated, sub-sections numbered §62.X)
- All 3 cascade PRs referenced explicitly
- Empirical evidence reproducible via the captured JSON

Spec movement:
- v3.06.0 → v3.08.0
- MODEL-1 ship %: stays at 94% pending 164-run completion
- MODEL-2 ship %: unchanged at 57%

Refs:
- evidence/section-62-branch-a-closure-2026-05-11/findings.json (LIVE evidence)
- PR #1615 (SHIP-006 fix + LIVE discharge — golden_output_apr)
- PR #1616 (HumanEval inference path fix)
- PR #1617 (HumanEval indent residual fix — align_continuation_indent)
- SPEC-SHIP-TWO-001 §61.8 (Branch A vs Branch B taxonomy)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- feedback_compute_pre_authorized.md (lambda-labs 48h ceiling)

Closes task #35 PMAT-CODE-SHIP-TWO-SECTION-62.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
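For reference, the exact (Clopper-Pearson) 95% binomial interval reproduces the quoted [44%, 97%] for 8/10, assuming that is the interval the amendment used:

```latex
% Exact (Clopper-Pearson) 95% CI for k successes in n trials, in terms
% of Beta-distribution quantiles B_q(a, b):
%   CI = [ B_{0.025}(k, n-k+1),  B_{0.975}(k+1, n-k) ]
% For k = 8, n = 10:
\left[\, B_{0.025}(8,\,3),\; B_{0.975}(9,\,2) \,\right]
  \approx [\,0.444,\; 0.975\,]
% i.e. [44%, 97%] after rounding, matching the quoted interval.
```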
noahgift added a commit that referenced this pull request on May 12, 2026
…-CODE-MBPP-DIAG-001) (#1641)

The §69 diagnostic surface (PR #1634) and the §70 RC3 fix (PR #1635) closed the harness-bug class for HumanEval. MBPP's path (run_mbpp_inference + run_mbpp_inference_cuda) was not yet instrumented. This PR extends APR_EVAL_DEBUG to MBPP so future investigation of MBPP failures has ground-truth diagnostics on the same surface.

What changes:
- run_mbpp_inference (CPU path) now calls execute_python_test_with_diagnostics and emits /tmp/apr_eval_debug_MBPP_<task>.json when APR_EVAL_DEBUG=1 is set.
- run_mbpp_inference_cuda (CUDA path) gets the same treatment.

What does NOT change:
- run_mbpp_inference still uses the legacy AprTransformer::forward_with_cache + AprKVCache path. PMAT-CODE-SHIP-005-FIX (PR #1616) replaced this for HumanEval with realizar::run_inference + OwnedQuantizedModel::from_apr. MBPP needs the same routing fix — but that is a separate multi-PR cascade scope (it also includes the H4 ChatML wrap and the R1+R2 extraction equivalents for MBPP). Out of scope for this PR.
- MBPP prompts are natural language (not Python signatures), so the §70 RC3 import-stripping bug does NOT apply to MBPP.

Why ship this now:
- Pure diagnostic — zero behaviour change for non-APR_EVAL_DEBUG callers
- Lets us run a 1-problem MBPP smoke under APR_EVAL_DEBUG=1 to verify the legacy path's failure mode (currently undiagnosed)
- Mirrors the pattern that successfully diagnosed §69 RC3 in 5 minutes on gx10

Test plan:
- [x] cargo check -p apr-cli --features inference → clean
- [x] cargo check -p apr-cli --features "inference,cuda,training" → clean
- [x] cargo fmt --all → clean
- [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice; will document the MBPP failure mode in a §72-class amendment)

Refs:
- crates/apr-cli/src/commands/eval/inference.rs::write_apr_eval_debug
- contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0
- PR #1634 (HumanEval diagnostic surface)
- PR #1635 (HumanEval RC3 fix; cascade base for this branch)

Closes task #53 (MBPP harness diagnostic extension; renamed from "RC3 prompt-preamble fix" since RC3 does not apply to MBPP's NL prompts — that decision is recorded in the commit body).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
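A minimal sketch of the opt-in diagnostic emission described under "What changes". The payload struct, function name, and JSON shape here are assumptions (the real shape is owned by `write_apr_eval_debug`); serde and serde_json are assumed available.

```rust
use std::{env, fs, io, path::PathBuf};

// Hypothetical diagnostic payload — the real JSON shape is defined by
// write_apr_eval_debug in inference.rs.
#[derive(serde::Serialize)]
struct EvalDebug<'a> {
    task_id: &'a str, // e.g. "11" for MBPP/11
    prompt: &'a str,
    completion: &'a str,
    test_output: &'a str,
}

// Emit /tmp/apr_eval_debug_MBPP_<task>.json only when the caller opts
// in via APR_EVAL_DEBUG=1 — zero behaviour change for everyone else.
fn maybe_write_mbpp_debug(d: &EvalDebug) -> io::Result<()> {
    let enabled = env::var("APR_EVAL_DEBUG").map_or(false, |v| v == "1");
    if !enabled {
        return Ok(()); // debug surface disabled
    }
    let path = PathBuf::from(format!(
        "/tmp/apr_eval_debug_MBPP_{}.json",
        d.task_id.replace('/', "_") // defensive path sanitization
    ));
    let json = serde_json::to_string_pretty(d)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    fs::write(path, json)
}
```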
noahgift added a commit that referenced this pull request on May 12, 2026
…ode-block extraction (PMAT-CODE-MBPP-H4-FIX) (#1645)

Mirrors the §70 HumanEval H4 + R1+R2 cascade (PRs #1616, #1628, squashed via #1634/#1635) for MBPP. The legacy `AprTransformer::forward_with_cache + AprKVCache` path was producing NL-prose continuations on MBPP prompts (see the PR #1641 MBPP/11 smoke: SyntaxError on "Example:" prose, 0/1 pass).

Changes:
- Replace the `AprTransformer::forward_with_cache + AprKVCache` loop with `realizar::run_inference + InferenceConfig::with_prompt` (ChatML auto-wrap for instruct models).
- Parse `\`\`\`python ... \`\`\`` markdown blocks from the response via `extract_python_code_block_targeted(&result.text, None)`. MBPP has no `entry_point` in the problem schema; the first-non-empty-block fallback is appropriate.
- Raw-continuation fallback preserved: strip the prompt prefix, truncate at the next top-level def — used when no markdown block is found.

Out of scope (vs the HumanEval cascade):
- §70 RC3 prompt-preamble handling: MBPP prompts are NL ("Write a python function to..."), with no Python imports to preserve. `extract_prompt_preamble` is not applicable.
- §17.5 chain impact: MBPP is not in §17.5; this PR does not move ship %.
- Full 500-problem rerun: dispatched as a separate evidence slice.

Test plan:
- [x] cargo check -p apr-cli --features inference → clean
- [x] cargo fmt --all → clean
- [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice)
- [ ] gx10 sanitized-subset MBPP rerun for pass@1 measurement

Refs:
- crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference (mirror)
- PR #1641 (MBPP diagnostic surface, cascade base)
- evidence/section-71-ship-005-discharged-2026-05-12/ (HumanEval cascade pattern)
- project_2026_05_12_mbpp_legacy_path_finding.md (cascade scope)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
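A sketch of the extraction order described above — first non-empty fenced python block, else the raw-continuation fallback. The shipped `extract_python_code_block_targeted` is more general (entry-point targeting, etc.); this stand-alone `extract_completion` is a hypothetical illustration, with the top-level-def truncation elided.

````rust
// Prefer the first non-empty fenced ```python block in the model's
// response; fall back to treating the response as a raw continuation
// (strip the echoed prompt). Hypothetical stand-alone version.
fn extract_completion(prompt: &str, response: &str) -> String {
    let mut rest = response;
    while let Some(open) = rest.find("```python") {
        let body = &rest[open + "```python".len()..];
        match body.find("```") {
            Some(close) => {
                let block = body[..close].trim_matches('\n');
                if !block.trim().is_empty() {
                    return block.to_string(); // first non-empty block wins
                }
                rest = &body[close + 3..]; // skip empty block, keep scanning
            }
            // Unterminated fence: give up on markdown, use the fallback.
            None => break,
        }
    }
    // Raw-continuation fallback (truncation at the next top-level `def`
    // elided here).
    response.strip_prefix(prompt).unwrap_or(response).to_string()
}
````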
Summary
Continuation of the §61.8 Branch A bug-class fix from PR #1615 (SHIP-006). The HumanEval evaluation harness `run_humaneval_inference` was using the same broken legacy `AprTransformer + forward_with_cache + AprKVCache` path. Reroute through `realizar::run_inference + InferenceConfig::with_input_tokens` (the working path used by the SHIP-002/006/008 LIVE-discharges).

Why Not Claim SHIP-005 LIVE-Discharge?
Smoke test on `HumanEval/0` shows:
- the canonical, semantically correct completion (`for i in range(len(numbers)): for j in range(i+1, len(numbers)): if abs(numbers[i]-numbers[j]) < threshold: return True; return False`), BUT the test still fails due to a leading-whitespace alignment artifact: `for i...` at 5-space indent instead of 4-space
- the prompt ends with `"""\n` → 4-space indent for the docstring close + 5-space for-loop body → Python IndentationError

This is a separate residual issue in raw-continuation tokenization at the prompt-completion boundary. Manual `apr run` on the same model with auto-wrap produces clean 4-space output, so the model itself is healthy.

Fix
`crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference`:
- Replace `load_humaneval_model` + `forward_with_cache` + `AprKVCache` + manual sampling with `realizar::run_inference` per problem
- Use `InferenceConfig::with_input_tokens` to pass pre-tokenized raw Python (bypasses ChatML auto-wrap — HumanEval is raw-continuation, not chat)

Validation
- `cargo build -p apr-cli --release --features cuda` — clean
- Same fix pattern as `golden_output_apr` (SHIP-006)

Residual (Follow-up PR)
Leading-whitespace alignment in raw-continuation HumanEval outputs. Three possible fixes:
1. Post-process the completion to normalize indentation
2. Prompt engineering to nudge the model toward 4-space indent
3. Investigate the tokenizer's space-prefix behavior at the prompt-completion boundary
Ship-% Movement

- MODEL-1: unchanged at 94% (infrastructure fix; SHIP-005 LIVE discharge deferred pending the whitespace residual)
- MODEL-2: unchanged at 57%
🤖 Generated with Claude Code