fix(apr-cli): route MBPP through realizar::run_inference + ChatML + code extraction by noahgift · Pull Request #1645 · paiml/aprender

noahgift · 2026-05-12T15:43:11Z

Summary

Mirrors the §70 HumanEval cascade (PRs #1616 / #1628 / #1635) for MBPP. The legacy AprTransformer::forward_with_cache + AprKVCache path was producing NL-prose continuations on MBPP prompts (PR #1641 MBPP/11 smoke: `SyntaxError` on "Example:" prose, 0/1 pass).

Changes

Replace legacy loop with `realizar::run_inference + InferenceConfig::with_prompt` (ChatML auto-wrap for instruct models)
Parse ```python ... ``` markdown blocks via `extract_python_code_block_targeted(&result.text, None)` — MBPP has no `entry_point`, first-non-empty-block fallback
Raw-continuation fallback preserved when no markdown block found

Out of scope

§70 RC3 prompt-preamble (MBPP prompts are NL, no imports to preserve)
§17.5 chain impact (MBPP is not a §17.5 row; ship % unchanged)
Full 500-problem rerun (dispatch as separate evidence slice)

Test plan

`cargo check -p apr-cli --features inference` → clean
`cargo fmt --all` → clean
gx10 MBPP/11 `APR_EVAL_DEBUG=1` smoke (expect pass after fix)
gx10 sanitized-subset MBPP rerun

Refs

HumanEval cascade pattern: PRs fix(apr-cli): route HumanEval inference through run_inference (Branch A continuation) #1616 / fix(apr-cli): route HumanEval through ChatML for instruct models — H4 SHIP-005 fix #1628 / fix(apr-cli): §69 RC3 CONFIRMED on gx10 — prepend prompt preamble to HumanEval full_program #1635
MBPP diagnostic surface: PR feat(apr-cli): extend APR_EVAL_DEBUG diagnostic to MBPP harness #1641
Finding: `project_2026_05_12_mbpp_legacy_path_finding.md`

🤖 Generated with Claude Code

…ode-block extraction (PMAT-CODE-MBPP-H4-FIX) Mirrors the §70 HumanEval H4 + R1+R2 cascade (PRs #1616, #1628 squashed via #1634/#1635) for MBPP. The legacy `AprTransformer::forward_with_cache + AprKVCache` path was producing NL-prose continuations on MBPP prompts (see PR #1641 MBPP/11 smoke: SyntaxError on "Example:" prose, 0/1 pass). Changes: - Replace `AprTransformer::forward_with_cache + AprKVCache` loop with `realizar::run_inference + InferenceConfig::with_prompt` (ChatML auto-wrap for instruct models). - Parse `\`\`\`python ... \`\`\`` markdown blocks from the response via `extract_python_code_block_targeted(&result.text, None)`. MBPP has no `entry_point` in the problem schema; first-non-empty-block fallback is appropriate. - Raw-continuation fallback preserved: strip prompt prefix, truncate at next top-level def — used when no markdown block found. Out of scope (vs HumanEval cascade): - §70 RC3 prompt-preamble handling: MBPP prompts are NL ("Write a python function to..."), no Python imports to preserve. `extract_prompt_preamble` not applicable. - §17.5 chain impact: MBPP is not in §17.5; this PR does not move ship %. - Full 500-problem rerun: dispatch as a separate evidence slice. Test plan: - [x] cargo check -p apr-cli --features inference → clean - [x] cargo fmt --all → clean - [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice) - [ ] gx10 sanitized-subset MBPP rerun for pass@1 measurement Refs: - crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference (mirror) - PR #1641 (MBPP diagnostic surface, cascade base) - evidence/section-71-ship-005-discharged-2026-05-12/ (HumanEval cascade pattern) - project_2026_05_12_mbpp_legacy_path_finding.md (cascade scope) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-12T16:09:14Z

Empirical evidence (gx10, 2026-05-12)

Pre-fix (legacy AprTransformer path): 0/1 pass@1 on MBPP/11 — SyntaxError on "Example:" prose (PR #1641 evidence).

Post-fix (H4 + ChatML + test-hint, this PR): 4/5 pass@1 on MBPP/11-15 smoke:

Task	Pre-fix	Post-fix
MBPP/11	FAIL (SyntaxError)	PASS (exit_code=0)
MBPP/12	not tested	PASS
MBPP/13	not tested	FAIL (assertion mismatch — model-quality, NOT harness)
MBPP/14	not tested	PASS
MBPP/15	not tested	PASS

Effective rate: 80% pass@1 on 5-problem smoke. Mirrors the HumanEval H4 80.49% jump from §67. Diagnostic JSONs show success: true for 4/5 and a clean Python assertion failure for MBPP/13 (no harness false-negative).

Confirmed: this PR fixes the harness layer of MBPP. Remaining failures are model-quality (similar to HumanEval at H4 stage).

…T-CODE-V0-33-0-RELEASE-PREP) 🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001. All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical 7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090, --features cuda). This release prep PR ships: 1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights: - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE) - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75) — 124.6 tok/s - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71) — pass@1 86.59% - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634) - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649) - Added: MBPP harness H4 fix (PR #1645) - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness- invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0) - Methodology lessons #16-22 captured in MEMORY.md - Spec: v3.13.0 → v3.21.0 across §67-§75 2. Workspace version bump: - [workspace.package].version: 0.32.0 → 0.33.0 - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0 - 28 sub-crate version literals: 0.32.0 → 0.33.0 3. `cargo check -p aprender` → clean (workspace builds at 0.33.0). Out of scope for this PR (separate steps after #1651/1652 land + this PR lands): - Tag release `v0.33.0` on main - Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md — 15 user-facing crates + 7 internal-tier in topological dependency order; uses `make publish CRATE=<name>`) - Post-publish QA per `feedback_post_publish_qa_required.md` — `cargo install aprender --force` + `/dogfood` GO verdict required before declaring release done (v0.31.1 was yanked for skipping this) - GitHub Release with §75 narrative - HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256 already verified by §72 SHIP-010 LIVE evidence; double-check before release announcement) This PR ships ONLY the version-bump + CHANGELOG. Publishing is the next step after merge. Refs: - §75 MODEL-1 100% (PR #1652) - §74 SHIP-007 bug localized (PR #1650) - §73 SHIP-007 cascade reduction (PR #1647) - §72 5-AC LIVE cascade (PR #1646) - §71 SHIP-005 LIVE-DISCHARGED (PR #1642) - §70 RC3 fix (PR #1636) - §69 Q4K hypothesis falsified (PR #1633) - PR #1635 RC3 prepend - PR #1634 diagnostic surface + contract - PR #1648 SHIP-007 contract scaffold - PR #1649 SHIP-007 PR-B stage dump - PR #1651 SHIP-007 PR-E F32 GEMV layout fix Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…T-CODE-V0-33-0-RELEASE-PREP) (#1653) 🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001. All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical 7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090, --features cuda). This release prep PR ships: 1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights: - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE) - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75) — 124.6 tok/s - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71) — pass@1 86.59% - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634) - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649) - Added: MBPP harness H4 fix (PR #1645) - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness- invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0) - Methodology lessons #16-22 captured in MEMORY.md - Spec: v3.13.0 → v3.21.0 across §67-§75 2. Workspace version bump: - [workspace.package].version: 0.32.0 → 0.33.0 - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0 - 28 sub-crate version literals: 0.32.0 → 0.33.0 3. `cargo check -p aprender` → clean (workspace builds at 0.33.0). Out of scope for this PR (separate steps after #1651/1652 land + this PR lands): - Tag release `v0.33.0` on main - Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md — 15 user-facing crates + 7 internal-tier in topological dependency order; uses `make publish CRATE=<name>`) - Post-publish QA per `feedback_post_publish_qa_required.md` — `cargo install aprender --force` + `/dogfood` GO verdict required before declaring release done (v0.31.1 was yanked for skipping this) - GitHub Release with §75 narrative - HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256 already verified by §72 SHIP-010 LIVE evidence; double-check before release announcement) This PR ships ONLY the version-bump + CHANGELOG. Publishing is the next step after merge. Refs: - §75 MODEL-1 100% (PR #1652) - §74 SHIP-007 bug localized (PR #1650) - §73 SHIP-007 cascade reduction (PR #1647) - §72 5-AC LIVE cascade (PR #1646) - §71 SHIP-005 LIVE-DISCHARGED (PR #1642) - §70 RC3 fix (PR #1636) - §69 Q4K hypothesis falsified (PR #1633) - PR #1635 RC3 prepend - PR #1634 diagnostic surface + contract - PR #1648 SHIP-007 contract scaffold - PR #1649 SHIP-007 PR-B stage dump - PR #1651 SHIP-007 PR-E F32 GEMV layout fix Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 12, 2026 15:43

noahgift force-pushed the fix/apr-eval-mbpp-h4-chatml branch from b3876b7 to 1163d0a Compare May 12, 2026 15:50

noahgift merged commit e5778cc into main May 12, 2026
10 checks passed

noahgift deleted the fix/apr-eval-mbpp-h4-chatml branch May 12, 2026 17:46

noahgift mentioned this pull request May 13, 2026

🎉 chore: v0.33.0 release prep — CHANGELOG + workspace version bump (MODEL-1 100%) #1653

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(apr-cli): route MBPP through realizar::run_inference + ChatML + code extraction#1645

fix(apr-cli): route MBPP through realizar::run_inference + ChatML + code extraction#1645
noahgift merged 1 commit into
mainfrom
fix/apr-eval-mbpp-h4-chatml

noahgift commented May 12, 2026

Uh oh!

noahgift commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 12, 2026

Summary

Changes

Out of scope

Test plan

Refs

Uh oh!

noahgift commented May 12, 2026

Empirical evidence (gx10, 2026-05-12)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant