feat(apr-cli): extend APR_EVAL_DEBUG diagnostic to MBPP harness by noahgift · Pull Request #1641 · paiml/aprender

noahgift · 2026-05-12T12:28:32Z

Summary

The §69 diagnostic surface (PR #1634) and §70 RC3 fix (PR #1635) closed the harness-bug class for HumanEval. MBPP's path (run_mbpp_inference + run_mbpp_inference_cuda) was not yet instrumented. This PR extends APR_EVAL_DEBUG to MBPP so future investigation has ground-truth diagnostics on the same surface.

What changes

run_mbpp_inference (CPU path) now calls execute_python_test_with_diagnostics and emits /tmp/apr_eval_debug_MBPP_<task>.json when APR_EVAL_DEBUG=1 is set.
run_mbpp_inference_cuda (CUDA path) gets the same treatment.

What does NOT change

run_mbpp_inference still uses the legacy AprTransformer::forward_with_cache + AprKVCache path. PMAT-CODE-SHIP-005-FIX (PR fix(apr-cli): route HumanEval inference through run_inference (Branch A continuation) #1616) replaced this for HumanEval with realizar::run_inference + OwnedQuantizedModel::from_apr. MBPP needs the same routing fix — but that's a separate multi-PR cascade scope. Out of scope.
MBPP prompts are natural language (not Python signatures), so the §70 RC3 import-stripping bug does NOT apply to MBPP.

Why ship this now

Pure diagnostic — zero behaviour change for non-APR_EVAL_DEBUG callers
Lets us run a 1-problem MBPP smoke under APR_EVAL_DEBUG=1 to verify the legacy path's failure mode (currently undiagnosed)
Mirrors the pattern that diagnosed §69 RC3 in 5 minutes on gx10

Test plan

cargo check -p apr-cli --features inference → clean
cargo check -p apr-cli --features "inference,cuda,training" → clean
cargo fmt --all → clean
gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice)

Refs

crates/apr-cli/src/commands/eval/inference.rs::write_apr_eval_debug
contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0
PR feat(apr-cli)+contracts: §69 harness diagnostic surface + invariant contract #1634 (HumanEval diagnostic surface, cascade base)
PR fix(apr-cli): §69 RC3 CONFIRMED on gx10 — prepend prompt preamble to HumanEval full_program #1635 (HumanEval RC3 fix)

🤖 Generated with Claude Code

…-CODE-MBPP-DIAG-001) The §69 diagnostic surface (PR #1634) and §70 RC3 fix (PR #1635) closed the harness-bug class for HumanEval. MBPP's path (run_mbpp_inference + run_mbpp_inference_cuda) was not yet instrumented. This PR extends APR_EVAL_DEBUG to MBPP so future investigation of MBPP failures has ground-truth diagnostics on the same surface. What changes: - run_mbpp_inference (CPU path) now calls execute_python_test_with_diagnostics and emits /tmp/apr_eval_debug_MBPP_<task>.json when APR_EVAL_DEBUG=1 is set. - run_mbpp_inference_cuda (CUDA path) gets the same treatment. What does NOT change: - run_mbpp_inference still uses the legacy AprTransformer::forward_with_cache + AprKVCache path. PMAT-CODE- SHIP-005-FIX (PR #1616) replaced this for HumanEval with realizar:: run_inference + OwnedQuantizedModel::from_apr. MBPP needs the same routing fix — but that's a separate multi-PR cascade scope (also includes H4 ChatML wrap + R1+R2 extraction equivalents for MBPP). Out of scope for this PR. - MBPP prompts are natural language (not Python signatures), so the §70 RC3 import-stripping bug does NOT apply to MBPP. Why ship this now: - Pure diagnostic — zero behaviour change for non-APR_EVAL_DEBUG callers - Lets us run a 1-problem MBPP smoke under APR_EVAL_DEBUG=1 to verify the legacy path's failure mode (currently undiagnosed) - Mirrors the pattern that successfully diagnosed §69 RC3 in 5 minutes on gx10 Test plan: - [x] cargo check -p apr-cli --features inference → clean - [x] cargo check -p apr-cli --features "inference,cuda,training" → clean - [x] cargo fmt --all → clean - [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice; will document MBPP failure mode in a §72-class amendment) Refs: - crates/apr-cli/src/commands/eval/inference.rs::write_apr_eval_debug - contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 - PR #1634 (HumanEval diagnostic surface) - PR #1635 (HumanEval RC3 fix; cascade base for this branch) Closes task #53 (MBPP harness diagnostic extension; renamed from "RC3 prompt-preamble fix" since RC3 does not apply to MBPP's NL prompts — that decision recorded in commit body). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ode-block extraction (PMAT-CODE-MBPP-H4-FIX) Mirrors the §70 HumanEval H4 + R1+R2 cascade (PRs #1616, #1628 squashed via #1634/#1635) for MBPP. The legacy `AprTransformer::forward_with_cache + AprKVCache` path was producing NL-prose continuations on MBPP prompts (see PR #1641 MBPP/11 smoke: SyntaxError on "Example:" prose, 0/1 pass). Changes: - Replace `AprTransformer::forward_with_cache + AprKVCache` loop with `realizar::run_inference + InferenceConfig::with_prompt` (ChatML auto-wrap for instruct models). - Parse `\`\`\`python ... \`\`\`` markdown blocks from the response via `extract_python_code_block_targeted(&result.text, None)`. MBPP has no `entry_point` in the problem schema; first-non-empty-block fallback is appropriate. - Raw-continuation fallback preserved: strip prompt prefix, truncate at next top-level def — used when no markdown block found. Out of scope (vs HumanEval cascade): - §70 RC3 prompt-preamble handling: MBPP prompts are NL ("Write a python function to..."), no Python imports to preserve. `extract_prompt_preamble` not applicable. - §17.5 chain impact: MBPP is not in §17.5; this PR does not move ship %. - Full 500-problem rerun: dispatch as a separate evidence slice. Test plan: - [x] cargo check -p apr-cli --features inference → clean - [x] cargo fmt --all → clean - [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice) - [ ] gx10 sanitized-subset MBPP rerun for pass@1 measurement Refs: - crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference (mirror) - PR #1641 (MBPP diagnostic surface, cascade base) - evidence/section-71-ship-005-discharged-2026-05-12/ (HumanEval cascade pattern) - project_2026_05_12_mbpp_legacy_path_finding.md (cascade scope) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ode-block extraction (PMAT-CODE-MBPP-H4-FIX) (#1645) Mirrors the §70 HumanEval H4 + R1+R2 cascade (PRs #1616, #1628 squashed via #1634/#1635) for MBPP. The legacy `AprTransformer::forward_with_cache + AprKVCache` path was producing NL-prose continuations on MBPP prompts (see PR #1641 MBPP/11 smoke: SyntaxError on "Example:" prose, 0/1 pass). Changes: - Replace `AprTransformer::forward_with_cache + AprKVCache` loop with `realizar::run_inference + InferenceConfig::with_prompt` (ChatML auto-wrap for instruct models). - Parse `\`\`\`python ... \`\`\`` markdown blocks from the response via `extract_python_code_block_targeted(&result.text, None)`. MBPP has no `entry_point` in the problem schema; first-non-empty-block fallback is appropriate. - Raw-continuation fallback preserved: strip prompt prefix, truncate at next top-level def — used when no markdown block found. Out of scope (vs HumanEval cascade): - §70 RC3 prompt-preamble handling: MBPP prompts are NL ("Write a python function to..."), no Python imports to preserve. `extract_prompt_preamble` not applicable. - §17.5 chain impact: MBPP is not in §17.5; this PR does not move ship %. - Full 500-problem rerun: dispatch as a separate evidence slice. Test plan: - [x] cargo check -p apr-cli --features inference → clean - [x] cargo fmt --all → clean - [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice) - [ ] gx10 sanitized-subset MBPP rerun for pass@1 measurement Refs: - crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference (mirror) - PR #1641 (MBPP diagnostic surface, cascade base) - evidence/section-71-ship-005-discharged-2026-05-12/ (HumanEval cascade pattern) - project_2026_05_12_mbpp_legacy_path_finding.md (cascade scope) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 12, 2026 12:28

noahgift force-pushed the fix/apr-eval-mbpp-rc3-prompt-preamble branch from 76f3156 to 1f77bb6 Compare May 12, 2026 13:26

noahgift force-pushed the fix/apr-eval-mbpp-rc3-prompt-preamble branch from 619a484 to 48976fa Compare May 12, 2026 13:50

Merge branch 'main' into fix/apr-eval-mbpp-rc3-prompt-preamble

7135a28

noahgift merged commit afc9c9f into main May 12, 2026
10 checks passed

noahgift deleted the fix/apr-eval-mbpp-rc3-prompt-preamble branch May 12, 2026 14:50

noahgift mentioned this pull request May 12, 2026

fix(apr-cli): route MBPP through realizar::run_inference + ChatML + code extraction #1645

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(apr-cli): extend APR_EVAL_DEBUG diagnostic to MBPP harness#1641

feat(apr-cli): extend APR_EVAL_DEBUG diagnostic to MBPP harness#1641
noahgift merged 2 commits into
mainfrom
fix/apr-eval-mbpp-rc3-prompt-preamble

noahgift commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 12, 2026

Summary

What changes

What does NOT change

Why ship this now

Test plan

Refs

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant