feat(apr-cli): extend APR_EVAL_DEBUG diagnostic to MBPP harness#1641

Merged
noahgift merged 2 commits into main from fix/apr-eval-mbpp-rc3-prompt-preamble on May 12, 2026

@noahgift

Summary

The §69 diagnostic surface (PR #1634) and §70 RC3 fix (PR #1635) closed the harness-bug class for HumanEval. MBPP's path (run_mbpp_inference + run_mbpp_inference_cuda) was not yet instrumented. This PR extends APR_EVAL_DEBUG to MBPP so future investigation has ground-truth diagnostics on the same surface.

What changes

  • run_mbpp_inference (CPU path) now calls execute_python_test_with_diagnostics and emits /tmp/apr_eval_debug_MBPP_<task>.json when APR_EVAL_DEBUG=1 is set.
  • run_mbpp_inference_cuda (CUDA path) gets the same treatment.
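
The gating and file naming described above can be sketched as follows. This is a minimal illustration, not the harness code: `debug_enabled`, `debug_dump_path`, and `dump_target` are hypothetical names, the exact-match check on `"1"` is an assumption, and `<task>` is assumed to be the numeric MBPP task id.

```rust
use std::path::PathBuf;

/// Returns true only when the variable's value is exactly "1".
/// Assumption: the real harness may accept other values.
fn debug_enabled(value: Option<&str>) -> bool {
    matches!(value, Some("1"))
}

/// Builds a path following the /tmp/apr_eval_debug_MBPP_<task>.json pattern.
/// Assumption: <task> is the numeric MBPP task id.
fn debug_dump_path(task_id: u32) -> PathBuf {
    PathBuf::from(format!("/tmp/apr_eval_debug_MBPP_{task_id}.json"))
}

/// Returns the file to write, or None when diagnostics are off,
/// so callers without APR_EVAL_DEBUG see no change in behaviour.
fn dump_target(env_value: Option<&str>, task_id: u32) -> Option<PathBuf> {
    debug_enabled(env_value).then(|| debug_dump_path(task_id))
}
```

A caller would pass `std::env::var("APR_EVAL_DEBUG").ok().as_deref()` as `env_value`, which keeps the decision logic testable without touching the process environment.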

What does NOT change

  • run_mbpp_inference still uses the legacy AprTransformer::forward_with_cache + AprKVCache path. PMAT-CODE-SHIP-005-FIX (PR #1616, "fix(apr-cli): route HumanEval inference through run_inference (Branch A continuation)") replaced this for HumanEval with realizar::run_inference + OwnedQuantizedModel::from_apr. MBPP needs the same routing fix, but that is a separate multi-PR cascade and is out of scope for this PR.
  • MBPP prompts are natural language (not Python signatures), so the §70 RC3 import-stripping bug does NOT apply to MBPP.
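
To illustrate why the import-stripping concern does not carry over, here is a rough sketch of a preamble scan. `import_preamble` is a hypothetical helper written for this example and is not the harness's actual preamble logic; both prompts in the usage note are illustrative.

```rust
/// Hypothetical helper: collects top-level import lines that appear
/// before the first top-level `def` in a prompt.
fn import_preamble(prompt: &str) -> Vec<&str> {
    prompt
        .lines()
        // Stop scanning once the function signature starts.
        .take_while(|line| !line.starts_with("def "))
        // Keep only lines that look like Python imports.
        .filter(|line| line.starts_with("import ") || line.starts_with("from "))
        .collect()
}
```

A HumanEval-style prompt such as `"from typing import List\n\ndef f(xs: List[int]) -> int:\n"` yields one preamble line that must survive extraction, while a natural-language prompt such as `"Write a python function to reverse a string."` yields none, so there is nothing to preserve.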

Why ship this now

  • Pure diagnostic — zero behaviour change for non-APR_EVAL_DEBUG callers
  • Lets us run a 1-problem MBPP smoke under APR_EVAL_DEBUG=1 to verify the legacy path's failure mode (currently undiagnosed)
  • Mirrors the pattern that diagnosed §69 RC3 in 5 minutes on gx10

Test plan

  • cargo check -p apr-cli --features inference → clean
  • cargo check -p apr-cli --features "inference,cuda,training" → clean
  • cargo fmt --all → clean
  • gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice)

Refs

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 12, 2026 12:28
@noahgift noahgift force-pushed the fix/apr-eval-mbpp-rc3-prompt-preamble branch from 76f3156 to 1f77bb6 Compare May 12, 2026 13:26
…-CODE-MBPP-DIAG-001)

The §69 diagnostic surface (PR #1634) and §70 RC3 fix (PR #1635) closed
the harness-bug class for HumanEval. MBPP's path (run_mbpp_inference
+ run_mbpp_inference_cuda) was not yet instrumented. This PR extends
APR_EVAL_DEBUG to MBPP so future investigation of MBPP failures has
ground-truth diagnostics on the same surface.

What changes:

- run_mbpp_inference (CPU path) now calls
  execute_python_test_with_diagnostics and emits
  /tmp/apr_eval_debug_MBPP_<task>.json when APR_EVAL_DEBUG=1 is set.
- run_mbpp_inference_cuda (CUDA path) gets the same treatment.

What does NOT change:

- run_mbpp_inference still uses the legacy
  AprTransformer::forward_with_cache + AprKVCache path.
  PMAT-CODE-SHIP-005-FIX (PR #1616) replaced this for HumanEval with
  realizar::run_inference + OwnedQuantizedModel::from_apr. MBPP needs
  the same routing fix, but that's a separate multi-PR cascade scope
  (also includes H4 ChatML wrap + R1+R2 extraction equivalents for
  MBPP). Out of scope for this PR.
- MBPP prompts are natural language (not Python signatures), so the
  §70 RC3 import-stripping bug does NOT apply to MBPP.

Why ship this now:

- Pure diagnostic — zero behaviour change for non-APR_EVAL_DEBUG callers
- Lets us run a 1-problem MBPP smoke under APR_EVAL_DEBUG=1 to verify
  the legacy path's failure mode (currently undiagnosed)
- Mirrors the pattern that successfully diagnosed §69 RC3 in 5 minutes
  on gx10

Test plan:

- [x] cargo check -p apr-cli --features inference → clean
- [x] cargo check -p apr-cli --features "inference,cuda,training" → clean
- [x] cargo fmt --all → clean
- [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice;
      will document MBPP failure mode in a §72-class amendment)

Refs:
- crates/apr-cli/src/commands/eval/inference.rs::write_apr_eval_debug
- contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0
- PR #1634 (HumanEval diagnostic surface)
- PR #1635 (HumanEval RC3 fix; cascade base for this branch)

Closes task #53 (MBPP harness diagnostic extension; renamed from
"RC3 prompt-preamble fix" since RC3 does not apply to MBPP's NL
prompts — that decision recorded in commit body).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the fix/apr-eval-mbpp-rc3-prompt-preamble branch from 619a484 to 48976fa Compare May 12, 2026 13:50
@noahgift noahgift merged commit afc9c9f into main May 12, 2026
10 checks passed
@noahgift noahgift deleted the fix/apr-eval-mbpp-rc3-prompt-preamble branch May 12, 2026 14:50
noahgift added a commit that referenced this pull request May 12, 2026
…ode-block extraction (PMAT-CODE-MBPP-H4-FIX)

Mirrors the §70 HumanEval H4 + R1+R2 cascade (PRs #1616, #1628 squashed
via #1634/#1635) for MBPP. The legacy `AprTransformer::forward_with_cache
+ AprKVCache` path was producing NL-prose continuations on MBPP prompts
(see PR #1641 MBPP/11 smoke: SyntaxError on "Example:" prose, 0/1 pass).

Changes:

- Replace `AprTransformer::forward_with_cache + AprKVCache` loop with
  `realizar::run_inference + InferenceConfig::with_prompt` (ChatML
  auto-wrap for instruct models).
- Parse `\`\`\`python ... \`\`\`` markdown blocks from the response via
  `extract_python_code_block_targeted(&result.text, None)`. MBPP has no
  `entry_point` in the problem schema; first-non-empty-block fallback is
  appropriate.
- Raw-continuation fallback preserved: strip prompt prefix, truncate at
  next top-level def — used when no markdown block found.
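
The two extraction steps above can be sketched roughly as below. This is an assumption-laden illustration, not the body of `extract_python_code_block_targeted`: the function names are hypothetical, and the fallback assumes "next top-level def" means the second `def` in the continuation, since an MBPP completion has to contain the function definition itself.

```rust
/// Returns the body of the first non-empty triple-backtick python block, if any.
fn first_python_block(response: &str) -> Option<String> {
    let fence = "`".repeat(3);
    let open = format!("{fence}python");
    let mut rest = response;
    while let Some(start) = rest.find(open.as_str()) {
        // Skip the opening marker and the remainder of its line.
        let after_open = &rest[start + open.len()..];
        let body_start = after_open.find('\n')? + 1;
        let body = &after_open[body_start..];
        // An unterminated block is treated as "no block found".
        let end = body.find(fence.as_str())?;
        let code = body[..end].trim_end();
        if !code.trim().is_empty() {
            return Some(code.to_string());
        }
        // Empty block: keep scanning after its closing fence.
        rest = &body[end + fence.len()..];
    }
    None
}

/// Raw-continuation fallback: strip the echoed prompt, keep the first
/// top-level `def`, and cut at the next one (assumed interpretation).
fn raw_continuation(prompt: &str, response: &str) -> String {
    let text = response.strip_prefix(prompt).unwrap_or(response);
    let mut kept = Vec::new();
    let mut seen_def = false;
    for line in text.lines() {
        if line.starts_with("def ") {
            if seen_def {
                break;
            }
            seen_def = true;
        }
        kept.push(line);
    }
    kept.join("\n").trim().to_string()
}

/// Prefers a fenced block and falls back to the raw continuation.
fn extract_code(prompt: &str, response: &str) -> String {
    first_python_block(response).unwrap_or_else(|| raw_continuation(prompt, response))
}
```

Taking the first non-empty block matches the point made above that MBPP has no `entry_point` to target; a HumanEval-style variant would instead look for the block that defines the expected function name.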

Out of scope (vs HumanEval cascade):

- §70 RC3 prompt-preamble handling: MBPP prompts are NL ("Write a python
  function to..."), no Python imports to preserve. `extract_prompt_preamble`
  not applicable.
- §17.5 chain impact: MBPP is not in §17.5; this PR does not move ship %.
- Full 500-problem rerun: dispatch as a separate evidence slice.

Test plan:
- [x] cargo check -p apr-cli --features inference → clean
- [x] cargo fmt --all → clean
- [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice)
- [ ] gx10 sanitized-subset MBPP rerun for pass@1 measurement

Refs:
- crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference (mirror)
- PR #1641 (MBPP diagnostic surface, cascade base)
- evidence/section-71-ship-005-discharged-2026-05-12/ (HumanEval cascade pattern)
- project_2026_05_12_mbpp_legacy_path_finding.md (cascade scope)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 12, 2026
…ode-block extraction (PMAT-CODE-MBPP-H4-FIX) (#1645)