Skip to content

fix(apr-cli): route MBPP through realizar::run_inference + ChatML + code extraction#1645

Merged
noahgift merged 1 commit into
mainfrom
fix/apr-eval-mbpp-h4-chatml
May 12, 2026
Merged

fix(apr-cli): route MBPP through realizar::run_inference + ChatML + code extraction#1645
noahgift merged 1 commit into
mainfrom
fix/apr-eval-mbpp-h4-chatml

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

Summary

Mirrors the §70 HumanEval cascade (PRs #1616 / #1628 / #1635) for MBPP. The legacy AprTransformer::forward_with_cache + AprKVCache path was producing NL-prose continuations on MBPP prompts (PR #1641 MBPP/11 smoke: `SyntaxError` on "Example:" prose, 0/1 pass).

Changes

  • Replace legacy loop with `realizar::run_inference + InferenceConfig::with_prompt` (ChatML auto-wrap for instruct models)
  • Parse ```python ... ``` markdown blocks via `extract_python_code_block_targeted(&result.text, None)` — MBPP has no `entry_point`, first-non-empty-block fallback
  • Raw-continuation fallback preserved when no markdown block found

Out of scope

  • §70 RC3 prompt-preamble (MBPP prompts are NL, no imports to preserve)
  • §17.5 chain impact (MBPP is not a §17.5 row; ship % unchanged)
  • Full 500-problem rerun (dispatch as separate evidence slice)

Test plan

  • `cargo check -p apr-cli --features inference` → clean
  • `cargo fmt --all` → clean
  • gx10 MBPP/11 `APR_EVAL_DEBUG=1` smoke (expect pass after fix)
  • gx10 sanitized-subset MBPP rerun

Refs

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 12, 2026 15:43
…ode-block extraction (PMAT-CODE-MBPP-H4-FIX)

Mirrors the §70 HumanEval H4 + R1+R2 cascade (PRs #1616, #1628 squashed
via #1634/#1635) for MBPP. The legacy `AprTransformer::forward_with_cache
+ AprKVCache` path was producing NL-prose continuations on MBPP prompts
(see PR #1641 MBPP/11 smoke: SyntaxError on "Example:" prose, 0/1 pass).

Changes:

- Replace `AprTransformer::forward_with_cache + AprKVCache` loop with
  `realizar::run_inference + InferenceConfig::with_prompt` (ChatML
  auto-wrap for instruct models).
- Parse `\`\`\`python ... \`\`\`` markdown blocks from the response via
  `extract_python_code_block_targeted(&result.text, None)`. MBPP has no
  `entry_point` in the problem schema; first-non-empty-block fallback is
  appropriate.
- Raw-continuation fallback preserved: strip prompt prefix, truncate at
  next top-level def — used when no markdown block found.

Out of scope (vs HumanEval cascade):

- §70 RC3 prompt-preamble handling: MBPP prompts are NL ("Write a python
  function to..."), no Python imports to preserve. `extract_prompt_preamble`
  not applicable.
- §17.5 chain impact: MBPP is not in §17.5; this PR does not move ship %.
- Full 500-problem rerun: dispatch as a separate evidence slice.

Test plan:
- [x] cargo check -p apr-cli --features inference → clean
- [x] cargo fmt --all → clean
- [ ] gx10 single-MBPP-problem APR_EVAL_DEBUG=1 smoke (next slice)
- [ ] gx10 sanitized-subset MBPP rerun for pass@1 measurement

Refs:
- crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference (mirror)
- PR #1641 (MBPP diagnostic surface, cascade base)
- evidence/section-71-ship-005-discharged-2026-05-12/ (HumanEval cascade pattern)
- project_2026_05_12_mbpp_legacy_path_finding.md (cascade scope)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the fix/apr-eval-mbpp-h4-chatml branch from b3876b7 to 1163d0a Compare May 12, 2026 15:50
@noahgift
Copy link
Copy Markdown
Contributor Author

Empirical evidence (gx10, 2026-05-12)

Pre-fix (legacy AprTransformer path): 0/1 pass@1 on MBPP/11 — SyntaxError on "Example:" prose (PR #1641 evidence).

Post-fix (H4 + ChatML + test-hint, this PR): 4/5 pass@1 on MBPP/11-15 smoke:

Task Pre-fix Post-fix
MBPP/11 FAIL (SyntaxError) PASS (exit_code=0)
MBPP/12 not tested PASS
MBPP/13 not tested FAIL (assertion mismatch — model-quality, NOT harness)
MBPP/14 not tested PASS
MBPP/15 not tested PASS

Effective rate: 80% pass@1 on 5-problem smoke. Mirrors the HumanEval H4 80.49% jump from §67. Diagnostic JSONs show success: true for 4/5 and a clean Python assertion failure for MBPP/13 (no harness false-negative).

Confirmed: this PR fixes the harness layer of MBPP. Remaining failures are model-quality (similar to HumanEval at H4 stage).

@noahgift noahgift merged commit e5778cc into main May 12, 2026
10 checks passed
@noahgift noahgift deleted the fix/apr-eval-mbpp-h4-chatml branch May 12, 2026 17:46
noahgift added a commit that referenced this pull request May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP)

🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001.

All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical
7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090,
--features cuda).

This release prep PR ships:
1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights:
   - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE)
   - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75) — 124.6 tok/s
   - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71) — pass@1 86.59%
   - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634)
   - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649)
   - Added: MBPP harness H4 fix (PR #1645)
   - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness-
     invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0)
   - Methodology lessons #16-22 captured in MEMORY.md
   - Spec: v3.13.0 → v3.21.0 across §67-§75

2. Workspace version bump:
   - [workspace.package].version: 0.32.0 → 0.33.0
   - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0
   - 28 sub-crate version literals: 0.32.0 → 0.33.0

3. `cargo check -p aprender` → clean (workspace builds at 0.33.0).

Out of scope for this PR (separate steps after #1651/1652 land + this
PR lands):
- Tag release `v0.33.0` on main
- Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md
  — 15 user-facing crates + 7 internal-tier in topological dependency
  order; uses `make publish CRATE=<name>`)
- Post-publish QA per `feedback_post_publish_qa_required.md` —
  `cargo install aprender --force` + `/dogfood` GO verdict required
  before declaring release done (v0.31.1 was yanked for skipping this)
- GitHub Release with §75 narrative
- HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256
  already verified by §72 SHIP-010 LIVE evidence; double-check before
  release announcement)

This PR ships ONLY the version-bump + CHANGELOG. Publishing is the
next step after merge.

Refs:
- §75 MODEL-1 100% (PR #1652)
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- §72 5-AC LIVE cascade (PR #1646)
- §71 SHIP-005 LIVE-DISCHARGED (PR #1642)
- §70 RC3 fix (PR #1636)
- §69 Q4K hypothesis falsified (PR #1633)
- PR #1635 RC3 prepend
- PR #1634 diagnostic surface + contract
- PR #1648 SHIP-007 contract scaffold
- PR #1649 SHIP-007 PR-B stage dump
- PR #1651 SHIP-007 PR-E F32 GEMV layout fix

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP) (#1653)

🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001.

All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical
7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090,
--features cuda).

This release prep PR ships:
1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights:
   - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE)
   - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75) — 124.6 tok/s
   - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71) — pass@1 86.59%
   - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634)
   - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649)
   - Added: MBPP harness H4 fix (PR #1645)
   - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness-
     invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0)
   - Methodology lessons #16-22 captured in MEMORY.md
   - Spec: v3.13.0 → v3.21.0 across §67-§75

2. Workspace version bump:
   - [workspace.package].version: 0.32.0 → 0.33.0
   - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0
   - 28 sub-crate version literals: 0.32.0 → 0.33.0

3. `cargo check -p aprender` → clean (workspace builds at 0.33.0).

Out of scope for this PR (separate steps after #1651/1652 land + this
PR lands):
- Tag release `v0.33.0` on main
- Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md
  — 15 user-facing crates + 7 internal-tier in topological dependency
  order; uses `make publish CRATE=<name>`)
- Post-publish QA per `feedback_post_publish_qa_required.md` —
  `cargo install aprender --force` + `/dogfood` GO verdict required
  before declaring release done (v0.31.1 was yanked for skipping this)
- GitHub Release with §75 narrative
- HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256
  already verified by §72 SHIP-010 LIVE evidence; double-check before
  release announcement)

This PR ships ONLY the version-bump + CHANGELOG. Publishing is the
next step after merge.

Refs:
- §75 MODEL-1 100% (PR #1652)
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- §72 5-AC LIVE cascade (PR #1646)
- §71 SHIP-005 LIVE-DISCHARGED (PR #1642)
- §70 RC3 fix (PR #1636)
- §69 Q4K hypothesis falsified (PR #1633)
- PR #1635 RC3 prepend
- PR #1634 diagnostic surface + contract
- PR #1648 SHIP-007 contract scaffold
- PR #1649 SHIP-007 PR-B stage dump
- PR #1651 SHIP-007 PR-E F32 GEMV layout fix

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant