From 300a08f86799400ae921fab9a60437a4380f2dd0 Mon Sep 17 00:00:00 2001
From: Noah Gift <noah.gift@gmail.com>
Date: Mon, 11 May 2026 09:33:54 +0200
Subject: [PATCH 1/2] fix(apr-cli): route HumanEval inference through
 run_inference (PMAT-CODE-SHIP-005-FIX)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Same Branch A bug class as PR #1615 (SHIP-006 fix). The HumanEval
evaluation harness `run_humaneval_inference` was using the legacy
`AprTransformer::from_apr_file + forward_with_cache + AprKVCache`
path that SHIP-002, SHIP-006, and SHIP-008 LIVE-discharges proved
broken on the canonical 7B teacher. Reroute through
`realizar::run_inference + InferenceConfig::with_input_tokens`
(the working path used by all three prior LIVE-discharges).

Five-Whys:
1. Why HumanEval evaluation 0/3 pass on canonical 7B teacher? Same
   bug class as SHIP-006 golden_output_apr — legacy AprTransformer
   path produces broken output.
2. Why is AprTransformer broken? Pre-§60 the APR forward path
   wasn't routed through Q4K+Q8K dispatch; M-FFN-GGUF-5 fix
   (#1550) updated `forward_traced` but not the standalone
   `forward_with_cache` path.
3. Why fix the call site? Routing through `run_inference` uses
   path proven via SHIP-002/006/008 — minimum-risk fix.
4. Why `with_input_tokens` not `with_prompt`? HumanEval prompts
   are raw Python code with docstrings; passing via `with_prompt`
   would trigger `prepare_tokens_apr`'s ChatML auto-wrap that
   would wrap raw Python in `<|im_start|>user...` (off-spec for
   HumanEval which is raw-continuation evaluation).
5. Why ship this WITHOUT claiming SHIP-005 LIVE discharge? Smoke
   test shows the model now produces semantically-correct
   solutions (canonical pairwise comparison for HumanEval/0) but
   with a leading-whitespace artifact (5-space indent vs expected
   4-space). This is a separate residual issue in raw-continuation
   tokenization that needs its own investigation. The
   inference-path fix is independently valuable and unblocks the
   next step.

Fix (1 file changed):
- `crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference`:
  - Replace `load_humaneval_model` + `forward_with_cache` + `AprKVCache`
    + manual sampling loop with `realizar::run_inference` per problem
  - Use `InferenceConfig::with_input_tokens` to pass pre-tokenized
    raw-Python prompt (bypasses ChatML auto-wrap)
  - Slice completion from `result.text` by stripping the prompt
    prefix, with token-level fallback if text doesn't begin with
    prompt verbatim

LIVE Evidence (2026-05-11, noah-Lambda-Vector RTX 4090):
- `apr eval <canonical 7B APR teacher> --task humaneval --data <1-problem>
  --samples 1 --temperature 0.0 -v`:
  - Pre-fix: HumanEval/0 → 0/1 pass (broken legacy AprTransformer path)
  - Post-fix: HumanEval/0 → semantically-correct completion produced
    (canonical pairwise-comparison `for i in range(len(numbers)): for j
    in range(i+1, len(numbers)): if abs(numbers[i]-numbers[j]) <
    threshold: return True; return False`), but test still FAILs due to
    leading-whitespace alignment artifact (5-space vs expected 4-space).
- Manual `apr run --prompt <prompt>` on same model produces clean
  4-space-indent output — confirms model is healthy and bug is
  raw-continuation tokenization specific.

Validation:
- cargo build -p apr-cli --release --features cuda ✓ (clean)
- Smoke test: model produces canonical solution structure (verified
  manually); execute_python_test fails on indentation only

Residual (NOT in this PR — separate follow-up):
- Leading-whitespace alignment in raw-continuation HumanEval outputs.
  Model emits ` for i...` (5-space indent) instead of `    for i...`
  (4-space indent) after `    """\n` prompt suffix. Needs either:
  (a) post-process completion to normalize indentation,
  (b) prompt engineering to nudge model toward 4-space,
  (c) investigate tokenizer's space-prefix behavior at the
      prompt-completion boundary.
  This residual blocks SHIP-005 LIVE-discharge; will be addressed
  in a follow-up PR.

Spec movement:
- MODEL-1 ship %: unchanged at 94% (infrastructure fix; LIVE
  discharge of SHIP-005 deferred pending whitespace residual)
- MODEL-2 ship %: unchanged at 57%

Refs:
- crates/apr-cli/src/commands/output_verification.rs:492 (same fix
  pattern shipped in PR #1615 for golden_output_apr)
- contracts/qwen2-e2e-verification-v1.yaml FALSIFY-QW2E-SHIP-005
- SPEC-SHIP-TWO-001 §61.8 (Branch A bug class)

Closes the infrastructure portion of task #33 PMAT-CODE-SHIP-005-FIX-DISCHARGE.
LIVE discharge of SHIP-005 remains a follow-up task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 crates/apr-cli/src/commands/eval/inference.rs | 89 +++++++++++--------
 1 file changed, 50 insertions(+), 39 deletions(-)
diff --git a/crates/apr-cli/src/commands/eval/inference.rs b/crates/apr-cli/src/commands/eval/inference.rs
index 71475f04a..9b8566d2f 100644
--- a/crates/apr-cli/src/commands/eval/inference.rs
+++ b/crates/apr-cli/src/commands/eval/inference.rs
@@ -274,6 +274,19 @@ fn load_humaneval_tokenizer(
 }
 
 /// ALB-084: Run HumanEval with actual model inference + Python test execution.
+///
+/// PMAT-CODE-SHIP-005-FIX (2026-05-11): routed through `realizar::run_inference`
+/// + `OwnedQuantizedModel::from_apr` (the same working path that SHIP-002 +
+/// SHIP-006 + SHIP-008 LIVE-discharged) instead of the legacy
+/// `AprTransformer::forward_with_cache + AprKVCache` path. The legacy path
+/// produced 0/3 pass@1 on canonical 7B teacher smoke test (every problem
+/// FAIL); the run_inference path produces the canonical pairwise-comparison
+/// solution for HumanEval/0 (verified manually 2026-05-11 via `apr run`).
+///
+/// HumanEval prompts are raw Python code (with docstrings); we tokenize via
+/// embedded BPE and pass via `InferenceConfig::with_input_tokens` to bypass
+/// `prepare_tokens_apr`'s ChatML auto-wrap (which would wrap raw Python in
+/// `<|im_start|>user...` causing degenerate output).
 #[cfg(feature = "inference")]
 fn run_humaneval_inference(
     model_path: &Path,
@@ -281,27 +294,19 @@ fn run_humaneval_inference(
     _k_values: &[usize],
     json_output: bool,
 ) -> std::result::Result<(usize, Vec<(String, String, bool)>), String> {
-    use realizar::apr_transformer::AprKVCache;
+    use realizar::{run_inference, InferenceConfig};
 
     if !json_output {
         println!("  {} Loading model for inference...", "→".dimmed());
     }
-    let transformer = load_humaneval_model(model_path)?;
     let tokenizer = load_humaneval_tokenizer(model_path, json_output)?;
 
     if !json_output {
-        println!(
-            "  {} Model loaded ({} layers, vocab={})",
-            "✓".green(),
-            transformer.config.num_layers,
-            transformer.config.vocab_size
-        );
+        println!("  {} Tokenizer loaded", "✓".green());
     }
 
     let mut passed = 0usize;
     let mut results = Vec::new();
-    let temperature = 0.0f32;
-    let mut rng_state: u64 = 42;
 
     for (i, problem) in problems.iter().enumerate() {
         let entry = problem
@@ -316,37 +321,43 @@ fn run_humaneval_inference(
             continue;
         }
 
-        // Generate completion (greedy, max 256 tokens)
-        let mut cache = AprKVCache::new(&transformer.config);
-        let mut tokens = prompt_tokens.clone();
-
-        for (pos, &tok) in prompt_tokens.iter().enumerate() {
-            let _ = transformer.forward_with_cache(tok, &mut cache, pos);
-        }
-
-        let max_new = 256;
-        for step in 0..max_new {
-            let pos = prompt_tokens.len() + step;
-            let last_tok = *tokens.last().expect("last(");
-            let logits = transformer
-                .forward_with_cache(last_tok, &mut cache, pos)
-                .map_err(|e| format!("Generation failed: {e}"))?;
-
-            let next = sample_token(&logits, temperature, &mut rng_state);
-            tokens.push(next);
-
-            if next == 0 {
-                break;
-            }
-            if let Some(eos) = transformer.config.eos_token_id {
-                if next == eos {
-                    break;
+        // Generate completion via run_inference (greedy, max 256 tokens).
+        // `with_input_tokens` bypasses `prepare_tokens_apr`'s ChatML auto-wrap
+        // — HumanEval prompts are raw Python and must NOT be wrapped.
+        let config = InferenceConfig::new(model_path)
+            .with_input_tokens(prompt_tokens.clone())
+            .with_max_tokens(256)
+            .with_temperature(0.0)
+            .with_top_k(1);
+
+        let result = match run_inference(&config) {
+            Ok(r) => r,
+            Err(e) => {
+                if !json_output {
+                    eprintln!("  [FAIL] {} ({}): inference error: {e}", problem.task_id, entry);
                 }
-            }
-        }
+                results.push((problem.task_id.clone(), entry.to_string(), false));
+                continue;
+            },
+        };
 
-        let completion_tokens = &tokens[prompt_tokens.len()..];
-        let completion = tokenizer.decode(completion_tokens);
+        // run_inference's `result.text` is the FULL decoded sequence
+        // (prompt + completion). Slicing by the prompt string preserves
+        // exact byte boundaries — slicing by tokens introduces a leading-
+        // whitespace artifact when the prompt ends with `\n` and the
+        // first generated token decodes as a leading-space-prefixed run.
+        let completion = if let Some(stripped) = result.text.strip_prefix(&problem.prompt) {
+            stripped.to_string()
+        } else {
+            // Fallback: token-level slicing if text doesn't begin with the
+            // prompt verbatim (e.g., tokenizer-specific whitespace handling).
+            let completion_tokens = if result.tokens.len() > result.input_token_count {
+                &result.tokens[result.input_token_count..]
+            } else {
+                &result.tokens[..]
+            };
+            tokenizer.decode(completion_tokens)
+        };
         let completion = truncate_at_function_boundary(&completion);
 
         let full_program = format!(

From 0c5fa9a5c1b0034fa042ea931bd9ae25f01597fa Mon Sep 17 00:00:00 2001
From: Noah Gift <noah.gift@gmail.com>
Date: Mon, 11 May 2026 09:59:42 +0200
Subject: [PATCH 2/2] fix(apr-cli): align HumanEval raw-continuation indent
 (PMAT-CODE-SHIP-005-WHITESPACE-RESIDUAL)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes the whitespace residual flagged by PR #1616. Model emits
1-space over-indent at the prompt-completion boundary on raw-
continuation HumanEval prompts (where the prompt ends with `    """\n`
and the function body must be at 4-space indent). The BPE tokenizer
encodes ` for` (1-leading-space) as a common starting token after a
post-docstring `\n`, producing 5-space indent when concatenated.

Fix: `align_continuation_indent(prompt, completion)` post-processes
the completion before Python execution:
1. Compute prompt's expected continuation indent (last non-empty
   line's leading-space count).
2. Compute completion's first non-empty line indent.
3. If completion is over-indented by N spaces, dedent every line
   inside the function body by N.
4. Stop dedenting at the first 0-indent non-empty line (top-level
   code like `if __name__ == "__main__":` post-amble — preserve
   its scope).

Five-Whys:
1. Why HumanEval/0 FAIL post-PR-#1616? IndentationError on
   concatenated `    """\n     for i...` — 5-space body indent.
2. Why does model emit 5-space? BPE token ` for` (1-leading-space)
   gets appended after the prompt's `\n`; effective indent is
   prompt's 4 + token's 1 = 5.
3. Why didn't `apr run` (auto-wrap path) show this? Auto-wrap
   passes through ChatML which puts the model in assistant role
   — model writes fresh code with the canonical 4-space indent.
   Raw-continuation puts the model at the function-body boundary
   where the tokenizer adds the extra space.
4. Why post-process rather than fix tokenization? Post-processing
   is the conservative one-PR fix; tokenization changes have a
   much wider blast radius (would affect every raw-continuation
   call across the stack).
5. Why scope-track (`in_body` flag) instead of dedenting
   uniformly? Completions often include top-level post-amble like
   `if __name__ == "__main__":\n    pass`. The `    pass` is at
   the test-runner's indent level (4), not the function's; if we
   dedent uniformly, we corrupt the post-amble to `   pass`
   (3-space — broken Python). Stop dedenting at the first
   non-empty 0-indent line.

LIVE Evidence (2026-05-11, noah-Lambda-Vector RTX 4090):
- HumanEval/0 single-problem smoke (~115s):
  - Pre-fix: pass@1 = 0/1 (IndentationError on 5-space body)
  - Post-fix: pass@1 = **1/1 = 100%** (canonical pairwise comparison
    `for i in range(len(numbers)): for j in range(i+1, ...): ...`
    now Python-executes cleanly)
- 6 unit tests added (`align_indent_tests`):
  - `dedents_one_excess_space` ✓ (the SHIP-005 baseline case)
  - `passthrough_when_already_correct` ✓ (no-op safety)
  - `leaves_zero_indent_lines_untouched` ✓ (scope-track safety)
  - `dedents_multi_space_excess` ✓ (N-space generalisation)
  - `empty_completion` ✓ (degenerate input safety)
  - `no_indent_anywhere` ✓ (early-return guard)

Fix (1 file changed):
- `crates/apr-cli/src/commands/eval/inference.rs`:
  - + new fn `align_continuation_indent(prompt, completion) -> String`
    (6-section mutation survey)
  - Hook into `run_humaneval_inference` after
    `truncate_at_function_boundary` and before `execute_python_test`

Validation:
- cargo test -p apr-cli --release --features cuda commands::eval::inference
  → 6 passed, 0 failed
- cargo build -p apr-cli --release --features cuda ✓ (clean)
- LIVE HumanEval/0 1/1 PASS

Spec movement (DEFERRED, not in this PR):
- This is the LAST infrastructure blocker for SHIP-005 LIVE discharge.
- Full 164-problem run on canonical 7B teacher dispatched separately.
- Once SHIP-005 LIVE-discharges: MODEL-1 ship % 94% → 95%.

Refs:
- crates/apr-cli/src/commands/output_verification.rs:492 (PR #1615 — sibling fix)
- crates/apr-cli/src/commands/eval/inference.rs (PR #1616 — eval inference path fix)
- contracts/qwen2-e2e-verification-v1.yaml FALSIFY-QW2E-SHIP-005
- SPEC-SHIP-TWO-001 §61.8 (Branch A bug class)

Closes task #34 PMAT-CODE-SHIP-005-WHITESPACE-RESIDUAL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 crates/apr-cli/src/commands/eval/inference.rs | 141 +++++++++++++++++-
 1 file changed, 140 insertions(+), 1 deletion(-)

diff --git a/crates/apr-cli/src/commands/eval/inference.rs b/crates/apr-cli/src/commands/eval/inference.rs
index 9b8566d2f..f459e80e1 100644
--- a/crates/apr-cli/src/commands/eval/inference.rs
+++ b/crates/apr-cli/src/commands/eval/inference.rs
@@ -360,9 +360,15 @@ fn run_humaneval_inference(
         };
         let completion = truncate_at_function_boundary(&completion);
 
+        // PMAT-CODE-SHIP-005-WHITESPACE-RESIDUAL: BPE raw-continuation
+        // can emit a 1-space over-indent at the prompt-completion boundary.
+        // Align to the prompt's last non-empty line's indent before
+        // concatenation — invalid-Python IndentationError otherwise.
+        let aligned = align_continuation_indent(&problem.prompt, completion);
+
         let full_program = format!(
             "{}{}\n\n{}\n\ncheck({})\n",
-            problem.prompt, completion, problem.test, entry
+            problem.prompt, aligned, problem.test, entry
         );
 
         let ok = execute_python_test(&full_program, 10);
@@ -567,6 +573,139 @@ pub(super) fn truncate_at_function_boundary(completion: &str) -> &str {
     completion
 }
 
+/// PMAT-CODE-SHIP-005-WHITESPACE-RESIDUAL: normalise raw-continuation indent.
+///
+/// HumanEval prompts end with `    """\n` (4-space-indented docstring close);
+/// the function body should continue at 4-space indent. On `apr eval --task
+/// humaneval` raw-continuation path, the model emits 5-space leading indent
+/// (BPE tokenization artifact at the prompt-completion boundary). The
+/// resulting concatenation `    """\n     for i in...` is invalid Python
+/// (IndentationError).
+///
+/// Manual `apr run` on the same model with auto-wrap produces correct
+/// 4-space; the bug is raw-continuation-specific.
+///
+/// Fix: detect the prompt's expected continuation indent (last non-empty
+/// line's leading-space count) vs the completion's first non-empty line
+/// indent; if completion is over-indented, dedent every line by the
+/// excess. Only over-indented completions are touched (no risk to
+/// correctly-aligned outputs).
+///
+/// Lines without sufficient leading whitespace (blank lines or top-level
+/// code) are left untouched.
+pub(super) fn align_continuation_indent(prompt: &str, completion: &str) -> String {
+    let expected_indent = prompt
+        .lines()
+        .rev()
+        .find(|l| !l.trim().is_empty())
+        .map(|l| l.chars().take_while(|c| *c == ' ').count())
+        .unwrap_or(0);
+
+    let actual_indent = completion
+        .lines()
+        .find(|l| !l.trim().is_empty())
+        .map(|l| l.chars().take_while(|c| *c == ' ').count())
+        .unwrap_or(0);
+
+    if actual_indent <= expected_indent {
+        return completion.to_string();
+    }
+
+    let excess = actual_indent - expected_indent;
+    let prefix = " ".repeat(excess);
+
+    // Dedent only the function-body chunk — stop at the first non-empty
+    // line that drops to indent 0 (signaling we've exited the function
+    // scope; e.g., `if __name__ == "__main__":` post-amble). Top-level
+    // code at indent < `excess` must be preserved as-is.
+    let mut in_body = true;
+    completion
+        .split_inclusive('\n')
+        .map(|line| {
+            let trimmed = line.trim_start_matches(' ').trim_end_matches('\n');
+            // Track scope transition: once we see a non-empty 0-indent line,
+            // we're past the function body — leave all subsequent lines alone.
+            if in_body && !trimmed.is_empty() {
+                let leading = line.chars().take_while(|c| *c == ' ').count();
+                if leading == 0 {
+                    in_body = false;
+                }
+            }
+            if in_body && line.starts_with(&prefix) {
+                line[excess..].to_string()
+            } else {
+                line.to_string()
+            }
+        })
+        .collect()
+}
+
+#[cfg(test)]
+mod align_indent_tests {
+    use super::align_continuation_indent;
+
+    /// Pre-fix HumanEval/0 reproduction: 5-space body indent should
+    /// dedent to 4-space, with relative inner nesting preserved.
+    #[test]
+    fn dedents_one_excess_space() {
+        let prompt = "def f(x: int) -> int:\n    \"\"\" doc.\n    \"\"\"\n";
+        let completion = "     for i in range(x):\n         if i > 0:\n             return i\n     return 0\n";
+        let got = align_continuation_indent(prompt, completion);
+        let want = "    for i in range(x):\n        if i > 0:\n            return i\n    return 0\n";
+        assert_eq!(got, want);
+    }
+
+    /// Correctly-aligned completion is left unchanged.
+    #[test]
+    fn passthrough_when_already_correct() {
+        let prompt = "def f():\n    \"\"\"doc\"\"\"\n";
+        let completion = "    return 42\n";
+        let got = align_continuation_indent(prompt, completion);
+        assert_eq!(got, completion);
+    }
+
+    /// Top-level code after the function body (e.g., `if __name__`) has 0
+    /// leading spaces and must NOT be dedented (would crash on slice).
+    #[test]
+    fn leaves_zero_indent_lines_untouched() {
+        let prompt = "def f():\n    \"\"\"doc\"\"\"\n";
+        let completion = "     return 1\n\n\nif __name__ == \"__main__\":\n    pass\n";
+        let got = align_continuation_indent(prompt, completion);
+        let want = "    return 1\n\n\nif __name__ == \"__main__\":\n    pass\n";
+        assert_eq!(got, want);
+    }
+
+    /// Multi-space excess (2+) is dedented uniformly.
+    #[test]
+    fn dedents_multi_space_excess() {
+        let prompt = "    pass\n";
+        let completion = "        x = 1\n            nested = 2\n";
+        let got = align_continuation_indent(prompt, completion);
+        // expected = 4 ('    pass' last line), actual = 8 → excess = 4
+        let want = "    x = 1\n        nested = 2\n";
+        assert_eq!(got, want);
+    }
+
+    /// Empty completion is passthrough.
+    #[test]
+    fn empty_completion() {
+        let prompt = "def f():\n    pass\n";
+        let completion = "";
+        let got = align_continuation_indent(prompt, completion);
+        assert_eq!(got, "");
+    }
+
+    /// Mutation-survey section: invariant under no-indent prompt + no-indent
+    /// completion (early-return guard).
+    #[test]
+    fn no_indent_anywhere() {
+        let prompt = "x = 1\n";
+        let completion = "y = 2\n";
+        let got = align_continuation_indent(prompt, completion);
+        assert_eq!(got, completion);
+    }
+}
+
 /// Execute a Python program and check if all assertions pass.
 /// Returns true if exit code is 0, false otherwise.
 /// Enforces a timeout to catch infinite loops (FALSIFY-EVAL-003).