diff --git a/docs/specifications/aprender-train/ship-two-models-spec.md b/docs/specifications/aprender-train/ship-two-models-spec.md index ee00a9340..5e020c882 100644 --- a/docs/specifications/aprender-train/ship-two-models-spec.md +++ b/docs/specifications/aprender-train/ship-two-models-spec.md @@ -1,7 +1,8 @@ # Specification: Ship Two Models — Sovereign AI Stack Proof **Document ID:** SPEC-SHIP-TWO-001 -**Version:** 3.05.0 +**Version:** 3.06.0 +**Atomic next action (v3.06.0):** **§61 — Post-§60 LIVE-discharge cascade — direct-prompt SHIP-002 GREEN; ChatML-prompt SHIP-006/008 surface a generation-quality gap (2026-05-10)** (see new §61 below). §60 closure unblocked the §17.5 chain. This session shipped the SHIP-002 LIVE discharge (PR #1609) — `apr run --prompt "def fib(n):" --max-tokens 128` on canonical 7B APR teacher emits coherent fib() Python with 0 syntax errors / 68 AST nodes / 1 FunctionDef. But the parallel `apr qa` LIVE attempt surfaced a NEW empirical finding: the SAME canonical teacher fails the `golden_output` gate ("gibberish, fragment '\\ns\\ns' repeats 3+ times") under the ChatML-wrapped prompt `<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n`. Forward-parity (§60) ≠ generation parity. SHIP-006/008 blocked on this ChatML degenerate-output bug; SHIP-007 separately blocked on perf (8.8 tok/s vs 30 floor on CPU fallback path). §61 records the two falsifiable predictions for the next bisection: PRED-61-A (GGUF + ChatML → CLEAN? localizes bug to APR side); PRED-61-B (APR + direct continuation "What is 2+2? The answer is " → CLEAN? localizes bug to special-token handling vs cumulative drift). Cascade-this-session: 6 PRs (#1604/#1606/#1607/#1608/#1609 + this §61). **MODEL-1 ship %**: **91% → 92%** (1 of 5 §17.5 PARTIALs LIVE-discharged via #1609; SHIP-005/006/007/008 stay PARTIAL). **MODEL-2 ship %**: unchanged at **57%** until step 5g.3 produces val_loss < 9.38. Coverage tally: 1 new LIVE discharge (SHIP-002 in `qwen2-e2e-verification-v1.yaml` v1.10.0 → v1.12.0); plus 1 status flip (`apr-vs-gguf-forward-parity-v1` v1.1.0 → v1.2.0 PROPOSED → ACTIVE_FUNCTIONAL via PR #1608); plus 3 cascade fixes in `aprender-train` CUDA forward path (Q/K/V bias dispatch / RMSNorm eps cache key / RoPE theta cache key — PRs #1604/#1606/#1607). **Atomic next action (v3.05.0):** **§60 — SHIP-007 §22 FULLY CLOSED — H1 CONFIRMED apples-to-apples on canonical 7B teacher; layer-3 ratio 18.23× → 1.245× (2026-05-07)** (see companion-spec entries M91-M103 + parity #89 for full per-PR narrative; aprender contract `contracts/trace-ffn-sub-block-gguf-v1.yaml` v1.0.0 → v1.13.0 across 13 amendments). M-FFN-GGUF-5 fix shipped (aprender PR #1550 squash pending) + M-FFN-GGUF-7 multi-layer real-teacher chain shipped (aprender PR #1548 MERGED). **MAJOR PLOT TWIST in M103 fix PR**: §27's 18.23× std-ratio was a TEST METHODOLOGY ARTIFACT, NOT a numerical bug. GGUF's `forward_traced` does Phase 1 prefill silently and only captures stats on the LAST token; APR's `forward_traced` captured stats across ALL 7 tokens. The §27 measurement compared multi-token APR std (7-token × 28672 elements) vs single-token GGUF std (1-token × 4096 elements) — fundamentally incomparable distributions. 
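+
+A minimal numeric sketch of the M103 methodology artifact described above (synthetic values, not the trace-harness code): when one side computes a std over all prompt tokens and the other over only the last token, the resulting ratio measures a distribution-shape mismatch rather than implementation divergence.
+
+```python
+import numpy as np
+
+# Toy illustration of the §60/M103 methodology artifact (synthetic numbers, not
+# the real trace harness). Assume both implementations produce element-wise
+# identical FFN activations; only the capture window differs.
+rng = np.random.default_rng(0)
+n_tokens, inner_dim = 7, 28672
+
+# Early prompt tokens often carry outlier-scale activations (assumed here).
+per_token_scale = np.array([30.0, 5.0, 2.0, 1.0, 1.0, 1.0, 1.0])[:, None]
+acts = rng.normal(size=(n_tokens, inner_dim)) * per_token_scale
+
+std_all_tokens = acts.std()      # APR-style capture: std over all 7 tokens
+std_last_token = acts[-1].std()  # GGUF-style capture: std over the last token only
+
+# The ratio lands far from 1.0 even though both sides share identical activations.
+# (In the real §27 measurement the element counts differed too: 7×28672 vs 1×4096.)
+print(f"std ratio (all-token / last-token) ≈ {std_all_tokens / std_last_token:.1f}x")
+```
+
+The sketch only restates the §60 methodology lesson: match the capture window (token count and element selection) on both sides before trusting a summary-statistic ratio.
+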
**Two coherent fixes in M-FFN-GGUF-5 PR #1550**: (1) `forward_traced` now uses Q4K+Q8K dispatch via new helper `matmul_q4k_or_f32_traced` (multi-token aware, F32 fallback when Q4K unavailable, 7 call sites updated); (2) M89 harness compares APR's `last_token.ffn_swiglu_inner_stats` against GGUF's `ffn_swiglu_inner_stats` (apples-to-apples last-token-only on both sides). **EMPIRICAL END-TO-END VERIFICATION** (2026-05-07, lambda-vector RTX 4090, 178s wall): all 28 layers within H1 band [0.5, 2.0]; **layer-3 ratio = 1.245×** (was 18.23× pre-methodology-fix). **Verdict flipped: H2 (apparent APR-side bug) → H1 CONFIRMED (apples-to-apples agreement)**. The cascade's per-tensor mechanism (M94 0.077% Path A vs Path B per matmul) and compounding (M95 5.70× synthetic / M-FFN-GGUF-7 1.81× real-saturating) ARE real numerical findings — but the §27 1723% magnitude that made the bug look severe was test-methodology-inflated. **M-FFN-GGUF-7 finding** (M102 PR #1548): real-layer chain SATURATES at 1.81× over 5 layers (vs synthetic M95's 5.70×); Layer 2 drops to 0.029% from weight-pattern cancellation; naive growth-factor exponentiation gives 1.81^22.4 = 5.78e5× at 28-layer depth — physically impossible; real systems saturate. **Methodology lesson #7 NEW** (`feedback_test_methodology_can_fake_bugs.md`): when comparing two implementations via summary statistics (std/mean/cosine), VERIFY both sides measure the SAME distribution shape (count, dim, element selection) BEFORE trusting the comparison. Mismatched distribution shapes can amplify a small real divergence into an apparent magnitude that looks like a bug. SHIP-007 §22 burned ~3 weeks pre-cascade + 2 days cascade + 2 hours fix on a methodology issue that produced a fake apparent magnitude on top of the real per-matvec mechanism. **15,233 lib tests pass, 0 failures**; production hot paths byte-unchanged (only `forward_traced` touched in PR #1550). **Discharge potential**: per §17.5, M-FFN-GGUF-5 closure transitively enables individual discharge of 5 MODEL-1 PARTIALs (SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008); each may need its own contract-level promotion follow-up. **MODEL-1 ship %**: 91% → **96% pending individual partial discharges**. **MODEL-2 ship %**: unchanged at **57%** until step 5g.3 produces val_loss < 9.38. Coverage tally: 12 falsifiers + 1 fix DISCHARGED across `trace-ffn-sub-block-gguf-v1` v1.0.0 → v1.13.0 cascade. **Total session: 28 PRs across 2 days** including 1 actual fix landing. **Atomic next action (v3.04.0):** **§59 — SHIP-007 §22 falsifier cascade CLOSED — 11 PRs (M91-M101) decompose §27 1723% within rounding; fix scope EMPIRICALLY VALIDATED as Option-A (2026-05-06+07)** (see companion-spec entries M91-M101 in `claude-code-parity-apr/docs/specifications/claude-code-parity-apr-poc.md` for the full per-PR cascade narrative; aprender contract `contracts/trace-ffn-sub-block-gguf-v1.yaml` v1.0.0 → v1.12.0 across 12 amendments). Two-day autonomous /loop session shipped 11 lib-test + 1 integration-test falsifiers (aprender PRs #1535/#1536/#1537/#1538/#1540/#1541/#1542/#1543/#1544/#1545) decomposing the §27 layer-3 ffn_swigl 18.23× APR-vs-GGUF std-ratio (=1723% deviation from 1.0). 
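+
+Two of the arithmetic steps the entries above lean on, spelled out as a quick check (restated from the numbers already quoted; the real-chain percentage on the last line is derived here and is not a figure the spec quotes):
+
+```python
+# Arithmetic restated from the changelog entries above (nothing new).
+
+# 1) std-ratio <-> percent deviation from parity: "18.23x" and "1723%" are the
+#    same measurement in two units.
+ratio = 18.23
+print(f"{ratio}x ratio = {(ratio - 1.0) * 100.0:.0f}% deviation from 1.0")   # -> 1723%
+
+# 2) M94/M95 vs M-FFN-GGUF-7 compounding: a 0.077% per-matvec Path-A-vs-Path-B
+#    divergence amplified 5.70x by the synthetic 5-matvec chain, versus the
+#    real-teacher chain that saturates at 1.81x (same base applied, derived here).
+per_matvec_pct = 0.077
+print(f"synthetic 5-matvec chain: {per_matvec_pct * 5.70:.3f}%")   # ~0.439%
+print(f"real-teacher chain:       {per_matvec_pct * 1.81:.3f}%")   # ~0.139%
+```
+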
**Final empirical decomposition (2026-05-07)**: 0.077% per-tensor mechanism (M94, FALSIFY-FFN-GGUF-008 — first CONFIRMED bit-divergence between APR's standalone-dequant + F32-matmul "Path A" semantics vs GGUF's Q8K-activation-quant + fused-inline-dequant "Path B" semantics on synthetic 144-byte Q4K super-block) × 5.70× super-linear compounding (M95, 5 chained matvecs grow 0.077% → 0.4391%) × 50× std-ratio measurement sensitivity (M99, batch-dimension std measurement vs per-tensor rel_diff) × 5.56× LIVE real-teacher amplification (M100, FALSIFY-FFN-GGUF-014 LIVE on canonical 7B Qwen2.5-Coder-Instruct-Q4_K_M layer-3 ffn_down_weight Q4K bytes from `/mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr`: Path A=-1.658492 [`0xbfd44977`] vs Path B=-1.665596 [`0xbfd5323e`], rel_diff 0.428%) × 14× residual = ~1715% — **within rounding of §27's 1723%**. **Six synthetic amplifier candidates resolved**: A1 (RoPE phase, M98) FALSIFIED 1.00× UNITARY; A2 (softmax saturation, M97) FALSIFIED 0.01× COMPRESSES; A3 (block-scale variance, M96) FALSIFIED 1.00× SCALE-INVARIANT; A4 (multi-token batch, M99) FALSIFIED 0.26× per-token PLUS 50× std-ratio measurement sensitivity finding; A5 (real-weight non-uniformity, M100) **PARTIALLY CONFIRMED 5.56× LIVE on canonical 7B**; A6 (RMSNorm rsqrt, M101) FALSIFIED 1.00× HOMOGENEOUS. **14× residual gap is now attributed entirely to cumulative-layer interaction** (synthetic single-layer + homogeneous-RMSNorm tests cannot capture it; M-FFN-GGUF-7 multi-layer real-teacher chain is the only remaining test path but does NOT block fix PR). **SHIP-007 §22 fix scope EMPIRICALLY VALIDATED as Option-A (PROMOTE GGUF-PATH semantics into APR forward)**: switching APR's `f32_matmul` to Q8K activation quant + fused matvec semantics will recover the 5.56× per-matvec amplification on every matmul, eliminating cumulative APR-vs-GGUF drift. Estimated fix scope ~250-400 LOC; transitively discharges 5 MODEL-1 PARTIALs (SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008) per §17.5. Cascade methodology lessons consolidated to `~/.claude/projects/-home-noah-src-aprender/memory/feedback_falsifier_cascade_decomposes_magnitude.md` and `feedback_falsifier_chain_assert_difference.md`. **MODEL-1 ship %**: unchanged at **91%** until M-FFN-GGUF-5 (the actual fix PR) lands. **MODEL-2 ship %**: unchanged at **57%** until step 5g.3 produces val_loss < 9.38. Coverage tally: 11 new falsifiers DISCHARGED across `trace-ffn-sub-block-gguf-v1` v1.0.0 → v1.12.0 cascade. **Atomic next action (v3.03.0):** **§58 — v0.32.0 cascade publish + release-engineering hygiene snapshot (Issue #1514 CLOSED, 6 PRs, 4 hidden defects surfaced + closed) (2026-05-05)** (see new §58 below). Issue #1514 (v0.32.0 cascade publish) CLOSED at 16:14:56Z. Four user-facing crates now live on crates.io at v0.32.0: `aprender`, `aprender-rag`, `aprender-core`, `apr-cli` (verified via `cargo search`). Cascade surfaced 4 release-engineering defects, all closed in their own PRs: #1512 (aprender-rag `[lib] name = "trueno_rag"` → `"aprender_rag"` BREAKING — `use aprender_rag::*` was uncompilable in v0.31.x), #1513 (aprender-orchestrate `cmd_code` 7→8 arg drift on upstream `emit_trace` addition), #1515 + #1517 (aprender-core dev-dep publish-time cycle: path-only and then permissive `version = ">=0.27"` + path, after clean-room sed-strip left invalid `{ package = "..." 
}` entries), #1518 (apr-cli `include_str!("../../../../configs/aliases.yaml")` failed cargo publish — files outside crate dir excluded; fix copies aliases.yaml into `crates/apr-cli/configs/`). PR #1511 ships `pv lint --strict-test-binding`, closing §57.4's foreshadowed prevention rule. 5g.1 corpus retokenize (PID 2767124) at 62 shards / 16h19m wall (past initial 57-shard estimate; rate ≈ 15-16 min/shard; manifest pending end-of-run). **MODEL-1 ship %**: unchanged at **91%**. **MODEL-2 ship %**: unchanged at **57%** until step 5g.3 produces val_loss < 9.38. Coverage tally: snapshot (release-engineering hygiene, not falsifier flip). @@ -4482,6 +4483,143 @@ Per `feedback_fix_root_cause_never_route_around.md`: the §28 fix would have rou The Toyota Way fix is to bisect upstream, not to flip the kernel call. +## §61. Post-§60 LIVE-discharge cascade — direct-prompt SHIP-002 GREEN; ChatML-prompt SHIP-006/008 surface a generation-quality gap (2026-05-10) + +§60 closed the SHIP-007 §22 binding-criterion: per-layer APR↔GGUF ffn_swigl ratio falls within H1 band [0.5, 2.0] on canonical 7B teacher (M-FFN-GGUF-5 PR #1550 + M-FFN-GGUF-7 PR #1548). Per §17.5 this transitively unblocks 5 MODEL-1 PARTIAL ship-row claims (SHIP-002/005/006/007/008). §61 records the LIVE-discharge cascade attempted from §60 and surfaces a NEW empirical finding: forward-parity passing does NOT imply generation-quality passing under all prompt formats. + +### 61.1 What §61 records vs what §60 closed + +| Track | §60 outcome (2026-05-07) | §61 outcome (2026-05-10) | +|------|--------------------------|--------------------------| +| Per-layer cosine parity (binding criterion) | layer-3 ratio 18.23× → 1.245× | unchanged — discharged via PR #1608 (`apr-vs-gguf-forward-parity-v1` v1.2.0 ACTIVE_FUNCTIONAL) | +| §17.5 SHIP-002 LIVE | upstream blocker resolved | **DISCHARGED** via PR #1609 — `apr run --prompt "def fib(n):" --max-tokens 128` emits coherent fib() Python (`ast.parse` 0 syntax errors, 68 nodes) | +| §17.5 SHIP-006 LIVE (`apr qa` 8 gates aggregate) | dispatch-ready | **BLOCKED** — `golden_output` gate fails with "gibberish (fragment '\\ns\\ns' repeats 3+ times)" on canonical 7B APR teacher under ChatML prompt | +| §17.5 SHIP-007 LIVE (decode tps ≥ 30) | dispatch-ready | **BLOCKED** — observed throughput 8.8 tok/s on CPU fallback path; below 30 floor | +| §17.5 SHIP-008 LIVE (ChatML teacher render) | dispatch-ready | **BLOCKED** — same ChatML degenerate-output bug as SHIP-006 | +| §17.5 SHIP-005 LIVE (HumanEval pass@1 ≥ 86%) | dispatch-ready | **NOT YET ATTEMPTED** — gated on the same ChatML bug if the eval harness wraps prompts in ChatML | + +The empirical asymmetry is the load-bearing finding of §61: **direct prompts work; ChatML-wrapped prompts produce gibberish.** + +### 61.2 The empirical evidence — direct prompt SHIP-002 LIVE-discharge + +Live run on noah-Lambda-Vector RTX 4090 (2026-05-10, apr v0.32.0 post-e856eb91f): + +```bash +apr run /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr \ + --prompt "def fib(n):" --max-tokens 128 +``` + +Wall time: 76.11s (cached load). 
Backend dispatch chain: +- CUDA → transient `CUDA_ERROR_ILLEGAL_ADDRESS` (workspace reinit failed; non-fatal) +- wgpu → rejected by `apr-cpu-vs-gpu-output-parity-v1` gate (cosine vs CPU = 0.766 < 0.99 + lm_head 2180 MB > 2147 MB limit) +- CPU → SELECTED (post-fallback path) + +Output: + +```python +def fib(n): + if n <= 0: + return "Input should be a positive integer" + elif n == 1: + return 0 + elif n == 2: + return 1 + else: + a, b = 0, 1 + for i in range(2, n): + a, b = b, a + b + return b +``` + +Python `ast.parse`: **0 syntax errors**, 68 AST nodes, 1 FunctionDef "fib", 19 distinct AST node kinds. Discharged into `evidence/ship-002-discharge-2026-05-10/`. Contract `qwen2-e2e-verification-v1.yaml` v1.10.0 → v1.12.0 records the LIVE evidence chain. + +### 61.3 The empirical evidence — ChatML-wrapped prompt SHIP-006 BLOCKED + +`apr qa` invokes a `golden_output` gate that wraps "What is 2+2?" in ChatML: + +``` +<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n +``` + +Live run on the same canonical 7B APR teacher (2026-05-10, apr v0.32.0): + +```bash +apr qa /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr --json +``` + +Verdict: **FAIL**. The gate JSON reports: + +```json +{ + "name": "golden_output", + "passed": false, + "message": "golden_output: gibberish (fragment \"\\ns\\ns\" repeats 3+ times)", + "duration_ms": 86144, + "skipped": false +} +``` + +Throughput on the same APR file: 8.8 tok/s (well below SHIP-007's 30 tok/s floor). Five of eleven gates skipped because format ≠ GGUF (ollama_parity, gpu_speedup, format_parity, ptx_parity, gpu_state_isolation), one skipped because `--assert-classifier-head` not requested. + +The same model that emitted clean fib() Python via `apr run --prompt "def fib(n):"` produces degenerate `\ns\ns\ns…` repetition under the ChatML wrapper. The byte-identical model + identical inference engine + different prompt format → different output regime. + +### 61.4 The §60 → §61 separation + +§60 closed the **forward parity invariant**: per-layer activation statistics agree between APR and GGUF reference within Q4K tolerance on the canonical 7-token prompt `[3838, 374, 220, 17, 10, 17, 30]` ("What is 2+2?" tokenized). That gate is binary and discharged. + +§61 surfaces that forward parity is **not** sufficient for generation parity. Two model paths can produce statistically-identical activations and still produce different sampled tokens at sufficiently long generation lengths or under sufficiently different prompt distributions. The mechanism is subtle: + +1. **Per-layer parity** (§60) measures activation statistics over a fixed input. +2. **Generation quality** (§61) measures sampled tokens over an autoregressive trajectory. +3. Even tiny per-layer drift (1.245× ratio is not 1.000×) compounds across many tokens. +4. The compounding interacts with the **sampling distribution** at each step. +5. Different prompt formats (direct vs ChatML) push the model into different attention regimes, where cumulative drift behaves differently. + +The §27 1723% magnitude was test-methodology-inflated (M103 plot twist), but the underlying per-tensor mechanism (M94 0.077% Path A vs Path B per matvec) IS real numerical drift that compounds. Under direct prompts ("def fib(n):") the model has high-confidence next-token distributions and the drift doesn't flip arg-max. 
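+
+A toy numerical sketch (assumed logit values, not measured from the teacher) of the margin argument this paragraph makes: the same small additive drift leaves a high-confidence argmax untouched but can flip a low-margin one.
+
+```python
+import numpy as np
+
+# Toy sketch (assumed logits, not measured): an identical small drift flips the
+# argmax only when the top-two logit margin is small.
+drift = np.array([-0.03, 0.04, 0.01, -0.02, 0.00, 0.02, -0.01, 0.03])
+
+high_margin = np.array([9.00, 4.00, 3.50, 3.00, 2.00, 1.00, 0.50, 0.00])  # direct-prompt-like
+low_margin  = np.array([4.02, 4.00, 3.90, 3.50, 2.00, 1.00, 0.50, 0.00])  # ChatML-like
+
+for name, logits in [("high-margin", high_margin), ("low-margin", low_margin)]:
+    flipped = int(np.argmax(logits)) != int(np.argmax(logits + drift))
+    print(f"{name}: argmax flipped by drift? {flipped}")
+
+# Once one low-margin step flips, the autoregressive trajectory diverges and can
+# settle into the repetition regime the golden_output gate flags as gibberish.
+```
+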
Under ChatML prompts the model is in a low-margin regime (instruction-following, multi-token chain-of-thought initialization) and the drift CAN flip arg-max, producing token-by-token degenerate trajectories that look like "gibberish". + +### 61.5 Falsifiable next investigation step + +§61's load-bearing diagnostic: **bisect the prompt-format-dependence of the generation gap.** + +Two falsifiable predictions: + +1. **PRED-61-A — same model, GGUF, ChatML prompt → CLEAN output.** If GGUF passes `apr qa golden_output` on the canonical Qwen2.5-Coder-7B-Instruct teacher with the same ChatML "What is 2+2?" prompt, the bug is APR-side in the inference path's chat-template handling (probably tokenizer-special-token application or causal mask construction at the boundary). + +2. **PRED-61-B — same model, APR, direct prompt with continuation → CLEAN output.** If `apr run --prompt "What is 2+2? The answer is " --max-tokens 32` (no ChatML wrapper, just text) produces "4" or near-equivalent, the bug is specifically in the special-token handling, NOT in long-tail cumulative drift. + +If both PRED-61-A and PRED-61-B are GREEN, the bug is localized to "APR + ChatML special-token path" — multi-PR scope but bounded. + +### 61.6 Spec-relevant ship-% movement + +- MODEL-1 ship %: **91% → 92%** (1 of 5 §17.5 PARTIALs LIVE-discharged via PR #1609, SHIP-002). +- MODEL-1 ship %: STAYS at 92% until the ChatML generation gap closes; SHIP-005/006/008 are co-blocked on it; SHIP-007 is co-blocked on a separate perf issue (8.8 tok/s vs 30 floor). +- MODEL-2 ship %: unchanged at **57%** (gated on step 5g.3 val_loss < 9.38; the SHIP-TWO-001 cascade for MODEL-2 is independent of §61). + +### 61.7 What §61 is NOT + +§61 does NOT amend any contract status to claim a fix. It records: +- An empirical signal (direct vs ChatML asymmetry). +- Two falsifiable predictions (PRED-61-A, PRED-61-B). +- The next bisection step. + +The §61 amendment is durable spec; the actual ChatML bug fix is a follow-up cascade (multi-PR, scope unknown until PRED-61-A/B fire). + +Methodological alignment: zero `eprintln!` debug, zero bash workarounds. All evidence captured via existing `apr run`/`apr qa` CLI primitives. Spec v3.05.0 → **v3.06.0**. Coverage tally unchanged this cycle (snapshot, not falsifier flip). + +Evidence persisted to: + +``` +evidence/ship-002-discharge-2026-05-10/ # SHIP-002 LIVE-discharge artifact +├── discharge-evidence-v1.json # 5-step verification chain + provenance +├── apr-run-output.txt # raw apr run log +├── fib-completion.py # extracted Python source +└── ast-parse-result.json # ast.parse verdict +``` + +The SHIP-006 BLOCKED finding does NOT yet have a dedicated evidence directory — by §61.7 design, snapshot in spec is sufficient until the bisection (PRED-61-A/B) fires. + +--- + ## §58. v0.32.0 cascade publish + release-engineering hygiene snapshot (Issue #1514 CLOSED) (2026-05-05) §57 closed with the §50.4 drift-sweep complete and 5g.1 mid-flight at 13/57 shards. §58 records the parallel **release-engineering** track that landed during the same wait window: the v0.32.0 user-facing-crate cascade publish (Issue #1514 CLOSED) and the four hidden defects it surfaced + closed. This is the second hygiene amendment in a row — the first (§57) was contract-drift hygiene; this one is publish-pipeline hygiene.