diff --git a/docs/specifications/aprender-train/ship-two-models-spec.md b/docs/specifications/aprender-train/ship-two-models-spec.md
index 7e651832b..ca87ef074 100644
--- a/docs/specifications/aprender-train/ship-two-models-spec.md
+++ b/docs/specifications/aprender-train/ship-two-models-spec.md
@@ -1,7 +1,8 @@
 # Specification: Ship Two Models — Sovereign AI Stack Proof
 
 **Document ID:** SPEC-SHIP-TWO-001
-**Version:** 3.20.0
+**Version:** 3.21.0
+**Atomic next action (v3.21.0):** **🎉 §75 — MODEL-1 SHIP % = 100% — SHIP-007 LIVE-DISCHARGED via F32 GEMV PTX layout fix (2026-05-13)** (see new §75 below). PR-E (#1651) ships a single-file fix in `crates/aprender-gpu/src/kernels/gemv/mod.rs`: the F32 GEMV kernel assumed a `[K rows × N cols]` row-major weight layout, but actual ML weights are `[output_dim=N, input_dim=K]` row-major (PyTorch/SafeTensors/GGUF convention). The kernel was reading TRANSPOSED weights → systematically anti-correlated logits (cos=-0.005). The fix rewrites the inner loop to iterate over K within row `block_id`. Empirical discharge: `apr bench` 5-iter 128-tok decode = **124.6 tok/s** on RTX 4090 (4.15× over the AC-SHIP1-007 30 tok/s floor); PARITY-GATE PASS; default path, no workarounds. **All 10 AC-SHIP1-* LIVE-DISCHARGED.** **MODEL-1 ship %**: **99% → 100%** 🎉. **MODEL-2 ship %**: unchanged at **57%**. **Methodology lesson #22 NEW**: symptom analysis → bug-class localization in O(1); methodology lessons compose.
 **Atomic next action (v3.20.0):** **§74 — SHIP-007 bug LOCALIZED to LM head F32 GEMV via PR-B stage bisection (2026-05-13)** (see new §74 below). PR-B (#1649) APR_GPU_STAGE_DUMP scaffold captured GPU embedding + post_ffn_residual L27 + final_norm + lm_head + CPU lm_head on single BOS token. GPU intermediate values look numerically sane (post_ffn_residual rms=26, final_norm rms=2.84). Divergence emerges between final_norm and logits: GPU logits mean=0.013 vs CPU mean=-2.42 (Δ=2.43; CPU has Qwen's typical negative-bias signature). PMAT-333 dequantizes ALL weights to F32 on GPU upload (28.3 GB), so `WeightQuantType::from_size` returns F32 for LM head → dispatches `f32_gemv_into`. The F32 GEMV kernel is the localized bug surface. **Methodology lesson #21 NEW**: stage-by-stage numerical analysis can localize bug class without per-element diffing. **MODEL-1 ship %**: unchanged at **99%** (Layer 2 localized; PR-E for fix). **MODEL-2 ship %**: unchanged at **57%**. Path-to-100% reduced to a single PR-E.
 **Atomic next action (v3.19.0):** §72 + §73 combined banner — see both sections below.
 **Atomic next action (v3.18.0 §73):** **§73 — SHIP-007 cascade reduced from 3 layers to 1 on re-measurement; only Layer 2 (parity) blocks (2026-05-12)** (see new §73 below). §63's 2026-05-11 3-layer blocker stack — (1) FP8 warmup ILLEGAL_ADDRESS, (2) GPU-vs-CPU parity cos=-0.005190, (3) throughput 5.6 vs 30 tok/s floor — re-measured on 2026-05-12 lambda-vector RTX 4090 reveals 2 of 3 layers already discharged: **Layer 1 fixed** (`[PMAT-082] cuBLASLt FP8 JIT warmed (3584×16×3584)` succeeds), **Layer 3 meets floor** (54.5 tok/s @ 128-tok decode, 5-iter median, 1.82× headroom). Only **Layer 2 still blocks** (byte-identical cos=-0.005190 signature). Path to SHIP-007 LIVE-discharge reduced from "5-10 PR / 1-2 week cascade" to **"3-5 PR / 3-5 day single-layer fix"** — add `forward_gpu_traced` → wire `apr trace --device gpu --save-tensor all` → diff CPU vs GPU stage tensors → fix localized stage → discharge. **Methodology lesson #20 NEW**: re-measure cascade layers before continuing; stale state can be reduced cheaply. **MODEL-1 ship %**: unchanged at **99%** (Layer 2 still blocks). **MODEL-2 ship %**: unchanged at **57%**.
@@ -5189,6 +5190,104 @@ Spec v3.18.0 → **v3.19.0**.
 
 ---
 
+## §75. 🎉 MODEL-1 SHIP % = 100% — SHIP-007 LIVE-DISCHARGED via F32 GEMV PTX layout fix (2026-05-13)
+
+PR-E (#1651) ships the single-file F32 GEMV PTX layout fix that closes SHIP-007 (AC-SHIP1-007). MODEL-1 ship % crosses **99% → 100%**. SHIP-TWO-001 MODEL-1 is now fully ship-ready.
+
+### 75.1 The 10/10 LIVE-discharge table
+
+| AC | Discharge section | Path |
+|----|-------------------|------|
+| SHIP-001 | §72 | `apr run ` exit 0 |
+| SHIP-002 | §61 | `apr run "def fib(n):"` valid Python (#1609) |
+| SHIP-003 | §72 | `apr diff` 20 tensors at cos_sim=1.000000 |
+| SHIP-004 | §72 | `llama-cli` exit 0, 133.1 gen tok/s |
+| SHIP-005 | §71 | HumanEval pass@1 = 86.59% gx10 164-run |
+| SHIP-006 | §61.8 | `apr qa` 12-gate aggregate PASS (#1615) |
+| **SHIP-007** | **§75 (this section)** | **PARITY-GATE PASS + 124.6 tok/s @ 128-tok decode** |
+| SHIP-008 | §61 | `apr run` SHIP-008 USER → 256-token ChatML (#1614) |
+| SHIP-009 | §72 | `apr inspect` license/provenance fields |
+| SHIP-010 | §72 | sha256 match `0a854098…` |
+
+**10 of 10 AC-SHIP1-* LIVE-DISCHARGED.**
+
+### 75.2 SHIP-007 root cause + fix
+
+The F32 GEMV PTX kernel at `crates/aprender-gpu/src/kernels/gemv/mod.rs::GemvKernel::build_ptx` assumed the weight matrix `A` is `[K rows × N cols]` row-major: `A[i,j]` at offset `i*N + j`. The actual ML weight convention is `[output_dim=N, input_dim=K]` row-major: `A[i,j]` at offset `i*K + j` (PyTorch, SafeTensors, GGUF, and the dequantized lm_head all follow this).
+
+The kernel was therefore reading TRANSPOSED weights → it computed `y = A^T @ x` instead of `y = A @ x` → systematically anti-correlated logits (cos = -0.005190 vs CPU; top-10 divergences all sign-flipped; GPU mean=0.013 vs CPU mean=-2.42).
+
+The fix: rewrite the inner loop to iterate over K within row `block_id` (see the CPU-side sketch in §75.6 below):
+- `row_base = a_ptr + block_id * K * 4`
+- thread `t` reads `A[block_id, t]`, `A[block_id, t+32]`, …
+
+### 75.3 Empirical discharge proof
+
+```
+$ apr bench --iterations 5 --max-tokens 128 --json
+{
+  "iterations": 5,
+  "median_time_ms": 1016.4,
+  "tokens_per_second": 124.6,
+  "passed": true,
+  "latency_p50_ms": 1016.4,
+  "latency_p95_ms": 1073.3,
+  "time_to_first_token_ms": 8.39
+}
+```
+
+- AC-SHIP1-007 floor: 30 tok/s
+- Headroom: **4.15× over floor**
+- PARITY-GATE: PASS (no error from `forward_gpu_resident`)
+- Default path (CUDA graphed, no `SKIP_PARITY_GATE`, no `APR_SKIP_FP8_WARMUP`)
+
+### 75.4 Cascade arc — full closeout
+
+| § | Date | Discovery | Impact |
+|---|------|-----------|--------|
+| 63 | 2026-05-11 | SHIP-007 framed as 3-layer cascade (FP8 + parity + throughput) | scope identified |
+| 73 | 2026-05-12 | Re-measurement: 2/3 layers already fixed; only parity blocks | scope -3× |
+| **74** | **2026-05-13** | **Bug LOCALIZED to F32 GEMV via PR-B stage bisection** | scope -10× |
+| 75 | 2026-05-13 | **PR-E layout fix → MODEL-1 100%** | DISCHARGED |
+
+This lands within §73's "3-5 PR / 3-5 day" estimate: 4 PRs (PR-A contract, PR-B scaffold, §74 docs, PR-E fix) shipped over 2 calendar days.
+
+### 75.5 Ship-% movement
+
+- **MODEL-1 ship %**: **99% → 100%** 🎉
+- **MODEL-2 ship %**: unchanged at **57%** (independent track, gated on step 5g.3 val_loss < 9.38)
+
+### 75.6 Methodology lesson #22 (NEW)
+
+**Symptom analysis → bug-class localization in O(1) when you know the symptom.** §74 captured CPU vs GPU stage-level stats.
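+
+To make the lesson concrete, below is a minimal CPU-side Rust sketch of the two indexings. It is an illustration only, with toy dimensions and hypothetical function names (`gemv_row_major`, `gemv_transposed_read`), not the shipped PTX builder: it contrasts the correct `[N, K]` row-major read with the transposed read the buggy kernel performed, and shows that for random weights the two outputs are essentially uncorrelated, matching the near-zero cos ≈ -0.005 signature the §74 stage dump surfaced.
+
+```rust
+// Hypothetical sketch (not the shipped kernel): contrast the correct
+// [N, K] row-major GEMV read with the transposed read of the same buffer.
+
+/// Correct GEMV over a weight buffer stored [output_dim = n, input_dim = k]
+/// row-major (PyTorch / SafeTensors / GGUF convention):
+/// y[i] = sum_j A[i,j] * x[j], with A[i,j] at offset i*k + j, which mirrors
+/// PR-E's `row_base = a_ptr + block_id * K * 4` byte-offset fix.
+fn gemv_row_major(w: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
+    (0..n)
+        .map(|i| (0..k).map(|j| w[i * k + j] * x[j]).sum::<f32>())
+        .collect()
+}
+
+/// Buggy read: the kernel assumed the buffer was [k rows × n cols] row-major,
+/// so for output row i it gathered offsets j*n + i: a transposed/scrambled
+/// read of the actual [n, k] buffer (still in bounds, since (k-1)*n + i < n*k).
+fn gemv_transposed_read(w: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
+    (0..n)
+        .map(|i| (0..k).map(|j| w[j * n + i] * x[j]).sum::<f32>())
+        .collect()
+}
+
+/// Cosine similarity between two equal-length vectors.
+fn cosine(a: &[f32], b: &[f32]) -> f32 {
+    let dot: f32 = a.iter().zip(b).map(|(p, q)| p * q).sum();
+    let na = a.iter().map(|v| v * v).sum::<f32>().sqrt();
+    let nb = b.iter().map(|v| v * v).sum::<f32>().sqrt();
+    dot / (na * nb)
+}
+
+fn main() {
+    // Deterministic xorshift fill so the sketch needs no external crates.
+    let mut state: u64 = 0x1234_5678_9abc_def0;
+    let mut next = move || {
+        state ^= state << 13;
+        state ^= state >> 7;
+        state ^= state << 17;
+        (state as f64 / u64::MAX as f64) as f32 - 0.5
+    };
+
+    // Toy dimensions standing in for (vocab, hidden); not the real model dims.
+    let (n, k) = (1024, 512);
+    let w: Vec<f32> = (0..n * k).map(|_| next()).collect();
+    let x: Vec<f32> = (0..k).map(|_| next()).collect();
+
+    let good = gemv_row_major(&w, &x, n, k);
+    let bad = gemv_transposed_read(&w, &x, n, k);
+    // Expect |cos| ≈ 0 for random weights: the transposed read decorrelates
+    // the outputs rather than merely perturbing them.
+    println!("cos(correct, transposed-read) = {:.6}", cosine(&good, &bad));
+}
+```
+
+The shipped fix is the corresponding address-arithmetic change in the PTX builder (`row_base = a_ptr + block_id * K * 4`, per §75.2); the sketch only illustrates why the wrong read yields near-zero correlation rather than a small numerical error.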
+
+The signature — sign-flipped top-K divergences, CPU mean=-2.4 vs GPU mean=0, intermediate stages numerically sane — matches **exactly one bug class**: transposed matmul. Once we knew the kernel was reading transposed weights, the bug was visible in the PTX builder code within seconds (lines 86-87: `col_offset = block_id * 4` instead of `row_offset = block_id * K * 4`).
+
+Lessons #16-21 (compose falsifiers, stage-by-stage stats, predict-then-verify, re-measure cascade) **compose**. Each makes the next cheaper.
+
+### 75.7 Cumulative methodology lessons through §75
+
+| # | Lesson |
+|---|--------|
+| 6-21 | (see §74) |
+| **22** | **Symptom analysis → bug-class localization in O(1). Methodology lessons compose; each makes the next cheaper.** |
+
+### 75.8 What §75 is NOT
+
+§75 does NOT:
+- Modify MODEL-2 (independent track, ship % stays at 57%)
+- Discharge any benchmark beyond AC-SHIP1-007 (HumanEval/MBPP unchanged; SHIP-005 stays at 86.59% from §71)
+- Imply publish-readiness — GATE-SHIP-001/002/003 still need green CI + post-publish QA per `feedback_post_publish_qa_required.md`
+
+§75 records that **all 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical 7B Qwen2.5-Coder-Instruct Q4_K_M teacher on lambda-vector RTX 4090**. This is the contract for AC-SHIP1-* completion.
+
+Evidence:
+- `evidence/section-75-ship-007-discharged-2026-05-13/findings.json`
+- `evidence/section-75-ship-007-discharged-2026-05-13/ship-007-bench-discharged.json` (5-iter 128-tok bench, 124.6 tok/s)
+- Predecessor: `evidence/section-74-ship-007-bisection-2026-05-13/findings.json` (bug localized)
+
+Spec v3.20.0 → **v3.21.0** (post-§72/73 stack at 3.18, §74 at 3.20, §75 here at 3.21 — MODEL-1 100%).
+
+---
+
 ## §63. SHIP-007 empirical floor — CUDA structurally broken on Qwen 7B; multi-PR cascade scope (2026-05-11)
 
 SHIP-007 (decode tps ≥ 30 tok/s on RTX 4090 with `--features cuda` per AC-SHIP1-007) was the last §17.5 PARTIAL hypothesized to discharge from §60 closure. §63 records the LIVE empirical investigation that revealed SHIP-007 is **multi-PR cascade scope**, not a tight 1-PR slice.