docs(spec): SHIP-TWO-001 §61 — post-§60 LIVE-discharge cascade + ChatML generation gap #1610

Merged
noahgift merged 1 commit into main from docs/ship-two-spec-section-61-post-60-generation-gap
May 10, 2026
Conversation

@noahgift
Contributor

Summary

Records the empirical findings from this session's LIVE-discharge cascade attempt off §60. §60 closed forward parity (layer-3 ratio 18.23× → 1.245× ∈ [0.5, 2.0]); §61 records the new finding that forward parity does not imply generation parity across prompt formats.

Two-track outcome

DIRECT prompt → GREEN: SHIP-002 LIVE-discharged via PR #1609. apr run --prompt "def fib(n):" --max-tokens 128 on canonical 7B APR teacher emits coherent fib() Python (ast.parse 0 syntax errors, 68 nodes, 1 FunctionDef "fib").
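The `ast.parse` checks cited above (0 syntax errors, node count, one FunctionDef named "fib") can be reproduced with a short validator. This is a sketch of that style of check, not the actual SHIP-002 gate code:

```python
import ast

def validate_python(source: str) -> dict:
    """Parse generated text as Python and report the same statistics
    the PR cites: syntax validity, AST node count, FunctionDef names."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return {"ok": False, "error": str(e)}
    nodes = sum(1 for _ in ast.walk(tree))
    fns = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    return {"ok": True, "nodes": nodes, "functions": fns}

sample = (
    "def fib(n):\n"
    "    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
)
print(validate_python(sample))
```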

ChatML-wrapped prompt → BLOCKED: SHIP-006/008 cannot LIVE-discharge today. Same canonical teacher fails apr qa golden_output gate with "gibberish (fragment '\ns\ns' repeats 3+ times)" under <|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n.
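The "gibberish" verdict quoted above suggests a repeated-fragment heuristic. Below is a minimal sketch of such a detector; the real `apr qa golden_output` heuristic is not shown in this PR, so the fragment length and repeat threshold here are assumptions:

```python
def looks_gibberish(text: str, frag_len: int = 4, min_repeats: int = 3) -> bool:
    """Flag output in which any short fragment occurs min_repeats or
    more times consecutively (e.g. '\ns\ns' repeating 3+ times).
    Hypothetical heuristic, not the apr implementation."""
    for i in range(max(0, len(text) - frag_len + 1)):
        frag = text[i:i + frag_len]
        if frag * min_repeats in text:
            return True
    return False
```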

Five-Whys

  1. Why §61 needed? §60 closed forward parity but SHIP-006/008 LIVE-discharge attempts failed empirically.
  2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is a binding criterion only at the activation-stats level; arg-max sampling under cumulative drift is not directly bounded.
  3. Why does prompt format matter? Direct prompts → high-confidence next-token regime where drift doesn't flip arg-max. ChatML → low-margin regime where drift CAN flip arg-max.
  4. Why record rather than fix now? The bug is multi-PR in scope (special-token handling vs cumulative drift still needs bisection); PRED-61-A/B set up the next falsifiable step.
  5. Why a durable spec amendment? Each day the spec doesn't reflect the §60 → §61 separation, future sessions may misinterpret §60 closure as a full SHIP-007-class discharge.
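The regime distinction in Why 3 can be illustrated with a toy logit model: the same additive drift almost never flips a large arg-max margin but frequently flips a small one. The logit values and drift scale below are illustrative assumptions, not measured APR drift:

```python
import numpy as np

rng = np.random.default_rng(0)
high_margin = np.array([5.0, 1.0, 0.9])   # direct-prompt-like: clear winner
low_margin = np.array([2.00, 1.98, 0.5])  # ChatML-like: 0.02 margin

def argmax_flip_rate(logits, drift_std, trials=10_000):
    """Fraction of trials in which additive Gaussian drift of the
    given scale changes the arg-max token."""
    base = int(np.argmax(logits))
    noise = rng.normal(0.0, drift_std, size=(trials, logits.size))
    return float(np.mean(np.argmax(logits + noise, axis=1) != base))

print(argmax_flip_rate(high_margin, 0.05))  # near zero: 4.0 margin survives drift
print(argmax_flip_rate(low_margin, 0.05))   # substantial: 0.02 margin flips often
```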

§61.5 falsifiable next predictions

  • PRED-61-A: GGUF + ChatML on canonical 7B → clean? If GREEN, bug is APR-side in chat-template handling.
  • PRED-61-B: APR + direct continuation "What is 2+2? The answer is " (no ChatML) → clean? If GREEN, bug is special-token handling NOT cumulative drift.
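The only difference between the BLOCKED case and the PRED-61-B case is prompt framing. A sketch of the two prompt variants, using the ChatML wrapper quoted above (the helper name is hypothetical):

```python
def chatml_wrap(user_msg: str) -> str:
    """Qwen2-style ChatML wrapper, as quoted in the BLOCKED evidence.
    Helper name is hypothetical, not an apr API."""
    return (f"<|im_start|>user\n{user_msg}<|im_end|>\n"
            f"<|im_start|>assistant\n")

# The two framings under bisection: same question, different regime.
blocked_prompt = chatml_wrap("What is 2+2?")      # SHIP-006/008 path
pred_61_b_prompt = "What is 2+2? The answer is "  # PRED-61-B direct path
```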

Changes (1 file)

  • docs/specifications/aprender-train/ship-two-models-spec.md:
    • Atomic next action banner: v3.05.0 → v3.06.0
    • New §61 section (above §58, newest-first ordering) with 7 sub-sections (separation table / direct-prompt evidence / ChatML-prompt evidence / §60→§61 separation rationale / falsifiable predictions / ship-% movement / what §61 is NOT)

Ship-% Movement

MODEL-1 91% → 92% (1 of 5 §17.5 PARTIALs LIVE).

Cascade this session: 6 PRs in 24h working SHIP-TWO-001 (#1604, #1606, #1607, #1608, #1609, this PR).

Test Plan

  • No code changes; pure documentation
  • Section format consistent with §58 (newest-first, sub-sections numbered §61.X)
  • All 6 cascade PRs referenced explicitly
  • Methodological alignment: zero eprintln!, all evidence via existing apr CLI primitives

🤖 Generated with Claude Code

…ML generation gap (PMAT-CODE-SHIP-TWO-SECTION-61)

Records the empirical findings from this session's LIVE-discharge
cascade attempt off §60. Two-track outcome:

DIRECT PROMPT (SHIP-002): GREEN.
`apr run /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr
--prompt "def fib(n):" --max-tokens 128` produces clean fib() Python
(`ast.parse` 0 syntax errors, 68 nodes, 1 FunctionDef "fib"). LIVE
discharged via PR #1609 (`qwen2-e2e-verification-v1.yaml` v1.10.0 →
v1.12.0).

CHATML PROMPT (SHIP-006/008): BLOCKED.
Same canonical 7B teacher fails `apr qa golden_output` gate with
"gibberish (fragment '\\ns\\ns' repeats 3+ times)" under ChatML wrapper
`<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n`.
Same model + same engine + different prompt format → different
output regime.

The §60 closure proved per-layer FORWARD parity within Q4K tolerance
(layer-3 ratio 1.245× ∈ [0.5, 2.0] on canonical 7B). It did NOT prove
GENERATION parity under arbitrary prompt distributions. §61 separates
these two invariants and surfaces the asymmetry as a NEW finding.
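As a sketch of what the in-band check looks like: the spec states only that the layer-3 activation ratio must fall in [0.5, 2.0]; the RMS statistic below is an assumption about how such a per-layer ratio could be computed, not the apr implementation:

```python
import numpy as np

def layer_parity_ratio(ref_act: np.ndarray, apr_act: np.ndarray) -> float:
    """Per-layer RMS-activation ratio (APR / reference). Hypothetical
    form of the statistic; the spec gives only the ratio values."""
    rms = lambda x: float(np.sqrt(np.mean(np.square(x))))
    return rms(apr_act) / rms(ref_act)

def forward_parity_ok(ratio: float, lo: float = 0.5, hi: float = 2.0) -> bool:
    """§60 acceptance band for the per-layer ratio."""
    return lo <= ratio <= hi

print(forward_parity_ok(1.245))  # §60 closed value: in band
print(forward_parity_ok(18.23))  # pre-§60 value: out of band
```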

Five-Whys for the §61 amendment:
1. Why is §61 needed? §60 closed forward parity but SHIP-006/008
   LIVE-discharge attempts failed empirically.
2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is binding
   criterion only at the activation-stats level; arg-max sampling
   under cumulative drift is not directly bounded.
3. Why does prompt format matter? Direct prompts ("def fib(n):") put
   model in high-confidence next-token regime where small drift
   doesn't flip arg-max. ChatML prompts (instruction-following,
   chain-of-thought initialization) put model in low-margin regime
   where drift CAN flip arg-max.
4. Why record this in spec rather than just fix? The bug is multi-PR
   scope (special-token handling vs cumulative drift bisection
   needed). PRED-61-A/B set up the next falsifiable diagnostic step.
5. Why now (durable spec rather than evidence-only)? Each day the
   spec doesn't reflect the §60 → §61 separation, future sessions
   may misinterpret §60 closure as full SHIP-007-class discharge.

§61.5 falsifiable predictions:
- PRED-61-A: GGUF + ChatML on canonical 7B → clean output? If GREEN,
  bug is APR-side in chat-template handling.
- PRED-61-B: APR + direct continuation prompt "What is 2+2? The answer
  is " (no ChatML wrapper) → clean output? If GREEN, bug is special-
  token handling NOT cumulative drift.

If both PRED-61-A and PRED-61-B are GREEN, the bug is bounded to
"APR + ChatML special-token path" — multi-PR scope but tractable.
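The bisection logic above can be written as a small decision table. Only the GREEN branches are stated in the PR; the fall-through wording here is an assumption, not spec text:

```python
def bisect_verdict(pred_a_green: bool, pred_b_green: bool) -> str:
    """Decision table for the PRED-61-A/B bisection outcomes."""
    if pred_a_green and pred_b_green:
        return "APR + ChatML special-token path"
    if pred_a_green:
        return "APR-side chat-template handling"
    if pred_b_green:
        return "special-token handling, not cumulative drift"
    return "inconclusive: further bisection needed"
```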

Changes (1 file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.05.0 → v3.06.0; new banner
    summarizing §61 (one paragraph, 1 of 5 §17.5 PARTIALs LIVE,
    SHIP-002 evidence, SHIP-006/008 BLOCKED, PRED-61-A/B set up).
  - New §61 section above §58 (newest-first ordering): 7
    sub-sections (61.1 separation table, 61.2 direct-prompt evidence,
    61.3 ChatML-prompt evidence, 61.4 §60→§61 separation rationale,
    61.5 falsifiable next investigation step, 61.6 ship-% movement,
    61.7 what §61 is NOT).

Validation:
- Spec section format consistent with §58 (newest-first, dated, sub-
  sections numbered §61.X).
- All 6 cascade PRs from this session referenced explicitly (#1604,
  #1606, #1607, #1608, #1609, this PR).
- Ship-% movement quantified: MODEL-1 91% → 92% (1 of 5 PARTIALs).
- Methodological alignment: zero eprintln!, zero bash workarounds;
  all evidence captured via existing apr CLI primitives.

Refs:
- evidence/ship-002-discharge-2026-05-10/ (LIVE evidence directory)
- contracts/qwen2-e2e-verification-v1.yaml v1.12.0 (SHIP-002 DISCHARGED)
- contracts/apr-vs-gguf-forward-parity-v1.yaml v1.2.0 (parent PR #1608)
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_test_methodology_can_fake_bugs.md
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #29 PMAT-CODE-SHIP-TWO-SECTION-61.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 10, 2026 12:20
@noahgift noahgift merged commit bbbdf6b into main May 10, 2026
11 checks passed
@noahgift noahgift deleted the docs/ship-two-spec-section-61-post-60-generation-gap branch May 10, 2026 12:46
