docs(spec): SHIP-TWO-001 §61.8 — PRED-61-A/B fired, refined 3-way bug taxonomy#1611

Open
noahgift wants to merge 2 commits into main from docs/ship-two-spec-section-61-8-pred-fired


@noahgift
Contributor

Summary

Same-day continuation of §61 (PR #1610). Both PRED-61-A and PRED-61-B fired live on the canonical 7B teacher, surfacing a refined 3-way bug taxonomy.

Stacks on PR #1610 — this branch carries both the §61 and §61.8 commits. Once #1610 merges, this PR simplifies to just §61.8.

Predictions Fired

PRED-61-B GREEN (predicted):

  • apr run <APR teacher> --prompt "What is 2+2? The answer is " --max-tokens 324
  • Wall: 79.09s. Confirms APR direct path is semantically correct.

PRED-61-A RED — but in an unexpected way:

  • apr run <GGUF teacher> emits byte-identical "ampiezza = 0.5\ndiametro = 10\n..." Italian gibberish across THREE distinct prompts (direct continuation / ChatML wrapper / conversational).
  • Wall times: 48.73s / 48.68s / 39.65s — different (inference IS running, not cached), but the output text matches byte-for-byte.
  • This is a prompt-insensitive structural bug in the GGUF inference path.
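The byte-identical output across distinct prompts is the signature here. A minimal sketch of the check, assuming each output was captured to a text file via `apr run ... --prompt ...` (the strings below are abbreviated stand-ins, not the full captured output):

```python
# Sketch of the prompt-insensitivity check behind PRED-61-A's RED verdict.
# The strings below are abbreviated stand-ins for the captured outputs;
# real runs would come from `apr run <model> --prompt ...` piped to files.

def is_prompt_insensitive(outputs: dict) -> bool:
    """True when every prompt produced byte-identical generated text."""
    texts = list(outputs.values())
    return len(texts) > 1 and all(t == texts[0] for t in texts)

gguf_outputs = {
    "direct":         "ampiezza = 0.5\ndiametro = 10\n",
    "chatml":         "ampiezza = 0.5\ndiametro = 10\n",
    "conversational": "ampiezza = 0.5\ndiametro = 10\n",
}
print(is_prompt_insensitive(gguf_outputs))  # → True: the structural-bug signature
```

Differing wall times with identical bytes are what rule out a response cache and point at the input path instead.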

Refined 3-Way Bug Taxonomy

| Path | Output | Verdict | Bug scope |
| --- | --- | --- | --- |
| APR + direct | Coherent, prompt-correlated | WORKING | Matches §60 |
| APR + ChatML | "\ns\ns\ns…" degenerate | BROKEN | APR-side ChatML special-token handling |
| GGUF + any prompt | Byte-identical "ampiezza..." | BROKEN | GGUF input-handling / state-init |

Two Independent Investigation Branches

  • Branch A: APR ChatML degenerate output. Bisect via apr trace --payload on layer-0 attn_norm at the first generated-token position.
  • Branch B: GGUF prompt-insensitive canned output. Instrument realizar::inference::forward to log the actual token IDs reaching the embedding lookup.

§17.5 PARTIALs Per Branch

  • SHIP-006 (apr qa golden_output) co-blocked on Branch A AND Branch B
  • SHIP-008 (chat template render) blocked on Branch A
  • SHIP-005 (HumanEval) likely blocked on Branch B
  • SHIP-007 (decode tps ≥ 30) likely blocked on Branch B

Methodology Lesson #8

A falsifier's RED outcome may surface a DIFFERENT bug class than the one under investigation. PRED-61-A asked "is GGUF + ChatML clean?" — the answer is "no, but for an entirely different reason than ChatML special-token handling". Without the third-prompt control ("Hello"), §61.8's 3-way taxonomy would have collapsed into "all paths broken under ChatML", mis-localizing the bug.

Ship-% Movement

  • MODEL-1 ship %: stays at 92% (this refines the picture; it does NOT ship a fix or a LIVE-discharge).
  • MODEL-2 ship %: unchanged at 57% (gated on step 5g.3).

🤖 Generated with Claude Code

noahgift and others added 2 commits May 10, 2026 14:18
…ML generation gap (PMAT-CODE-SHIP-TWO-SECTION-61)

Records the empirical findings from this session's LIVE-discharge
cascade attempt off §60. Two-track outcome:

DIRECT PROMPT (SHIP-002): GREEN.
`apr run /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr
--prompt "def fib(n):" --max-tokens 128` produces clean fib() Python
(`ast.parse` 0 syntax errors, 68 nodes, 1 FunctionDef "fib"). LIVE
discharged via PR #1609 (`qwen2-e2e-verification-v1.yaml` v1.10.0 →
v1.12.0).

CHATML PROMPT (SHIP-006/008): BLOCKED.
Same canonical 7B teacher fails `apr qa golden_output` gate with
"gibberish (fragment '\\ns\\ns' repeats 3+ times)" under ChatML wrapper
`<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n`.
Same model + same engine + different prompt format → different
output regime.

The §60 closure proved per-layer FORWARD parity within Q4K tolerance
(layer-3 ratio 1.245× ∈ [0.5, 2.0] on canonical 7B). It did NOT prove
GENERATION parity under arbitrary prompt distributions. §61 separates
these two invariants and surfaces the asymmetry as a NEW finding.

Five-Whys for the §61 amendment:
1. Why is §61 needed? §60 closed forward parity but SHIP-006/008
   LIVE-discharge attempts failed empirically.
2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is binding
   criterion only at the activation-stats level; arg-max sampling
   under cumulative drift is not directly bounded.
3. Why does prompt format matter? Direct prompts ("def fib(n):") put
   the model in a high-confidence next-token regime where small drift
   doesn't flip the arg-max. ChatML prompts (instruction-following,
   chain-of-thought initialization) put the model in a low-margin
   regime where drift CAN flip the arg-max.
4. Why record this in spec rather than just fix? The bug is multi-PR
   scope (special-token handling vs cumulative drift bisection
   needed). PRED-61-A/B set up the next falsifiable diagnostic step.
5. Why now (durable spec rather than evidence-only)? Each day the
   spec doesn't reflect the §60 → §61 separation, future sessions
   may misinterpret §60 closure as full SHIP-007-class discharge.
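The arg-max-flipping claim in why #3 can be illustrated with made-up logits (illustrative numbers only, not measured Q4K activations): the same small drift vector leaves a high-margin distribution's arg-max alone but flips a near-tied one.

```python
# Illustrative numbers only — not measured activations. The same small
# drift cannot flip the arg-max when the top logit's margin is large, but
# can when two candidates are nearly tied (the low-margin ChatML regime).
def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

drift = [-0.05, 0.08, 0.02]        # stand-in for cumulative quantization drift

high_margin = [9.00, 2.00, 1.00]   # "def fib(n):"-style confident continuation
low_margin  = [3.01, 3.00, 1.00]   # instruction-following near-tie

def drifted(logits):
    return [x + d for x, d in zip(logits, drift)]

print(argmax(high_margin), argmax(drifted(high_margin)))  # 0 0 — stable
print(argmax(low_margin), argmax(drifted(low_margin)))    # 0 1 — flipped
```

This is why forward parity within a per-layer tolerance (§60) does not bound generation parity: the bound constrains activation error, not the margin of the next-token distribution it lands in.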

§61.5 falsifiable predictions:
- PRED-61-A: GGUF + ChatML on canonical 7B → clean output? If GREEN,
  bug is APR-side in chat-template handling.
- PRED-61-B: APR + direct continuation prompt "What is 2+2? The answer
  is " (no ChatML wrapper) → clean output? If GREEN, bug is special-
  token handling NOT cumulative drift.

If both PRED-61-A and PRED-61-B are GREEN, the bug is bounded to
"APR + ChatML special-token path" — multi-PR scope but tractable.
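The PRED-61-A/B localization logic above can be sketched as a small decision function (the function name and return strings are illustrative, not part of the apr CLI or spec):

```python
# Hedged sketch of the §61.5 localization logic as stated above.
# Names and strings are illustrative only.
def localize_bug(pred_61_a_green: bool, pred_61_b_green: bool) -> str:
    if pred_61_a_green and pred_61_b_green:
        # Both GREEN: bounded to the APR + ChatML special-token path.
        return "APR + ChatML special-token path"
    if pred_61_b_green:
        # APR direct is clean: special-token handling, not cumulative drift.
        return "special-token handling, not cumulative drift"
    if pred_61_a_green:
        # GGUF + ChatML is clean: bug is APR-side chat-template handling.
        return "APR-side chat-template handling"
    return "undetermined — needs further disambiguation"

print(localize_bug(True, True))  # → APR + ChatML special-token path
```

As §61.8 records, reality landed outside this table: PRED-61-A came back RED in a shape the predictions didn't anticipate, which is what forced the three-prompt control.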

Changes (1 file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.05.0 → v3.06.0; new banner
    summarizing §61 (one paragraph, 1 of 5 §17.5 PARTIALs LIVE,
    SHIP-002 evidence, SHIP-006/008 BLOCKED, PRED-61-A/B set up).
  - New §61 section above §58 (newest-first ordering): 7
    sub-sections (61.1 separation table, 61.2 direct-prompt evidence,
    61.3 ChatML-prompt evidence, 61.4 §60→§61 separation rationale,
    61.5 falsifiable next investigation step, 61.6 ship-% movement,
    61.7 what §61 is NOT).

Validation:
- Spec section format consistent with §58 (newest-first, dated, sub-
  sections numbered §61.X).
- All 6 cascade PRs from this session referenced explicitly (#1604,
  #1606, #1607, #1608, #1609, this PR).
- Ship-% movement quantified: MODEL-1 91% → 92% (1 of 5 PARTIALs).
- Methodological alignment: zero eprintln!, zero bash workarounds;
  all evidence captured via existing apr CLI primitives.

Refs:
- evidence/ship-002-discharge-2026-05-10/ (LIVE evidence directory)
- contracts/qwen2-e2e-verification-v1.yaml v1.12.0 (SHIP-002 DISCHARGED)
- contracts/apr-vs-gguf-forward-parity-v1.yaml v1.2.0 (parent PR #1608)
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_test_methodology_can_fake_bugs.md
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #29 PMAT-CODE-SHIP-TWO-SECTION-61.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… taxonomy (PMAT-CODE-SHIP-TWO-SECTION-61-8)

Same-day continuation of §61. Both falsifiable predictions fired on
noah-Lambda-Vector RTX 4090 (apr v0.32.0 post-e856eb91f).

PRED-61-B GREEN (predicted):
  apr run <APR teacher> --prompt "What is 2+2? The answer is " → "4"
  Wall: 79.09s. Confirms APR forward path under direct prompts is
  semantically correct. Matches §60 closure.

PRED-61-A RED — but in an unexpected way:
  apr run <GGUF teacher> emits byte-identical
    "ampiezza = 0.5\ndiametro = 10\naltezza = 20\n# Calcolo del volume\nvolume = ("
  across THREE distinct prompts:
    1. "What is 2+2? The answer is " (direct continuation)
    2. "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n" (ChatML)
    3. "Hello, my name is" (conversational, no question)
  Wall times: 48.73s / 48.68s / 39.65s — different (proving inference
  IS running, not cached), but output text matches byte-for-byte.

This is a PROMPT-INSENSITIVE GGUF generation bug — input tokens are
dropped or ignored, or the model state is initialized to a fixed
configuration before the forward pass starts.
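A toy model of that hypothesis (names and the canned string are illustrative, not realizar internals): if the prompt never reaches the forward pass, every prompt collapses to one continuation.

```python
# Toy model of the hypothesized GGUF bug: the prompt is accepted but never
# reaches the embedding lookup, so decoding always starts from the same
# fixed state and emits one canned continuation. Illustrative only.
CANNED = "ampiezza = 0.5\ndiametro = 10\n"

def buggy_generate(prompt: str) -> str:
    _ = prompt  # dropped / ignored before the forward pass
    return CANNED

prompts = [
    "What is 2+2? The answer is ",
    "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n",
    "Hello, my name is",
]
unique_outputs = {buggy_generate(p) for p in prompts}
print(len(unique_outputs))  # → 1: prompt-insensitive, byte-identical output
```

Branch B's instrumentation of the token IDs at the embedding lookup is the direct falsifier for this toy model: if the logged IDs differ per prompt while the output does not, the fixed state sits after the lookup instead.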

Five-Whys for the §61.8 amendment:
1. Why §61.8? Both PRED-61-A and PRED-61-B fired; need durable record.
2. Why three prompts on GGUF? PRED-61-A's RED outcome in unexpected
   shape required disambiguation — was it ChatML-specific or
   structural? Three distinct prompts confirm structural.
3. Why does this matter? §61's 2-way picture (APR ChatML BROKEN /
   APR direct WORKING) was incomplete. Reality is 3-way: APR direct
   WORKING, APR ChatML BROKEN with \ns\ns repetition, GGUF any-prompt
   BROKEN with prompt-insensitive canned output.
4. Why split into two branches? Branch A (APR ChatML) and Branch B
   (GGUF prompt-insensitive) are independent — different code paths,
   different failure modes, different fix scopes.
5. Why methodology lesson #8? PRED-61-A asked "is GGUF + ChatML clean?"
   and the answer is "no, but for an entirely different reason than
   ChatML special-token handling". Without the third-prompt control
   (Hello), the §61.8 taxonomy would have collapsed into "all paths
   broken under ChatML" which would mis-localize.

§61.8 amendments to spec (1 file):
- Atomic next action banner: v3.06.0 → v3.07.0
- Add §61.8 sub-section above the closing --- divider of §61, with:
  - 61.8.0: empirical PRED firing (apr run examples + outputs)
  - 61.8.1: refined 3-way bug taxonomy (table)
  - 61.8.2: Branch A vs Branch B independent investigation cascades
  - 61.8.3: ship-% movement (stays 92%) + per-SHIP* blocker mapping
  - 61.8.4: methodology lesson #8 (RED outcome may surface different bug)

Evidence (NEW directory):
- evidence/section-61-8-pred-fired-2026-05-10/
  - pred-61-b-apr-direct.txt (29 lines, "4" output)
  - pred-61-a-gguf-direct.txt (32 lines, Italian "ampiezza...")
  - pred-61-a-gguf-chatml.txt (32 lines, byte-identical Italian)
  - gguf-third-prompt.txt (28 lines, "Hello..." → byte-identical)
  - findings.json (structured 3-way taxonomy + investigation branches)

Validation:
- Section format consistent with §61.1-61.7 (numbered §61.X.N sub-
  sub-sections under §61.8).
- All evidence files referenced in spec body.
- Methodological alignment: zero eprintln!, all evidence via apr
  run + tail to text files.

Spec movement:
- v3.06.0 → v3.07.0
- MODEL-1 ship %: stays at 92% (snapshot, not falsifier flip).
- MODEL-2 ship %: unchanged at 57%.

Refs:
- evidence/section-61-8-pred-fired-2026-05-10/findings.json
- SPEC-SHIP-TWO-001 §61.5 (PRED-61-A/B definitions)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #30 PMAT-CODE-SHIP-TWO-SECTION-61-8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 10, 2026 12:50