docs(spec): SHIP-TWO-001 §61.8 — PRED-61-A/B fired, refined 3-way bug taxonomy#1611
…ML generation gap (PMAT-CODE-SHIP-TWO-SECTION-61)

Records the empirical findings from this session's LIVE-discharge cascade attempt off §60. Two-track outcome:

DIRECT PROMPT (SHIP-002): GREEN.
`apr run /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr --prompt "def fib(n):" --max-tokens 128` produces clean fib() Python (`ast.parse` 0 syntax errors, 68 nodes, 1 FunctionDef "fib"). LIVE discharged via PR #1609 (`qwen2-e2e-verification-v1.yaml` v1.10.0 → v1.12.0).

CHATML PROMPT (SHIP-006/008): BLOCKED.
The same canonical 7B teacher fails the `apr qa golden_output` gate with "gibberish (fragment '\ns\ns' repeats 3+ times)" under the ChatML wrapper `<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n`. Same model + same engine + different prompt format → different output regime.

The §60 closure proved per-layer FORWARD parity within Q4K tolerance (layer-3 ratio 1.245× ∈ [0.5, 2.0] on canonical 7B). It did NOT prove GENERATION parity under arbitrary prompt distributions. §61 separates these two invariants and surfaces the asymmetry as a NEW finding.

Five-Whys for the §61 amendment:
1. Why is §61 needed? §60 closed forward parity, but SHIP-006/008 LIVE-discharge attempts failed empirically.
2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is a binding criterion only at the activation-stats level; arg-max sampling under cumulative drift is not directly bounded.
3. Why does prompt format matter? Direct prompts ("def fib(n):") put the model in a high-confidence next-token regime where small drift doesn't flip the arg-max. ChatML prompts (instruction-following, chain-of-thought initialization) put the model in a low-margin regime where drift CAN flip the arg-max.
4. Why record this in the spec rather than just fix it? The bug is multi-PR scope (special-token handling vs cumulative-drift bisection needed). PRED-61-A/B set up the next falsifiable diagnostic step.
5. Why now (durable spec rather than evidence-only)? Each day the spec doesn't reflect the §60 → §61 separation, future sessions may misinterpret the §60 closure as a full SHIP-007-class discharge.

§61.5 falsifiable predictions:
- PRED-61-A: GGUF + ChatML on canonical 7B → clean output? If GREEN, the bug is APR-side in chat-template handling.
- PRED-61-B: APR + direct continuation prompt "What is 2+2? The answer is " (no ChatML wrapper) → clean output? If GREEN, the bug is special-token handling, NOT cumulative drift.

If both PRED-61-A and PRED-61-B are GREEN, the bug is bounded to "APR + ChatML special-token path" — multi-PR scope but tractable.

Changes (1 file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.05.0 → v3.06.0; new banner summarizing §61 (one paragraph, 1 of 5 §17.5 PARTIALs LIVE, SHIP-002 evidence, SHIP-006/008 BLOCKED, PRED-61-A/B set up).
  - New §61 section above §58 (newest-first ordering): 7 sub-sections (61.1 separation table, 61.2 direct-prompt evidence, 61.3 ChatML-prompt evidence, 61.4 §60 → §61 separation rationale, 61.5 falsifiable next investigation step, 61.6 ship-% movement, 61.7 what §61 is NOT).

Validation:
- Spec section format consistent with §58 (newest-first, dated, sub-sections numbered §61.X).
- All 6 cascade PRs from this session referenced explicitly (#1604, #1606, #1607, #1608, #1609, this PR).
- Ship-% movement quantified: MODEL-1 91% → 92% (1 of 5 PARTIALs).
- Methodological alignment: zero eprintln!, zero bash workarounds; all evidence captured via existing apr CLI primitives.

Refs:
- evidence/ship-002-discharge-2026-05-10/ (LIVE evidence directory)
- contracts/qwen2-e2e-verification-v1.yaml v1.12.0 (SHIP-002 DISCHARGED)
- contracts/apr-vs-gguf-forward-parity-v1.yaml v1.2.0 (parent PR #1608)
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_test_methodology_can_fake_bugs.md
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #29 PMAT-CODE-SHIP-TWO-SECTION-61.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
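The `ast.parse` gate cited above (0 syntax errors, 68 nodes, 1 FunctionDef "fib") can be approximated with a short standalone check. This is a sketch of that kind of validator, not the actual `apr qa golden_output` implementation; the sample completion below is illustrative, not the model's real 68-node output:

```python
import ast

def validate_python_output(text: str, expected_fn: str = "fib") -> dict:
    """Parse model-generated text and report syntax validity, total AST
    node count, and whether a FunctionDef with the expected name exists.
    Sketch of the ast.parse-style gate, not the real apr qa code."""
    try:
        tree = ast.parse(text)
    except SyntaxError:
        return {"ok": False, "nodes": 0, "has_fn": False}
    nodes = sum(1 for _ in ast.walk(tree))
    has_fn = any(
        isinstance(n, ast.FunctionDef) and n.name == expected_fn
        for n in ast.walk(tree)
    )
    return {"ok": True, "nodes": nodes, "has_fn": has_fn}

# Illustrative fib() completion, stand-in for the model's output:
sample = (
    "def fib(n):\n"
    "    if n < 2:\n"
    "        return n\n"
    "    return fib(n - 1) + fib(n - 2)\n"
)
result = validate_python_output(sample)
```

The same three fields (syntax ok, node count, named FunctionDef present) are what the evidence above reports for the real run.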
… taxonomy (PMAT-CODE-SHIP-TWO-SECTION-61-8)
Same-day continuation of §61. Both falsifiable predictions fired on
noah-Lambda-Vector RTX 4090 (apr v0.32.0 post-e856eb91f).
PRED-61-B GREEN (predicted):
apr run <APR teacher> --prompt "What is 2+2? The answer is " → "4"
Wall: 79.09s. Confirms APR forward path under direct prompts is
semantically correct. Matches §60 closure.
PRED-61-A RED — but in an unexpected way:
apr run <GGUF teacher> emits byte-identical
"ampiezza = 0.5\ndiametro = 10\naltezza = 20\n# Calcolo del volume\nvolume = ("
across THREE distinct prompts:
1. "What is 2+2? The answer is " (direct continuation)
2. "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n" (ChatML)
3. "Hello, my name is" (conversational, no question)
Wall times: 48.73s / 48.68s / 39.65s — different (proving inference
IS running, not cached), but output text matches byte-for-byte.
This is a PROMPT-INSENSITIVE GGUF generation bug: input tokens are
dropped or ignored, or the model state is initialized to a fixed
configuration before the forward pass starts.
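The byte-for-byte comparison behind this claim takes only a few lines. A minimal sketch, using an in-memory stand-in for the three captured outputs (the real comparison would read the pred-61-a-gguf-*.txt files from the evidence directory):

```python
def prompt_insensitive(outputs: list[str]) -> bool:
    """True when every generated output is byte-for-byte identical,
    the signature of generation that ignores its prompt entirely."""
    return len({o.encode("utf-8") for o in outputs}) == 1

# Illustrative stand-in for the canned output observed on all three prompts:
canned = "ampiezza = 0.5\ndiametro = 10\naltezza = 20\n"
```

Distinct wall times plus identical bytes is exactly the combination that rules out output caching while confirming prompt insensitivity.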
Five-Whys for the §61.8 amendment:
1. Why §61.8? Both PRED-61-A and PRED-61-B fired; need durable record.
2. Why three prompts on GGUF? PRED-61-A's RED outcome arrived in an
unexpected shape and required disambiguation — was it ChatML-specific
or structural? Three distinct prompts confirm it is structural.
3. Why does this matter? §61's 2-way picture (APR ChatML BROKEN /
APR direct WORKING) was incomplete. Reality is 3-way: APR direct
WORKING, APR ChatML BROKEN with \ns\ns repetition, GGUF any-prompt
BROKEN with prompt-insensitive canned output.
4. Why split into two branches? Branch A (APR ChatML) and Branch B
(GGUF prompt-insensitive) are independent — different code paths,
different failure modes, different fix scopes.
5. Why methodology lesson #8? PRED-61-A asked "is GGUF + ChatML clean?"
and the answer is "no, but for an entirely different reason than
ChatML special-token handling". Without the third-prompt control
(Hello), the §61.8 taxonomy would have collapsed into "all paths
broken under ChatML", which would mis-localize the bug.
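The "\ns\ns repeats 3+ times" gibberish signature referenced throughout this taxonomy reduces to a simple fragment-repetition test. A sketch under that definition; the real apr qa golden_output gate is not reproduced here:

```python
def fragment_repeats(text: str, fragment: str, min_repeats: int = 3) -> bool:
    """True when `fragment` occurs at least `min_repeats` times
    (non-overlapping), the degenerate-output signature used above."""
    return text.count(fragment) >= min_repeats
```

Applied to Branch A's output this fires on the "\ns\ns" fragment, while a healthy completion like "4" never triggers it.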
§61.8 amendments to spec (1 file):
- Atomic next action banner: v3.06.0 → v3.07.0
- Add §61.8 sub-section above the closing --- divider of §61, with:
- 61.8.0: empirical PRED firing (apr run examples + outputs)
- 61.8.1: refined 3-way bug taxonomy (table)
- 61.8.2: Branch A vs Branch B independent investigation cascades
- 61.8.3: ship-% movement (stays 92%) + per-SHIP* blocker mapping
- 61.8.4: methodology lesson #8 (RED outcome may surface different bug)
Evidence (NEW directory):
- evidence/section-61-8-pred-fired-2026-05-10/
- pred-61-b-apr-direct.txt (29 lines, "4" output)
- pred-61-a-gguf-direct.txt (32 lines, Italian "ampiezza...")
- pred-61-a-gguf-chatml.txt (32 lines, byte-identical Italian)
- gguf-third-prompt.txt (28 lines, "Hello..." → byte-identical)
- findings.json (structured 3-way taxonomy + investigation branches)
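The exact schema of findings.json is not reproduced in this PR body. A hypothetical minimal shape consistent with the 3-way taxonomy and the two investigation branches might look like the following; all field names are illustrative, only the taxonomy content itself comes from the spec:

```python
import json

# Hypothetical findings.json structure; field names are assumptions.
findings = {
    "taxonomy": [
        {"path": "APR + direct prompt", "status": "WORKING"},
        {"path": "APR + ChatML", "status": "BROKEN",
         "signature": "degenerate \\ns\\ns repetition"},
        {"path": "GGUF + any prompt", "status": "BROKEN",
         "signature": "prompt-insensitive canned output"},
    ],
    "branches": {"A": "APR ChatML", "B": "GGUF prompt-insensitive"},
}
serialized = json.dumps(findings, indent=2)
```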
Validation:
- Section format consistent with §61.1-61.7 (numbered §61.X.N
  sub-sub-sections under §61.8).
- All evidence files referenced in spec body.
- Methodological alignment: zero eprintln!, all evidence via apr
run + tail to text files.
Spec movement:
- v3.06.0 → v3.07.0
- MODEL-1 ship %: stays at 92% (snapshot, not falsifier flip).
- MODEL-2 ship %: unchanged at 57%.
Refs:
- evidence/section-61-8-pred-fired-2026-05-10/findings.json
- SPEC-SHIP-TWO-001 §61.5 (PRED-61-A/B definitions)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)
Closes task #30 PMAT-CODE-SHIP-TWO-SECTION-61-8.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Same-day continuation of §61 (PR #1610). Both PRED-61-A and PRED-61-B fired live on the canonical 7B teacher, surfacing a refined 3-way bug taxonomy.
Stacks on PR #1610 — this branch has §61 + §61.8 commits. When #1610 merges first, this PR will simplify to just §61.8.
Predictions Fired
PRED-61-B GREEN (predicted):
`apr run <APR teacher> --prompt "What is 2+2? The answer is " --max-tokens 32` → "4" ✓

PRED-61-A RED — but in an unexpected way:
`apr run <GGUF teacher>` emits the byte-identical Italian-looking output `"ampiezza = 0.5\ndiametro = 10\n..."` across THREE distinct prompts (direct continuation / ChatML wrapper / conversational).

Refined 3-Way Bug Taxonomy
- APR + direct prompt: WORKING
- APR + ChatML: BROKEN (degenerate `"\ns\ns\ns…"` repetition)
- GGUF + any prompt: BROKEN (prompt-insensitive canned `"ampiezza..."` output)

Two Independent Investigation Branches
- Branch A (APR ChatML): `apr trace --payload` on layer-0 attn_norm at the first generated-token position.
- Branch B (GGUF prompt-insensitive): instrument `realizar::inference::forward` to log the actual token IDs reaching embedding lookup.

§17.5 PARTIALs Per Branch
Methodology Lesson #8
A falsifier's RED outcome may surface a DIFFERENT bug class than the one being investigated. PRED-61-A asked "is GGUF + ChatML clean?" — the answer is "no, but for an entirely different reason than ChatML special-token handling". Without the third-prompt control ("Hello"), §61.8's 3-way taxonomy would have collapsed into "all paths broken under ChatML" — mis-localizing.
Ship-% Movement
🤖 Generated with Claude Code