docs(spec): SHIP-TWO-001 §61.8 — PRED-61-A/B fired, refined 3-way bug taxonomy#1611

Open
noahgift wants to merge 2 commits into main from docs/ship-two-spec-section-61-8-pred-fired


@noahgift
Contributor

Summary

Same-day continuation of §61 (PR #1610). Both PRED-61-A and PRED-61-B fired live on the canonical 7B teacher, surfacing a refined 3-way bug taxonomy.

Stacks on PR #1610 — this branch carries both the §61 and §61.8 commits. Once #1610 merges, this PR simplifies to just §61.8.

Predictions Fired

PRED-61-B GREEN (predicted):

  • apr run <APR teacher> --prompt "What is 2+2? The answer is " --max-tokens 324
  • Wall: 79.09s. Confirms APR direct path is semantically correct.

PRED-61-A RED — but in an unexpected way:

  • apr run <GGUF teacher> emits byte-identical "ampiezza = 0.5\ndiametro = 10\n..." Italian gibberish across THREE distinct prompts (direct continuation / ChatML wrapper / conversational).
  • Wall times: 48.73s / 48.68s / 39.65s — different (inference IS running, not cached), but the output text matches byte-for-byte.
  • This is a prompt-insensitive structural bug in the GGUF inference path.
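The byte-identical output across distinct prompts is the signature here. A minimal sketch of the check, assuming each output was captured to a text file via `apr run ... --prompt ...` (the strings below are abbreviated stand-ins, not the full captured output):

```python
# Sketch of the prompt-insensitivity check behind PRED-61-A's RED verdict.
# The strings below are abbreviated stand-ins for the captured outputs;
# real runs would come from `apr run <model> --prompt ...` piped to files.

def is_prompt_insensitive(outputs: dict) -> bool:
    """True when every prompt produced byte-identical generated text."""
    texts = list(outputs.values())
    return len(texts) > 1 and all(t == texts[0] for t in texts)

gguf_outputs = {
    "direct":         "ampiezza = 0.5\ndiametro = 10\n",
    "chatml":         "ampiezza = 0.5\ndiametro = 10\n",
    "conversational": "ampiezza = 0.5\ndiametro = 10\n",
}
print(is_prompt_insensitive(gguf_outputs))  # → True: the structural-bug signature
```

Differing wall times with identical bytes are what rule out a response cache and point at the input path instead.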

Refined 3-Way Bug Taxonomy

| Path | Output | Verdict | Bug scope |
| --- | --- | --- | --- |
| APR + direct | Coherent, prompt-correlated | WORKING | Matches §60 |
| APR + ChatML | "\ns\ns\ns…" degenerate | BROKEN | APR-side ChatML special-token handling |
| GGUF + any prompt | Byte-identical "ampiezza..." | BROKEN | GGUF input-handling / state-init |

Two Independent Investigation Branches

  • Branch A: APR ChatML degenerate output. Bisect via apr trace --payload on layer-0 attn_norm at the first generated-token position.
  • Branch B: GGUF prompt-insensitive canned output. Instrument realizar::inference::forward to log the actual token IDs reaching the embedding lookup.

§17.5 PARTIALs Per Branch

  • SHIP-006 (apr qa golden_output) co-blocked on Branch A AND Branch B
  • SHIP-008 (chat template render) blocked on Branch A
  • SHIP-005 (HumanEval) likely blocked on Branch B
  • SHIP-007 (decode tps ≥ 30) likely blocked on Branch B

Methodology Lesson #8

A falsifier's RED outcome may surface a DIFFERENT bug class than the one under investigation. PRED-61-A asked "is GGUF + ChatML clean?" — the answer is "no, but for an entirely different reason than ChatML special-token handling". Without the third-prompt control ("Hello"), §61.8's 3-way taxonomy would have collapsed into "all paths broken under ChatML", mis-localizing the bug.

Ship-% Movement

  • MODEL-1 ship %: stays at 92% (this refines the picture; it does NOT ship a fix or a LIVE-discharge).
  • MODEL-2 ship %: unchanged at 57% (gated on step 5g.3).

🤖 Generated with Claude Code

noahgift and others added 2 commits May 10, 2026 14:18
…ML generation gap (PMAT-CODE-SHIP-TWO-SECTION-61)

Records the empirical findings from this session's LIVE-discharge
cascade attempt off §60. Two-track outcome:

DIRECT PROMPT (SHIP-002): GREEN.
`apr run /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr
--prompt "def fib(n):" --max-tokens 128` produces clean fib() Python
(`ast.parse` 0 syntax errors, 68 nodes, 1 FunctionDef "fib"). LIVE
discharged via PR #1609 (`qwen2-e2e-verification-v1.yaml` v1.10.0 →
v1.12.0).

CHATML PROMPT (SHIP-006/008): BLOCKED.
Same canonical 7B teacher fails `apr qa golden_output` gate with
"gibberish (fragment '\\ns\\ns' repeats 3+ times)" under ChatML wrapper
`<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n`.
Same model + same engine + different prompt format → different
output regime.

The §60 closure proved per-layer FORWARD parity within Q4K tolerance
(layer-3 ratio 1.245× ∈ [0.5, 2.0] on canonical 7B). It did NOT prove
GENERATION parity under arbitrary prompt distributions. §61 separates
these two invariants and surfaces the asymmetry as a NEW finding.

Five-Whys for the §61 amendment:
1. Why is §61 needed? §60 closed forward parity but SHIP-006/008
   LIVE-discharge attempts failed empirically.
2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is binding
   criterion only at the activation-stats level; arg-max sampling
   under cumulative drift is not directly bounded.
3. Why does prompt format matter? Direct prompts ("def fib(n):") put
   the model in a high-confidence next-token regime where small drift
   doesn't flip the arg-max. ChatML prompts (instruction-following,
   chain-of-thought initialization) put the model in a low-margin
   regime where drift CAN flip the arg-max.
4. Why record this in spec rather than just fix? The bug is multi-PR
   scope (special-token handling vs cumulative drift bisection
   needed). PRED-61-A/B set up the next falsifiable diagnostic step.
5. Why now (durable spec rather than evidence-only)? Each day the
   spec doesn't reflect the §60 → §61 separation, future sessions
   may misinterpret §60 closure as full SHIP-007-class discharge.
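The arg-max-flipping claim in why #3 can be illustrated with made-up logits (illustrative numbers only, not measured Q4K activations): the same small drift vector leaves a high-margin distribution's arg-max alone but flips a near-tied one.

```python
# Illustrative numbers only — not measured activations. The same small
# drift cannot flip the arg-max when the top logit's margin is large, but
# can when two candidates are nearly tied (the low-margin ChatML regime).
def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

drift = [-0.05, 0.08, 0.02]        # stand-in for cumulative quantization drift

high_margin = [9.00, 2.00, 1.00]   # "def fib(n):"-style confident continuation
low_margin  = [3.01, 3.00, 1.00]   # instruction-following near-tie

def drifted(logits):
    return [x + d for x, d in zip(logits, drift)]

print(argmax(high_margin), argmax(drifted(high_margin)))  # 0 0 — stable
print(argmax(low_margin), argmax(drifted(low_margin)))    # 0 1 — flipped
```

This is why forward parity within a per-layer tolerance (§60) does not bound generation parity: the bound constrains activation error, not the margin of the next-token distribution it lands in.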

§61.5 falsifiable predictions:
- PRED-61-A: GGUF + ChatML on canonical 7B → clean output? If GREEN,
  bug is APR-side in chat-template handling.
- PRED-61-B: APR + direct continuation prompt "What is 2+2? The answer
  is " (no ChatML wrapper) → clean output? If GREEN, bug is special-
  token handling NOT cumulative drift.

If both PRED-61-A and PRED-61-B are GREEN, the bug is bounded to
"APR + ChatML special-token path" — multi-PR scope but tractable.
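The PRED-61-A/B localization logic above can be sketched as a small decision function (the function name and return strings are illustrative, not part of the apr CLI or spec):

```python
# Hedged sketch of the §61.5 localization logic as stated above.
# Names and strings are illustrative only.
def localize_bug(pred_61_a_green: bool, pred_61_b_green: bool) -> str:
    if pred_61_a_green and pred_61_b_green:
        # Both GREEN: bounded to the APR + ChatML special-token path.
        return "APR + ChatML special-token path"
    if pred_61_b_green:
        # APR direct is clean: special-token handling, not cumulative drift.
        return "special-token handling, not cumulative drift"
    if pred_61_a_green:
        # GGUF + ChatML is clean: bug is APR-side chat-template handling.
        return "APR-side chat-template handling"
    return "undetermined — needs further disambiguation"

print(localize_bug(True, True))  # → APR + ChatML special-token path
```

As §61.8 records, reality landed outside this table: PRED-61-A came back RED in a shape the predictions didn't anticipate, which is what forced the three-prompt control.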

Changes (1 file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.05.0 → v3.06.0; new banner
    summarizing §61 (one paragraph, 1 of 5 §17.5 PARTIALs LIVE,
    SHIP-002 evidence, SHIP-006/008 BLOCKED, PRED-61-A/B set up).
  - New §61 section above §58 (newest-first ordering): 7
    sub-sections (61.1 separation table, 61.2 direct-prompt evidence,
    61.3 ChatML-prompt evidence, 61.4 §60→§61 separation rationale,
    61.5 falsifiable next investigation step, 61.6 ship-% movement,
    61.7 what §61 is NOT).

Validation:
- Spec section format consistent with §58 (newest-first, dated, sub-
  sections numbered §61.X).
- All 6 cascade PRs from this session referenced explicitly (#1604,
  #1606, #1607, #1608, #1609, this PR).
- Ship-% movement quantified: MODEL-1 91% → 92% (1 of 5 PARTIALs).
- Methodological alignment: zero eprintln!, zero bash workarounds;
  all evidence captured via existing apr CLI primitives.

Refs:
- evidence/ship-002-discharge-2026-05-10/ (LIVE evidence directory)
- contracts/qwen2-e2e-verification-v1.yaml v1.12.0 (SHIP-002 DISCHARGED)
- contracts/apr-vs-gguf-forward-parity-v1.yaml v1.2.0 (parent PR #1608)
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_test_methodology_can_fake_bugs.md
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #29 PMAT-CODE-SHIP-TWO-SECTION-61.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… taxonomy (PMAT-CODE-SHIP-TWO-SECTION-61-8)

Same-day continuation of §61. Both falsifiable predictions fired on
noah-Lambda-Vector RTX 4090 (apr v0.32.0 post-e856eb91f).

PRED-61-B GREEN (predicted):
  apr run <APR teacher> --prompt "What is 2+2? The answer is " → "4"
  Wall: 79.09s. Confirms APR forward path under direct prompts is
  semantically correct. Matches §60 closure.

PRED-61-A RED — but in an unexpected way:
  apr run <GGUF teacher> emits byte-identical
    "ampiezza = 0.5\ndiametro = 10\naltezza = 20\n# Calcolo del volume\nvolume = ("
  across THREE distinct prompts:
    1. "What is 2+2? The answer is " (direct continuation)
    2. "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n" (ChatML)
    3. "Hello, my name is" (conversational, no question)
  Wall times: 48.73s / 48.68s / 39.65s — different (proving inference
  IS running, not cached), but output text matches byte-for-byte.

This is a PROMPT-INSENSITIVE GGUF generation bug — input tokens are
dropped or ignored, or the model state is initialized to a fixed
configuration before the forward pass starts.
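A toy model of that hypothesis (names and the canned string are illustrative, not realizar internals): if the prompt never reaches the forward pass, every prompt collapses to one continuation.

```python
# Toy model of the hypothesized GGUF bug: the prompt is accepted but never
# reaches the embedding lookup, so decoding always starts from the same
# fixed state and emits one canned continuation. Illustrative only.
CANNED = "ampiezza = 0.5\ndiametro = 10\n"

def buggy_generate(prompt: str) -> str:
    _ = prompt  # dropped / ignored before the forward pass
    return CANNED

prompts = [
    "What is 2+2? The answer is ",
    "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n",
    "Hello, my name is",
]
unique_outputs = {buggy_generate(p) for p in prompts}
print(len(unique_outputs))  # → 1: prompt-insensitive, byte-identical output
```

Branch B's instrumentation of the token IDs at the embedding lookup is the direct falsifier for this toy model: if the logged IDs differ per prompt while the output does not, the fixed state sits after the lookup instead.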

Five-Whys for the §61.8 amendment:
1. Why §61.8? Both PRED-61-A and PRED-61-B fired; need durable record.
2. Why three prompts on GGUF? PRED-61-A's RED outcome in unexpected
   shape required disambiguation — was it ChatML-specific or
   structural? Three distinct prompts confirm structural.
3. Why does this matter? §61's 2-way picture (APR ChatML BROKEN /
   APR direct WORKING) was incomplete. Reality is 3-way: APR direct
   WORKING, APR ChatML BROKEN with \ns\ns repetition, GGUF any-prompt
   BROKEN with prompt-insensitive canned output.
4. Why split into two branches? Branch A (APR ChatML) and Branch B
   (GGUF prompt-insensitive) are independent — different code paths,
   different failure modes, different fix scopes.
5. Why methodology lesson #8? PRED-61-A asked "is GGUF + ChatML clean?"
   and the answer is "no, but for an entirely different reason than
   ChatML special-token handling". Without the third-prompt control
   (Hello), the §61.8 taxonomy would have collapsed into "all paths
   broken under ChatML" which would mis-localize.

§61.8 amendments to spec (1 file):
- Atomic next action banner: v3.06.0 → v3.07.0
- Add §61.8 sub-section above the closing --- divider of §61, with:
  - 61.8.0: empirical PRED firing (apr run examples + outputs)
  - 61.8.1: refined 3-way bug taxonomy (table)
  - 61.8.2: Branch A vs Branch B independent investigation cascades
  - 61.8.3: ship-% movement (stays 92%) + per-SHIP* blocker mapping
  - 61.8.4: methodology lesson #8 (RED outcome may surface different bug)

Evidence (NEW directory):
- evidence/section-61-8-pred-fired-2026-05-10/
  - pred-61-b-apr-direct.txt (29 lines, "4" output)
  - pred-61-a-gguf-direct.txt (32 lines, Italian "ampiezza...")
  - pred-61-a-gguf-chatml.txt (32 lines, byte-identical Italian)
  - gguf-third-prompt.txt (28 lines, "Hello..." → byte-identical)
  - findings.json (structured 3-way taxonomy + investigation branches)

Validation:
- Section format consistent with §61.1-61.7 (numbered §61.X.N sub-
  sub-sections under §61.8).
- All evidence files referenced in spec body.
- Methodological alignment: zero eprintln!, all evidence via apr
  run + tail to text files.

Spec movement:
- v3.06.0 → v3.07.0
- MODEL-1 ship %: stays at 92% (snapshot, not falsifier flip).
- MODEL-2 ship %: unchanged at 57%.

Refs:
- evidence/section-61-8-pred-fired-2026-05-10/findings.json
- SPEC-SHIP-TWO-001 §61.5 (PRED-61-A/B definitions)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #30 PMAT-CODE-SHIP-TWO-SECTION-61-8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 10, 2026 12:50