docs(spec): SHIP-TWO-001 §61 — post-§60 LIVE-discharge cascade + ChatML generation gap #1610

Merged
noahgift merged 1 commit into main from docs/ship-two-spec-section-61-post-60-generation-gap
May 10, 2026
Conversation

@noahgift
Contributor

Summary

Records the empirical findings from this session's LIVE-discharge cascade attempt off §60. §60 closed forward parity (layer-3 ratio 18.23× → 1.245× ∈ [0.5, 2.0]); §61 records the new finding that forward parity does not imply generation parity across prompt formats.

Two-track outcome

DIRECT prompt → GREEN: SHIP-002 LIVE-discharged via PR #1609. apr run --prompt "def fib(n):" --max-tokens 128 on canonical 7B APR teacher emits coherent fib() Python (ast.parse 0 syntax errors, 68 nodes, 1 FunctionDef "fib").
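The `ast.parse` checks cited above (0 syntax errors, node count, one FunctionDef named "fib") can be reproduced with a short validator. This is a sketch of that style of check, not the actual SHIP-002 gate code:

```python
import ast

def validate_python(source: str) -> dict:
    """Parse generated text as Python and report the same statistics
    the PR cites: syntax validity, AST node count, FunctionDef names."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return {"ok": False, "error": str(e)}
    nodes = sum(1 for _ in ast.walk(tree))
    fns = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    return {"ok": True, "nodes": nodes, "functions": fns}

sample = (
    "def fib(n):\n"
    "    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
)
print(validate_python(sample))
```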

ChatML-wrapped prompt → BLOCKED: SHIP-006/008 cannot LIVE-discharge today. Same canonical teacher fails apr qa golden_output gate with "gibberish (fragment '\ns\ns' repeats 3+ times)" under <|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n.
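The "gibberish" verdict quoted above suggests a repeated-fragment heuristic. Below is a minimal sketch of such a detector; the real `apr qa golden_output` heuristic is not shown in this PR, so the fragment length and repeat threshold here are assumptions:

```python
def looks_gibberish(text: str, frag_len: int = 4, min_repeats: int = 3) -> bool:
    """Flag output in which any short fragment occurs min_repeats or
    more times consecutively (e.g. '\ns\ns' repeating 3+ times).
    Hypothetical heuristic, not the apr implementation."""
    for i in range(max(0, len(text) - frag_len + 1)):
        frag = text[i:i + frag_len]
        if frag * min_repeats in text:
            return True
    return False
```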

Five-Whys

  1. Why §61 needed? §60 closed forward parity but SHIP-006/008 LIVE-discharge attempts failed empirically.
  2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is a binding criterion only at the activation-stats level; arg-max sampling under cumulative drift is not directly bounded.
  3. Why does prompt format matter? Direct prompts → high-confidence next-token regime where drift doesn't flip arg-max. ChatML → low-margin regime where drift CAN flip arg-max.
  4. Why record rather than fix now? The bug is multi-PR in scope (special-token handling vs cumulative drift still needs bisection); PRED-61-A/B set up the next falsifiable step.
  5. Why a durable spec amendment? Each day the spec doesn't reflect the §60 → §61 separation, future sessions may misinterpret §60 closure as a full SHIP-007-class discharge.
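The regime distinction in Why 3 can be illustrated with a toy logit model: the same additive drift almost never flips a large arg-max margin but frequently flips a small one. The logit values and drift scale below are illustrative assumptions, not measured APR drift:

```python
import numpy as np

rng = np.random.default_rng(0)
high_margin = np.array([5.0, 1.0, 0.9])   # direct-prompt-like: clear winner
low_margin = np.array([2.00, 1.98, 0.5])  # ChatML-like: 0.02 margin

def argmax_flip_rate(logits, drift_std, trials=10_000):
    """Fraction of trials in which additive Gaussian drift of the
    given scale changes the arg-max token."""
    base = int(np.argmax(logits))
    noise = rng.normal(0.0, drift_std, size=(trials, logits.size))
    return float(np.mean(np.argmax(logits + noise, axis=1) != base))

print(argmax_flip_rate(high_margin, 0.05))  # near zero: 4.0 margin survives drift
print(argmax_flip_rate(low_margin, 0.05))   # substantial: 0.02 margin flips often
```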

§61.5 falsifiable next predictions

  • PRED-61-A: GGUF + ChatML on canonical 7B → clean? If GREEN, bug is APR-side in chat-template handling.
  • PRED-61-B: APR + direct continuation "What is 2+2? The answer is " (no ChatML) → clean? If GREEN, bug is special-token handling NOT cumulative drift.
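The only difference between the BLOCKED case and the PRED-61-B case is prompt framing. A sketch of the two prompt variants, using the ChatML wrapper quoted above (the helper name is hypothetical):

```python
def chatml_wrap(user_msg: str) -> str:
    """Qwen2-style ChatML wrapper, as quoted in the BLOCKED evidence.
    Helper name is hypothetical, not an apr API."""
    return (f"<|im_start|>user\n{user_msg}<|im_end|>\n"
            f"<|im_start|>assistant\n")

# The two framings under bisection: same question, different regime.
blocked_prompt = chatml_wrap("What is 2+2?")      # SHIP-006/008 path
pred_61_b_prompt = "What is 2+2? The answer is "  # PRED-61-B direct path
```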

Changes (1 file)

  • docs/specifications/aprender-train/ship-two-models-spec.md:
    • Atomic next action banner: v3.05.0 → v3.06.0
    • New §61 section (above §58, newest-first ordering) with 7 sub-sections (separation table / direct-prompt evidence / ChatML-prompt evidence / §60→§61 separation rationale / falsifiable predictions / ship-% movement / what §61 is NOT)

Ship-% Movement

MODEL-1 91% → 92% (1 of 5 §17.5 PARTIALs LIVE).

Cascade this session: 6 PRs in 24h working SHIP-TWO-001 (#1604, #1606, #1607, #1608, #1609, this PR).

Test Plan

  • No code changes; pure documentation
  • Section format consistent with §58 (newest-first, sub-sections numbered §61.X)
  • All 6 cascade PRs referenced explicitly
  • Methodological alignment: zero eprintln!, all evidence via existing apr CLI primitives

🤖 Generated with Claude Code

…ML generation gap (PMAT-CODE-SHIP-TWO-SECTION-61)

Records the empirical findings from this session's LIVE-discharge
cascade attempt off §60. Two-track outcome:

DIRECT PROMPT (SHIP-002): GREEN.
`apr run /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr
--prompt "def fib(n):" --max-tokens 128` produces clean fib() Python
(`ast.parse` 0 syntax errors, 68 nodes, 1 FunctionDef "fib"). LIVE
discharged via PR #1609 (`qwen2-e2e-verification-v1.yaml` v1.10.0 →
v1.12.0).

CHATML PROMPT (SHIP-006/008): BLOCKED.
Same canonical 7B teacher fails `apr qa golden_output` gate with
"gibberish (fragment '\\ns\\ns' repeats 3+ times)" under ChatML wrapper
`<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n`.
Same model + same engine + different prompt format → different
output regime.

The §60 closure proved per-layer FORWARD parity within Q4K tolerance
(layer-3 ratio 1.245× ∈ [0.5, 2.0] on canonical 7B). It did NOT prove
GENERATION parity under arbitrary prompt distributions. §61 separates
these two invariants and surfaces the asymmetry as a NEW finding.
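As a sketch of what the in-band check looks like: the spec states only that the layer-3 activation ratio must fall in [0.5, 2.0]; the RMS statistic below is an assumption about how such a per-layer ratio could be computed, not the apr implementation:

```python
import numpy as np

def layer_parity_ratio(ref_act: np.ndarray, apr_act: np.ndarray) -> float:
    """Per-layer RMS-activation ratio (APR / reference). Hypothetical
    form of the statistic; the spec gives only the ratio values."""
    rms = lambda x: float(np.sqrt(np.mean(np.square(x))))
    return rms(apr_act) / rms(ref_act)

def forward_parity_ok(ratio: float, lo: float = 0.5, hi: float = 2.0) -> bool:
    """§60 acceptance band for the per-layer ratio."""
    return lo <= ratio <= hi

print(forward_parity_ok(1.245))  # §60 closed value: in band
print(forward_parity_ok(18.23))  # pre-§60 value: out of band
```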

Five-Whys for the §61 amendment:
1. Why is §61 needed? §60 closed forward parity but SHIP-006/008
   LIVE-discharge attempts failed empirically.
2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is binding
   criterion only at the activation-stats level; arg-max sampling
   under cumulative drift is not directly bounded.
3. Why does prompt format matter? Direct prompts ("def fib(n):") put
   model in high-confidence next-token regime where small drift
   doesn't flip arg-max. ChatML prompts (instruction-following,
   chain-of-thought initialization) put model in low-margin regime
   where drift CAN flip arg-max.
4. Why record this in spec rather than just fix? The bug is multi-PR
   scope (special-token handling vs cumulative drift bisection
   needed). PRED-61-A/B set up the next falsifiable diagnostic step.
5. Why now (durable spec rather than evidence-only)? Each day the
   spec doesn't reflect the §60 → §61 separation, future sessions
   may misinterpret §60 closure as full SHIP-007-class discharge.

§61.5 falsifiable predictions:
- PRED-61-A: GGUF + ChatML on canonical 7B → clean output? If GREEN,
  bug is APR-side in chat-template handling.
- PRED-61-B: APR + direct continuation prompt "What is 2+2? The answer
  is " (no ChatML wrapper) → clean output? If GREEN, bug is special-
  token handling NOT cumulative drift.

If both PRED-61-A and PRED-61-B are GREEN, the bug is bounded to
"APR + ChatML special-token path" — multi-PR scope but tractable.
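The bisection logic above can be written as a small decision table. Only the GREEN branches are stated in the PR; the fall-through wording here is an assumption, not spec text:

```python
def bisect_verdict(pred_a_green: bool, pred_b_green: bool) -> str:
    """Decision table for the PRED-61-A/B bisection outcomes."""
    if pred_a_green and pred_b_green:
        return "APR + ChatML special-token path"
    if pred_a_green:
        return "APR-side chat-template handling"
    if pred_b_green:
        return "special-token handling, not cumulative drift"
    return "inconclusive: further bisection needed"
```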

Changes (1 file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.05.0 → v3.06.0; new banner
    summarizing §61 (one paragraph, 1 of 5 §17.5 PARTIALs LIVE,
    SHIP-002 evidence, SHIP-006/008 BLOCKED, PRED-61-A/B set up).
  - New §61 section above §58 (newest-first ordering): 7
    sub-sections (61.1 separation table, 61.2 direct-prompt evidence,
    61.3 ChatML-prompt evidence, 61.4 §60→§61 separation rationale,
    61.5 falsifiable next investigation step, 61.6 ship-% movement,
    61.7 what §61 is NOT).

Validation:
- Spec section format consistent with §58 (newest-first, dated, sub-
  sections numbered §61.X).
- All 6 cascade PRs from this session referenced explicitly (#1604,
  #1606, #1607, #1608, #1609, this PR).
- Ship-% movement quantified: MODEL-1 91% → 92% (1 of 5 PARTIALs).
- Methodological alignment: zero eprintln!, zero bash workarounds;
  all evidence captured via existing apr CLI primitives.

Refs:
- evidence/ship-002-discharge-2026-05-10/ (LIVE evidence directory)
- contracts/qwen2-e2e-verification-v1.yaml v1.12.0 (SHIP-002 DISCHARGED)
- contracts/apr-vs-gguf-forward-parity-v1.yaml v1.2.0 (parent PR #1608)
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_test_methodology_can_fake_bugs.md
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #29 PMAT-CODE-SHIP-TWO-SECTION-61.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 10, 2026 12:20
@noahgift noahgift merged commit bbbdf6b into main May 10, 2026
11 checks passed
@noahgift noahgift deleted the docs/ship-two-spec-section-61-post-60-generation-gap branch May 10, 2026 12:46
