docs(evidence): §61 — 5g.1 re-encode SUCCESS, 5g.2 honest dispatch surfaces H4 (PMAT-CODE-PRETRAIN-INIT-LOAD-003) #1600

Open
noahgift wants to merge 2 commits into main from docs/section-61-5g-1-re-encode-success-2026-05-10

Conversation

@noahgift
Contributor

TL;DR

  • 5g.1 re-encode SUCCESS: 1.24 B Python tokens, 0% unk, 7.42 bits entropy (was 99.99% unk / 0.001 bits in §60's broken corpus)
  • 5g.2 LIVE dispatch ABORTED: val_loss=11.55 at epoch 0 (> 10.0 threshold) → divergence guard fires
  • NEW defect surface H4: val_loss=19.80 at step 1 — worse than log₂(vocab) = 17.21 bits (sub-random predictions). The Qwen init loads, but something in the init pipeline is structurally broken

What worked: §60 data-bug FULLY CLOSED

| Metric | §60 broken corpus | §61 fixed corpus |
|---|---|---|
| Distinct tokens | 2 | 3324 |
| Shannon entropy | 0.001 bits | 7.415 bits |
| Unk ratio | 99.99% | 0.00% |
| Top tokens | `<unk>`, `</s>` | `Ġ`-prefix, `\n`, common Python bigrams |

PR #1598's encoder fix processes the full 3 GB Python corpus correctly: 1.24 B tokens / 405 K docs across 126 shards in ~5 min of wall time (vs. 17 hr in the §60 broken mode).
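For reference, the shard audit behind the table above reduces to a single frequency pass over token IDs. A minimal sketch, not the project's `apr tokenize` code; `audit_shard` and the `unk_id` argument are hypothetical stand-ins:

```rust
use std::collections::HashMap;

// Returns (distinct token count, Shannon entropy in bits, unk ratio)
// for a slice of already-decoded token IDs.
fn audit_shard(token_ids: &[u32], unk_id: u32) -> (usize, f64, f64) {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &id in token_ids {
        *counts.entry(id).or_insert(0) += 1;
    }
    let n = token_ids.len() as f64;
    // Shannon entropy in bits over the empirical token distribution.
    let entropy_bits: f64 = counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum();
    let unk_ratio = *counts.get(&unk_id).unwrap_or(&0) as f64 / n;
    (counts.len(), entropy_bits, unk_ratio)
}

fn main() {
    // Toy input; the real audit ran on shard-0's first 32K token IDs.
    let ids = vec![5u32, 7, 7, 9, 5, 11, 7, 9];
    let (distinct, entropy, unk) = audit_shard(&ids, 0);
    println!("distinct={distinct} entropy={entropy:.3} bits unk={:.2}%", unk * 100.0);
}
```

On the §60 broken corpus this kind of pass collapses to 2 distinct tokens and near-zero entropy; on the §61 corpus it reports the 3324 / 7.415-bit / 0%-unk figures above.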

What broke (5g.2): val_loss > log₂(vocab)

Diagnostic 1-step run on Qwen 0.5B init + 5g.1-v2 corpus:

  • val_loss = 19.80 at step 1
  • log₂(vocab = 151643) = 17.21 bits (uniform-over-vocab baseline)
  • Industry baseline Qwen 0.5B zero-shot on Python: ~1.5–3.0

val_loss > log₂(vocab) means the model assigns the held-out tokens less than uniform probability — anti-aligned, worse than a random init.
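A minimal restatement of the two thresholds in play (illustrative only, not GATE-TRAIN-005's implementation). It assumes val_loss is on the same bits scale as the 17.21 baseline quoted above; if the trainer reports nats, the analogous uniform baseline is ln(151643) ≈ 11.93:

```rust
// Loss of a model that assigns 1/vocab probability to every token, in bits.
fn uniform_baseline_bits(vocab_size: usize) -> f64 {
    (vocab_size as f64).log2()
}

fn check_val_loss(val_loss: f64, vocab_size: usize) {
    let guard = 10.0_f64; // GATE-TRAIN-005 divergence threshold quoted above
    let uniform = uniform_baseline_bits(vocab_size);
    if val_loss > uniform {
        println!("{val_loss:.2} > {uniform:.2}: sub-uniform (anti-aligned) predictions");
    }
    if val_loss > guard {
        println!("{val_loss:.2} > {guard:.1}: divergence guard aborts the run");
    }
}

fn main() {
    check_val_loss(19.80, 151_643); // 1-step diagnostic
    check_val_loss(11.55, 151_643); // 500-step dispatch, epoch 0
}
```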

H4 candidate hypotheses

  • H4A (tied weights): Qwen 0.5B has tie_word_embeddings: true. If populate writes embed_tokens but lm_head is left at random init, the embeddings are correct while predictions are random (a minimal falsifier sketch follows this list)
  • H4B (layout): GGUF/APR is row-major (per tensor-layout-v1). If the init APR's lm_head is column-major, the matmul produces wrong logits (a toy stride illustration also follows this list)
  • H4C (norm scale): RMSNorm weights loaded but rms_norm_eps mismatch cascades through forward
  • H4D (residual stream): Some block's residual contributes zero from uninitialized buffer
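Of the four, H4A is the cheapest to falsify mechanically: with tied embeddings there is exactly one weight matrix to agree on. A minimal sketch, assuming a hypothetical in-memory map from tensor name to flat f32 data (not the actual APR loader API; the tensor names follow the usual Qwen2 convention):

```rust
use std::collections::HashMap;

/// Illustrative stand-in for a loaded checkpoint: tensor name -> flat f32 data.
type TensorMap = HashMap<String, Vec<f32>>;

/// With tie_word_embeddings: true, lm_head must either be absent (derived
/// from embed_tokens at forward time) or identical to embed_tokens.
fn check_tied_embeddings(tensors: &TensorMap, tie_word_embeddings: bool) -> Result<(), String> {
    if !tie_word_embeddings {
        return Ok(());
    }
    let embed = tensors
        .get("model.embed_tokens.weight")
        .ok_or("embed_tokens missing")?;
    match tensors.get("lm_head.weight") {
        None => Ok(()), // fine iff the forward pass reuses embed_tokens for logits
        Some(head) if head == embed => Ok(()),
        Some(_) => Err("lm_head present but differs from embed_tokens: H4A candidate".into()),
    }
}

fn main() {
    let mut t: TensorMap = HashMap::new();
    t.insert("model.embed_tokens.weight".into(), vec![0.1, 0.2]);
    t.insert("lm_head.weight".into(), vec![0.9, 0.9]); // deliberately mismatched
    assert!(check_tied_embeddings(&t, true).is_err()); // H4A would be flagged
}
```

And a toy illustration for H4B, independent of the real loader: the same row-major [out, in] buffer read with column-major strides yields different logits for the same input (shapes and values here are illustrative only):

```rust
fn logits_row_major(w: &[f32], x: &[f32], out: usize, inp: usize) -> Vec<f32> {
    (0..out)
        .map(|o| (0..inp).map(|i| w[o * inp + i] * x[i]).sum::<f32>())
        .collect()
}

fn logits_wrong_strides(w: &[f32], x: &[f32], out: usize, inp: usize) -> Vec<f32> {
    // Same buffer, indexed as if it were column-major: w[o + i * out].
    (0..out)
        .map(|o| (0..inp).map(|i| w[o + i * out] * x[i]).sum::<f32>())
        .collect()
}

fn main() {
    let (out, inp) = (3, 2);
    let w = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // row-major 3x2
    let x = vec![1.0, 1.0];
    assert_ne!(logits_row_major(&w, &x, out, inp), logits_wrong_strides(&w, &x, out, inp));
}
```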

Falsifier ledger update

| Falsifier | Pre-§61 | Post-§61 |
|---|---|---|
| FALSIFY-005 (val_loss < 9.38) | NUMERICALLY-PASSED-METHODOLOGY-SUSPECT (data-bug fake pass) | RED-WITH-METHODOLOGICALLY-HONEST (real defect on real corpus) |

The status flip from fake-pass to honest-RED is itself progress — the contract now reports the binding defect.

Five-Whys

  1. Why val_loss=19.80 at step 1? The industry baseline is 1.5–3.0; log₂(vocab) is 17.21 bits. 19.80 > 17.21 means anti-aligned predictions.
  2. Why anti-aligned despite PR #1579's fix (respect config.use_bias in attention constructor, PMAT-CODE-PRETRAIN-INIT-POPULATE-COVERAGE-001)? PR #1579 fixed Q/K/V bias allocation. H4 is a different gap — likely tied weights (H4A) or layout (H4B).
  3. Why four hypotheses in scope? Each is its own falsifier-discharge cascade per feedback_falsifier_first_cascade_pattern.md.
  4. Why ship diagnosis but not H4 fix? Multi-PR scope. The honest verdict + hypothesis decomposition unblocks the next session's bisection work.
  5. Why does this matter for ship %? The §60 data-bug cascade is FULLY CLOSED. The honest RED on real data is the path forward; the defect was previously masked by the fake pass.

SHIP-TWO impact

  • MODEL-1 ship %: unchanged at 91%
  • MODEL-2 ship %: unchanged at 57% — diagnosis correct, H4 cascade is the gate
  • §60 H1C cascade: FULLY CLOSED. Encoder works end-to-end on real Qwen vocab + real Python corpus.
  • 5g.1: SHIPPED (real corpus on disk: /mnt/.../codeparrot-python-permissive-shards-qwen-v2)
  • Closes: PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (task #21)
  • Tracks: PMAT-CODE-PRETRAIN-INIT-LOAD-003 (H4 cascade) — next ship-mover

Test plan

  • LIVE 5g.1 re-encode: 1.24 B tokens, 0% unk
  • Entropy audit on shard-0 first 32K: 7.42 bits / 17.21 max
  • LIVE 5g.2 500-step dispatch: GATE-TRAIN-005 abort at val_loss=11.55
  • LIVE 5g.2 1-step diagnostic: val_loss=19.80 > log₂(vocab)
  • Documentation only (no Rust/contract changes in this PR)

Files

  • evidence/section-61-5g-1-re-encode-2026-05-10/README.md (NEW, full audit + H4 hypotheses)
  • evidence/section-61-5g-1-re-encode-2026-05-10/dispatch.txt (NEW, encode log)
  • evidence/section-61-5g-2-honest-2026-05-10/dispatch.txt (NEW, 5g.2 dispatch log)

🤖 Generated with Claude Code

docs(evidence): §61 — 5g.1 re-encode SUCCESS, 5g.2 honest dispatch surfaces H4 (PMAT-CODE-PRETRAIN-INIT-LOAD-003)

Records the full discharge of PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003
(task #21) and the new H4 defect surface that the honest data
exposed.

Two artifacts:

1. **5g.1 re-encode SUCCESS** — `apr tokenize encode-corpus` with
   PR #1598's upfront vocab-format detection produced a real Python
   corpus from the 3.0 GB JSONL source:
     - 1,241.7 M tokens
     - 405,944 documents
     - 126 shards × 10 M tokens each
     - Shard-0 first 32K: entropy 7.42 bits / 17.21 max; 3324 distinct
       tokens; **0% unk** (was 99.99% unk in §60's broken corpus)
   The data-bug from §60 is fully closed.

2. **5g.2 LIVE dispatch surfaces H4** — Re-running fine-tune from
   Qwen 0.5B init on the now-real corpus aborted at GATE-TRAIN-005:
     - 500-step run: val_loss = 11.55 at epoch 0 (> 10.0 threshold)
     - 1-step diagnostic: val_loss = 19.80 (> log₂(vocab) = 17.21 bits)
   val_loss > log₂(vocab) means the model assigns LESS than uniform
   probability to true tokens — *worse than random init*. The Qwen
   init weights load (PR #1579's populate-coverage fix is in main)
   but produce sub-random predictions.

Five-Whys

1. Why was val_loss = 19.80 at step 1? Industry baseline for Qwen
   0.5B zero-shot on Python is ~1.5–3.0; uniform random over vocab
   is log₂(151643) = 17.21 bits. 19.80 > 17.21 means the model is
   *anti-aligned* with held-out tokens.
2. Why anti-aligned despite Qwen init being loaded? Some structural
   component of the init pipeline is broken at a layer that PR #1579
   doesn't cover.
3. Four hypotheses for H4:
     A. Tied weights — `tie_word_embeddings: true` on Qwen 0.5B; if
        populate writes embed_tokens but doesn't propagate to
        lm_head (or writes them separately to random buffers),
        forward predictions are random while embeddings are correct.
     B. Layout mismatch — GGUF/APR are row-major (tensor-layout-v1);
        if init APR's lm_head is column-major, matmul produces
        wrong logits.
     C. Norm scale — RMSNorm weights loaded but rms_norm_eps mismatch
        cascades through forward.
     D. Residual stream — some block's residual contributes zero from
        an uninitialized buffer.
4. Why ship the diagnosis but not the H4 fix? Each hypothesis is its
   own falsifier-discharge cascade per `feedback_falsifier_first_cascade_pattern.md`.
   Multi-PR scope.
5. Why does this matter for ship %? FALSIFY-005 status flips from
   NUMERICALLY-PASSED-METHODOLOGY-SUSPECT (pre-§61, fake pass on
   broken corpus) to RED-WITH-METHODOLOGICALLY-HONEST (post-§61,
   real defect on real corpus). The honest RED is itself progress
   — the contract now reports the binding defect.

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — diagnosis correct, H4 cascade
  is the gate
- §60 H1C (data-bug) cascade: FULLY CLOSED. Encoder works
  end-to-end on real Qwen vocab + real Python corpus.

Closes PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (task #21).

Tracking PMAT-CODE-PRETRAIN-INIT-LOAD-003 (H4 cascade) as the next
ship-mover.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 10, 2026 07:18