
docs(evidence): 5g.2 LIVE re-dispatch surfaces H1 eval-batch divergence (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) #1580

Open

noahgift wants to merge 2 commits into main from
docs/h1-eval-batch-cuda-divergence-evidence

Conversation

noahgift (Contributor) commented May 9, 2026

Summary

Records the post-fix LIVE 500-step re-dispatch on RTX 4090 with PR #1579's populate-coverage fix applied. The data empirically confirms H1 (eval_batch degenerate) as the dominant remaining defect — H2 (populate gap) was a real fix but was NOT the root cause of the val_loss anomaly.

The smoking gun

At epoch 0 (after 100 training steps), the model has:

| Metric | Value | Plausibility |
|---|---|---|
| train_loss | 1.20 (perplexity 3.33) | ✅ PLAUSIBLE for Qwen 0.5B fine-tuning on Python |
| val_loss | 0.00081 (perplexity 1.0008) | ❌ Physically IMPOSSIBLE for a non-degenerate LM |

1500× train/eval discrepancy at the same model state. Same kernel (fused_cross_entropy_cuda), same scaling (1.0/seq_len), same forward path. Different batches, both Python code from the same shards.
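
For reference, perplexity is exp(loss). A minimal standalone sketch (plain Rust, no project dependencies; the loss values are the epoch-0 numbers quoted above) of why the eval number cannot be right:

```rust
// Perplexity is exp(mean token cross-entropy). The values below are the
// epoch-0 numbers from the table above.
fn perplexity(loss: f64) -> f64 {
    loss.exp()
}

fn main() {
    let train_loss = 1.20_f64;
    let val_loss = 0.00081_f64;

    // exp(1.20) ≈ 3.32: plausible for Qwen 0.5B fine-tuning on Python.
    println!("train perplexity: {:.4}", perplexity(train_loss));

    // exp(0.00081) ≈ 1.0008: the model would have to assign ~99.92%
    // probability to every held-out token, which a non-degenerate LM
    // cannot do after 100 steps on unseen sequences.
    println!("val perplexity:   {:.4}", perplexity(val_loss));
    println!("implied per-token probability: {:.4}", (-val_loss).exp());
}
```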

H2 was REAL but NOT the dominant cause

| Run | train_loss | val_loss | Interpretation |
|---|---|---|---|
| 2026-05-08 pre-fix (PR #1578) | 0.0019 | 0.0008 | H2 + H1 compounding |
| 2026-05-09 1-step post-fix | 2.24 | 0.628 | H2 fixed; H1 still skews val_loss |
| 2026-05-09 500-step post-fix | 1.20 | 0.00081 | H2 fixed; H1 dominant |

The PR #1579 fix moved train_loss from 0.0019 (degenerate) to 1.20 (plausible) — a 1000× shift confirming structural completeness. But val_loss did NOT shift correspondingly: 0.0008 → 0.00075. The eval pipeline is independent of the populate gap.

Three H1 sub-hypotheses (each its own falsifier-discharge cascade)

  • A) logits_buf state contamination — train_batch writes gradients in-place (KAIZEN-052); eval_batch's gpu_forward may not fully overwrite, leaving stale gradients that cross_entropy reads as "logits" (see the sketch after this list).
  • B) Stream synchronization — host reads loss_partials before kernel finishes; stream.synchronize() should prevent this but a silent kernel failure could leave the buffer at zero.
  • C) Held-out batch label corruption — pathological structure where get_target returns same tokens as get_input. Hard to hit by accident on real Python; least likely.
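
A minimal sketch of the shape a sub-hypothesis A falsifier could take. The Trainer trait, MockTrainer, and probe function below are illustrative stand-ins, not the real CudaTransformerTrainer API, whose signatures may differ:

```rust
// Hedged sketch of the A) probe: eval on the same batch before and after
// one training step, and require that the eval loss does not collapse.
// `Trainer` / `Batch` / `MockTrainer` are stand-ins for the real CUDA types.
trait Trainer {
    type Batch;
    fn train_batch(&mut self, batch: &Self::Batch) -> f64;
    fn eval_batch(&mut self, batch: &Self::Batch) -> f64;
}

fn probe_logits_buf_pollution<T: Trainer>(
    trainer: &mut T,
    train_data: &T::Batch,
    eval_data: &T::Batch,
) {
    let loss_a = trainer.eval_batch(eval_data); // clean forward pass
    let _ = trainer.train_batch(train_data);    // may write gradients in-place
    let loss_b = trainer.eval_batch(eval_data); // must fully overwrite stale state

    // One train step on a near-random model cannot legitimately push the eval
    // loss on the same data toward zero; a 1500×-style collapse would point at
    // stale buffer contents being read as logits.
    let rel_drop = (loss_a - loss_b) / loss_a;
    assert!(rel_drop < 0.95, "eval loss collapsed: {loss_a} -> {loss_b}");
}

// Toy in-memory trainer so this sketch runs end-to-end on its own.
struct MockTrainer {
    eval_loss: f64,
}

impl Trainer for MockTrainer {
    type Batch = Vec<u32>;
    fn train_batch(&mut self, _batch: &Self::Batch) -> f64 {
        1.20 // pretend train loss
    }
    fn eval_batch(&mut self, _batch: &Self::Batch) -> f64 {
        self.eval_loss // a healthy trainer returns a stable eval loss
    }
}

fn main() {
    let mut trainer = MockTrainer { eval_loss: 6.91 };
    probe_logits_buf_pollution(&mut trainer, &vec![1, 2, 3], &vec![4, 5, 6]);
    println!("probe passed on the mock trainer");
}
```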

Why ship the evidence + contract bump but not the fix?

PR atomicity (feedback_falsifier_first_cascade_pattern.md). Each H1 sub-hypothesis is its own falsifier-discharge cascade. Shipping the audit trail NOW preserves the discovery for the next session and unblocks the operator from re-deriving it.

Contract bump

contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 → v1.1.0:

  • status: DRAFT → DRAFT_PARTIAL_DISCHARGE
  • 5/6 falsifiers DISCHARGED, 1/6 NUMERICALLY-PASSED-METHODOLOGY-SUSPECT
  • Promotion to ACTIVE_RUNTIME requires H1 resolved AND a re-dispatch producing val_loss in the plausible 1.5-2.5 range

SHIP-TWO impact

  • MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
  • MODEL-2 ship %: unchanged at 57% (still gated on the honest 5g.3 verdict; this evidence is the audit trail showing why the prior numerical pass was not honest)
  • §50.4 cascade: COMPLETE per #1577
  • 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) with a structurally-complete model (PR #1579); the HONEST 5g.3 verdict remains gated on H1 resolution

Test plan

  • pv validate contracts/apr-pretrain-init-finetune-v1.yaml — 0 errors
  • Documentation-only change (no Rust code, no falsifier semantics flip)
  • Evidence pinned at dispatch.txt (.log gitignored)

Files

  • contracts/apr-pretrain-init-finetune-v1.yaml (v1.0.0 → v1.1.0)
  • evidence/section-60-5g-2-redispatch-2026-05-09/
    • dispatch.txt
    • epoch-{000,001,002}.metadata.json
    • README.md — H1/H2 hypothesis decomposition + audit
  • .pv/lint-previous.json (refresh)

Next steps (out-of-scope follow-ups)

PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001 sub-tasks:

  • Author CudaTransformerTrainer::eval_batch sanity-bound test (assert loss > 0.5 on random-init + synthetic batch)
  • Bisect H1 sub-hypotheses A/B/C with targeted instrumentation
  • Fix root cause; re-dispatch 5g.2 for honest 5g.3 verdict

🤖 Generated with Claude Code

docs(evidence): 5g.2 LIVE re-dispatch surfaces H1 eval-batch divergence (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001)

Records the post-fix LIVE 500-step re-dispatch on RTX 4090 with PR
#1579's populate-coverage fix applied. The data empirically confirms
H1 (eval_batch degenerate) as the dominant remaining defect — H2
(populate gap) was a real fix but was NOT the root cause of the
val_loss anomaly.

The smoking gun
================

At epoch 0 (after 100 training steps), the model has:
  train_loss = 1.20    (PLAUSIBLE for Qwen 0.5B fine-tuning on Python)
  val_loss   = 0.00081 (perplexity 1.0008 — physically IMPOSSIBLE for
                        a non-degenerate LM)

**1500× train/eval discrepancy at the same model state.** Same
kernel (`fused_cross_entropy_cuda`), same scaling (`1.0/seq_len`),
same forward path (`gpu_forward` → `gpu_training.logits_buf`).
Different batches but both Python code from the same shards.

H2 was REAL but NOT the dominant cause
========================================

PR #1579 fixed `MultiHeadAttention::new` to allocate Q/K/V biases
when `config.use_bias=true`. The fix moved train_loss from 0.0019
(degenerate, pre-fix) to 1.20 (plausible) — a 1000× shift confirming
structural completeness.

But val_loss did NOT shift correspondingly: 0.0008 (pre-fix) →
0.00075 (post-fix). The eval pipeline returned essentially the same
~0 number both before and after the H2 fix, indicating H1 is
independent of H2.

Five-Whys
=========

1. Why is val_loss=0.00075 implausibly low? The model assigns
   probability ≈0.9992 to every held-out token; physically
   impossible for an LM that hasn't seen those exact sequences.
2. Why does the same kernel produce train_loss=1.20 but val_loss=0.00075?
   The two share the same kernel but differ in something upstream
   that the kernel reads.
3. Three sub-hypotheses for "something upstream":
   A) `logits_buf` state contamination — train_batch writes
      gradients in-place (KAIZEN-052); eval_batch's gpu_forward
      may not fully overwrite, leaving stale gradients that
      cross_entropy reads as "logits".
   B) Stream synchronization — host reads loss_partials before
      kernel finishes; stream.synchronize() should prevent this
      but a silent kernel failure could leave the buffer at zero.
   C) Held-out batch label corruption — pathological structure
      where get_target returns same tokens as get_input. Hard
      to hit by accident on real Python; least likely.
4. Why didn't existing falsifiers catch this? The gap is between
   the kernel-level contract (proven correct in unit tests on
   synthetic logits) and the high-level dispatch (no falsifier
   asserts CudaTransformerTrainer::eval_batch produces a loss in
   a sensible range for known input). H1 is a between-contracts
   gap, same class as the H2 gap PR #1579 closed.
5. Why ship the evidence + contract bump but not the fix? PR
   atomicity (`feedback_falsifier_first_cascade_pattern.md`).
   Each H1 sub-hypothesis (A/B/C) is its own falsifier-discharge
   cascade. Shipping the audit trail NOW preserves the discovery
   for the next session and unblocks the operator from re-deriving
   it.

Contract bump
=============

`contracts/apr-pretrain-init-finetune-v1.yaml` v1.0.0 → v1.1.0:
  status: DRAFT → DRAFT_PARTIAL_DISCHARGE
  Records the 5/6 DISCHARGED + 1/6 NUMERICALLY-PASSED-METHODOLOGY-SUSPECT
  state. Promotion to ACTIVE_RUNTIME requires H1 resolved AND a
  re-dispatch producing val_loss in the plausible 1.5-2.5 range.

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% (still gated on honest 5g.3
  verdict; this evidence is the audit trail showing why the prior
  numerical pass was not honest)
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) with
  structurally-complete model (PR #1579) but the HONEST 5g.3
  verdict remains gated on H1 resolution

Quality gates (this PR)
========================

- pv validate contracts/apr-pretrain-init-finetune-v1.yaml: 0 errors
- Documentation-only change (no Rust code, no falsifier semantics flip)
- Evidence pinned at dispatch.txt (.log gitignored; renamed)

Files
=====

- contracts/apr-pretrain-init-finetune-v1.yaml (v1.0.0 → v1.1.0)
- evidence/section-60-5g-2-redispatch-2026-05-09/
    dispatch.txt
    epoch-{000,001,002}.metadata.json
    README.md (H1/H2 hypothesis decomposition + audit)

Out-of-scope follow-ups (each its own falsifier-discharge cascade)
=================================================================

PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001 sub-tasks:
  - Author CudaTransformerTrainer::eval_batch sanity-bound test
    (assert loss > 0.5 on random-init + synthetic batch)
  - Bisect H1 sub-hypotheses A/B/C with targeted instrumentation
  - Fix root cause; re-dispatch 5g.2 for honest 5g.3 verdict

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift force-pushed the docs/h1-eval-batch-cuda-divergence-evidence branch from c4aef32 to f8d1a5d on May 9, 2026 12:22
noahgift added a commit that referenced this pull request May 9, 2026
… level (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) (#1581)

Adds two CUDA-gated falsifier unit tests in pretrain_real_cuda.rs::tests
that probe the H1 (eval_batch degenerate) hypothesis surfaced by
PR #1580's evidence (1500× train/val discrepancy at the same model
state, post H2-fix).

Both tests PASS on lambda-vector RTX 4090, EMPIRICALLY FALSIFYING
H1 hypothesis A (`logits_buf` train→eval state pollution at the
unit-test level). The production bug must therefore be something
that does NOT manifest in:
  - tiny model (2 layers, hidden=64, vocab=1000)
  - random-init weights (no Qwen pretrained)
  - synthetic random tokens (no real Python from Qwen tokenizer)
  - seq_len=16 batches
  - 1 train_batch step

The 1500× discrepancy in production likely requires one of:
  - real Qwen 0.5B model size + weights
  - real seq_len=512 batches
  - real Python tokens (specific tokenizer-vocab patterns)
  - many train steps (state accumulation effects)
  - an interaction not captured by unit-level reproducer

Five-Whys for landing GREEN falsifiers (rather than waiting for fix):

1. Why ship GREEN falsifiers if they don't reproduce the bug?
   The tests still prove H1A is FALSIFIED at unit level — that's
   a real positive contribution to the hypothesis decomposition
   even though they don't catch the actual production bug.
2. Why isn't this just "wait until you find the bug"?
   Per `feedback_falsifier_first_cascade_pattern.md`: 1 PR ≈ 1
   falsifier discharge. The "H1A falsified at unit level" is
   itself a discharge. The production-level bug needs a different
   reproducer (probably a smaller-but-real-Qwen integration test).
3. Why two tests instead of one?
   - 001 (sanity bound) — checks fresh-init eval_batch returns
     loss ∈ [0.5, 1.5×ln(vocab)]; catches the simplest H1 form.
   - 002 (train→eval pollution) — checks eval_batch is not
     contaminated by train_batch's in-place gradient writeback;
     directly tests hypothesis A.
4. Why CUDA-gated rather than universal?
   `CudaTransformerTrainer::new` requires CUDA runtime. The tests
   run only when the operator (or a CUDA CI lane) explicitly passes
   `--features cuda`. Default CI sees only the `#[cfg(test)]` mod
   stub, so no breakage.
5. What does this NOT cover?
   - H1B (stream sync) — not directly tested; would need a
     deliberate kernel-failure injection.
   - H1C (held-out label corruption) — not tested; would need to
     inspect actual production held_out tokens for pathological
     patterns.
   - H1 at production scale — needs an integration test with real
     Qwen model + real tokens.

Test details

falsify_eval_batch_h1_sanity_bound:
  - tiny config (vocab=1000), random init
  - synthetic batch (4 × 16 tokens, LCG-deterministic)
  - eval_batch returns loss ≈ ln(1000) = 6.91
  - asserts loss ∈ [0.5, 1.5×ln(vocab)] = [0.5, 10.4]
  - PASSED on RTX 4090
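
A self-contained sketch of the sanity-bound arithmetic, assuming the reproducer shape described above. The trainer call is replaced by the random-init expectation, since constructing CudaTransformerTrainer needs a CUDA runtime:

```rust
// Hedged sketch: LCG-deterministic synthetic tokens plus the sanity bound.
// In the real test the loss comes from eval_batch(); here it is stood in
// by the random-init expectation ~ln(vocab).
fn lcg_tokens(n: usize, vocab: u32, mut state: u64) -> Vec<u32> {
    (0..n)
        .map(|_| {
            // Knuth MMIX LCG constants: deterministic, reproducible batch.
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((state >> 33) as u32) % vocab
        })
        .collect()
}

fn main() {
    let vocab = 1000u32;
    let _batch = lcg_tokens(4 * 16, vocab, 42); // 4 sequences × 16 tokens

    // Random-init logits are roughly uniform, so expected loss ≈ ln(vocab).
    let loss = (vocab as f64).ln(); // ≈ 6.91; in the real test: eval_batch()

    let (lo, hi) = (0.5, 1.5 * (vocab as f64).ln()); // [0.5, 10.4]
    assert!(loss > lo && loss < hi, "eval_batch loss {loss} outside [{lo}, {hi}]");
}
```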

falsify_eval_batch_h1_train_pollution:
  - same tiny config + random init
  - two distinct synthetic batches: train_batch_data + eval_batch_data
  - sequence: eval_batch(eval_data) → train_batch(train_data) → eval_batch(eval_data)
  - asserts |loss_b - loss_a| / loss_a < 0.95 (small drift is
    tolerated, a 1500× collapse is forbidden — the production
    observation would correspond to a ~99.93% relative drop, well
    above the threshold)
  - PASSED on RTX 4090
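
A sketch of the threshold arithmetic this assertion encodes, using the production numbers from PR #1580's evidence. The losses are hard-coded here; in the real test they come from the two eval_batch calls:

```rust
// Hedged sketch of the pollution threshold. loss_a / loss_b stand in for
// the two eval_batch() readings around a single train_batch() step.
fn relative_drop(loss_a: f64, loss_b: f64) -> f64 {
    (loss_a - loss_b) / loss_a
}

fn main() {
    // Unit-test scenario: both readings near ln(1000); tiny drift passes.
    let (loss_a, loss_b) = (6.91_f64, 6.89_f64);
    assert!(relative_drop(loss_a, loss_b).abs() < 0.95);

    // Production-scale observation: train_loss 1.20 vs val_loss 0.00081.
    // Read as a drop, that is ~99.93% relative, which the assertion rejects.
    let prod = relative_drop(1.20, 0.00081);
    println!("production-scale relative drop: {:.4}", prod); // ≈ 0.9993
    assert!(prod >= 0.95);
}
```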

Hypothesis status update

| Sub-hypothesis | Pre-this-PR | Post-this-PR |
|---|---|---|
| H1A (logits_buf train→eval pollution) | OPEN suspected | **FALSIFIED at unit level** |
| H1B (stream synchronization) | OPEN | OPEN (not tested) |
| H1C (held-out label corruption) | OPEN | OPEN (not tested) |
| H1 at production scale | OPEN | OPEN (needs integration test) |

The H1A falsification narrows the hypothesis space. Next-cycle
falsifiers should target H1B (stream sync) or H1C (held-out
content) or full-scale integration with a smaller-but-real Qwen
checkpoint.

Quality gates

- pv validate (no contract change in this PR)
- cargo test -p aprender-train --features cuda --lib falsify_eval_batch_h1: 2/2 PASS on RTX 4090
- cargo test -p aprender-train --lib (default features): tests gated out, no CI breakage
- rustfmt --check: clean
- cargo clippy -p aprender-train --lib -- -D warnings: clean

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% (H1 still open at production scale)
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE; HONEST 5g.3 verdict still
  gated on H1 resolution at production scale

Out-of-scope follow-ups (each its own falsifier-discharge cascade)

- H1 at production scale: integration test with smaller-but-real
  Qwen checkpoint + real Python tokens.
- H1B stream-sync probe: deliberate kernel-failure injection +
  loss_partials-buffer state inspection.
- H1C held-out content audit: dump first 16 batches of the 5g.1
  corpus for pathological patterns (low entropy, repeated tokens);
  a minimal entropy-check sketch follows below.
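
A minimal sketch of the entropy check mentioned above, in plain Rust. The batches here are synthetic Vec<u32> stand-ins; the real audit would load the 5g.1 held-out batches:

```rust
// Hedged sketch for the H1C audit: Shannon entropy (bits) of the token
// distribution in a batch. Near-zero entropy or long single-token runs
// would flag a pathological held-out batch.
use std::collections::HashMap;

fn token_entropy_bits(tokens: &[u32]) -> f64 {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    let varied: Vec<u32> = (0u32..512).map(|i| (i * 37 + 11) % 1000).collect();
    let degenerate = vec![42u32; 512];
    println!("varied batch:     {:.2} bits", token_entropy_bits(&varied));
    println!("degenerate batch: {:.2} bits", token_entropy_bits(&degenerate)); // 0.00
}
```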

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>