…ce (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001)
Records the post-fix LIVE 500-step re-dispatch on RTX 4090 with PR
#1579's populate-coverage fix applied. The data empirically confirms
H1 (eval_batch degenerate) as the dominant remaining defect — H2
(populate gap) was a real fix but was NOT the root cause of the
val_loss anomaly.
The smoking gun
================
At epoch 0 (after 100 training steps), the model has:
train_loss = 1.20 (PLAUSIBLE for Qwen 0.5B fine-tuning on Python)
val_loss = 0.00081 (perplexity 1.0008 — physically IMPOSSIBLE for
a non-degenerate LM)
**1500× train/eval discrepancy at the same model state.** Same
kernel (`fused_cross_entropy_cuda`), same scaling (`1.0/seq_len`),
same forward path (`gpu_forward` → `gpu_training.logits_buf`).
Different batches but both Python code from the same shards.
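The impossibility claim is just the loss-to-perplexity relation
ppl = exp(loss). A minimal arithmetic sketch, using only the numbers
reported above (no model code or aprender-train API involved):

```rust
// Arithmetic only: perplexity = exp(mean token cross-entropy).
fn main() {
    let train_loss: f64 = 1.20;
    let val_loss: f64 = 0.00081;
    println!("train ppl = {:.4}", train_loss.exp()); // ~3.3201
    println!("val   ppl = {:.5}", val_loss.exp());   // ~1.00081
    // A val ppl of ~1.0008 means the model assigns probability
    // exp(-0.00081) ~= 0.99919 to every held-out token on average,
    // which no non-degenerate LM achieves on unseen text.
}
```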
H2 was REAL but NOT the dominant cause
========================================
PR #1579 fixed `MultiHeadAttention::new` to allocate Q/K/V biases
when `config.use_bias=true`. The fix moved train_loss from 0.0019
(degenerate, pre-fix) to 1.20 (plausible) — a ~630× shift confirming
structural completeness.
But val_loss did NOT shift correspondingly: 0.0008 (pre-fix) →
0.00075 (post-fix). The eval pipeline returned essentially the same
~0 number both before and after the H2 fix, indicating H1 is
independent of H2.
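A minimal sketch of the shape of that fix, assuming a placeholder
`Tensor` type and hypothetical field names (the real
`MultiHeadAttention` layout in aprender-train is not shown in this
PR):

```rust
// Hypothetical sketch of the H2 fix shape. `Tensor`, the field
// names, and the config layout are assumptions, not the real
// aprender-train API.
struct Tensor;

impl Tensor {
    fn zeros(_n: usize) -> Self { Tensor }
}

struct AttnConfig { hidden: usize, use_bias: bool }

struct MultiHeadAttention {
    q_bias: Option<Tensor>,
    k_bias: Option<Tensor>,
    v_bias: Option<Tensor>,
}

impl MultiHeadAttention {
    fn new(config: &AttnConfig) -> Self {
        // Pre-fix, these were effectively never allocated, so
        // pretrained Qwen bias weights were silently dropped and
        // the loaded model was structurally incomplete.
        let mk = || config.use_bias.then(|| Tensor::zeros(config.hidden));
        MultiHeadAttention { q_bias: mk(), k_bias: mk(), v_bias: mk() }
    }
}

fn main() {
    let attn = MultiHeadAttention::new(&AttnConfig { hidden: 64, use_bias: true });
    assert!(attn.q_bias.is_some() && attn.k_bias.is_some() && attn.v_bias.is_some());
}
```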
Five-Whys
=========
1. Why is val_loss=0.00075 implausibly low? The model assigns
probability ≈0.9992 to every held-out token; physically
impossible for an LM that hasn't seen those exact sequences.
2. Why does the same kernel produce train_loss=1.20 but
val_loss=0.00075? The two paths share the kernel but differ in
something upstream that the kernel reads.
3. Three sub-hypotheses for "something upstream":
A) `logits_buf` state contamination — train_batch writes
gradients in-place (KAIZEN-052); eval_batch's gpu_forward
may not fully overwrite, leaving stale gradients that
cross_entropy reads as "logits".
B) Stream synchronization — host reads loss_partials before the
kernel finishes; stream.synchronize() should prevent this,
but a silent kernel failure could leave the buffer at zero
(see the sketch after this list).
C) Held-out batch label corruption — pathological structure
where get_target returns same tokens as get_input. Hard
to hit by accident on real Python; least likely.
4. Why didn't existing falsifiers catch this? The gap is between
the kernel-level contract (proven correct in unit tests on
synthetic logits) and the high-level dispatch (no falsifier
asserts CudaTransformerTrainer::eval_batch produces a loss in
a sensible range for known input). H1 is a between-contracts
gap, same class as the H2 gap PR #1579 closed.
5. Why ship the evidence + contract bump but not the fix? PR
atomicity (`feedback_falsifier_first_cascade_pattern.md`).
Each H1 sub-hypothesis (A/B/C) is its own falsifier-discharge
cascade. Shipping the audit trail NOW preserves the discovery
for the next session and unblocks the operator from re-deriving
it.
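For sub-hypothesis B, a minimal sketch of the suspected failure mode
with stand-in `Stream` and `DeviceBuffer` types (not the real CUDA
bindings): `stream.synchronize()` only guarantees ordering, so a
silently failed kernel launch leaves the zero-initialized
`loss_partials` buffer intact and the host-side reduction reports a
near-zero loss.

```rust
// Hypothetical sketch of the hypothesis-B failure mode: the host
// reduces a loss_partials buffer that the kernel never wrote. The
// Stream and DeviceBuffer types are stand-ins, not the real API.
struct Stream;
struct DeviceBuffer { host_shadow: Vec<f32> }

impl Stream {
    // Models a "successful" sync after a silently failed launch.
    fn synchronize(&self) -> Result<(), String> { Ok(()) }
}

impl DeviceBuffer {
    fn copy_to_host(&self) -> Vec<f32> { self.host_shadow.clone() }
}

fn read_loss(stream: &Stream, loss_partials: &DeviceBuffer) -> Result<f32, String> {
    stream.synchronize()?; // guards ordering, not kernel success
    let partials = loss_partials.copy_to_host();
    Ok(partials.iter().sum::<f32>() / partials.len() as f32)
}

fn main() {
    // Buffer initialized to zeros and never written by any kernel.
    let buf = DeviceBuffer { host_shadow: vec![0.0; 512] };
    let loss = read_loss(&Stream, &buf).unwrap();
    assert_eq!(loss, 0.0); // degenerate ~0 "val_loss" despite Ok sync
}
```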
Contract bump
=============
`contracts/apr-pretrain-init-finetune-v1.yaml` v1.0.0 → v1.1.0:
status: DRAFT → DRAFT_PARTIAL_DISCHARGE
Records the 5/6 DISCHARGED + 1/6 NUMERICALLY-PASSED-METHODOLOGY-SUSPECT
state. Promotion to ACTIVE_RUNTIME requires H1 resolved AND a
re-dispatch producing val_loss in the plausible 1.5-2.5 range.
SHIP-TWO impact
================
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% (still gated on honest 5g.3
verdict; this evidence is the audit trail showing why the prior
numerical pass was not honest)
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) with
structurally-complete model (PR #1579) but the HONEST 5g.3
verdict remains gated on H1 resolution
Quality gates (this PR)
========================
- pv validate contracts/apr-pretrain-init-finetune-v1.yaml: 0 errors
- Documentation-only change (no Rust code, no falsifier semantics flip)
- Evidence pinned at dispatch.txt (.log gitignored; renamed)
Files
=====
- contracts/apr-pretrain-init-finetune-v1.yaml (v1.0.0 → v1.1.0)
- evidence/section-60-5g-2-redispatch-2026-05-09/
    dispatch.txt
    epoch-{000,001,002}.metadata.json
    README.md (H1/H2 hypothesis decomposition + audit)
- .pv/lint-previous.json (refresh)
Out-of-scope follow-ups (each its own falsifier-discharge cascade)
=================================================================
PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001 sub-tasks:
- Author CudaTransformerTrainer::eval_batch sanity-bound test
  (assert loss > 0.5 on random-init + synthetic batch; a sketch
  follows this list)
- Bisect H1 sub-hypotheses A/B/C with targeted instrumentation
- Fix root cause; re-dispatch 5g.2 for honest 5g.3 verdict
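A minimal sketch of what the sanity-bound falsifier could look like,
with a stand-in `Trainer` in place of the real
`CudaTransformerTrainer` (whose constructor and batch types are not
shown in this PR):

```rust
// Hypothetical shape of the sanity-bound falsifier. The Trainer
// type and its methods are stand-ins for CudaTransformerTrainer.
struct Trainer { vocab: usize }

impl Trainer {
    fn random_init(vocab: usize) -> Self { Trainer { vocab } }
    // Stand-in: a random-init LM scores roughly uniform over the
    // vocab, so its mean cross-entropy sits near ln(vocab).
    fn eval_batch(&self, _tokens: &[u32]) -> f32 {
        (self.vocab as f32).ln()
    }
}

#[test]
fn falsify_eval_batch_sanity_bound() {
    let vocab = 1000usize;
    let trainer = Trainer::random_init(vocab);
    // Deterministic synthetic batch (LCG-style, like the real test).
    let batch: Vec<u32> = (0..64u32).map(|i| (i * 7919) % vocab as u32).collect();
    let loss = trainer.eval_batch(&batch);
    let upper = 1.5 * (vocab as f32).ln(); // ~10.36 for vocab=1000
    assert!(loss > 0.5 && loss < upper, "degenerate eval loss: {loss}");
}
```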
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 9, 2026:
… level (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) (#1581)

Adds two CUDA-gated falsifier unit tests in
pretrain_real_cuda.rs::tests that probe the H1 (eval_batch
degenerate) hypothesis surfaced by PR #1580's evidence (1500×
train/val discrepancy at the same model state, post H2-fix).

Both tests PASS on lambda-vector RTX 4090, EMPIRICALLY FALSIFYING
H1 hypothesis A (`logits_buf` train→eval state pollution) at the
unit-test level. The production bug must therefore be something
that does NOT manifest in:
- tiny model (2 layers, hidden=64, vocab=1000)
- random-init weights (no Qwen pretrained)
- synthetic random tokens (no real Python from the Qwen tokenizer)
- seq_len=16 batches
- 1 train_batch step

The 1500× discrepancy in production likely requires one of:
- real Qwen 0.5B model size + weights
- real seq_len=512 batches
- real Python tokens (specific tokenizer-vocab patterns)
- many train steps (state accumulation effects)
- an interaction not captured by a unit-level reproducer

Five-Whys for landing GREEN falsifiers (rather than waiting for fix)
====================================================================
1. Why ship GREEN falsifiers if they don't reproduce the bug? The
   tests still prove H1A is FALSIFIED at unit level — a real
   positive contribution to the hypothesis decomposition even
   though they don't catch the actual production bug.
2. Why isn't this just "wait until you find the bug"? Per
   `feedback_falsifier_first_cascade_pattern.md`: 1 PR ≈ 1
   falsifier discharge. "H1A falsified at unit level" is itself a
   discharge. The production-level bug needs a different
   reproducer (probably a smaller-but-real-Qwen integration test).
3. Why two tests instead of one?
   - 001 (sanity bound) — checks fresh-init eval_batch returns
     loss ∈ [0.5, 1.5×ln(vocab)]; catches the simplest H1 form.
   - 002 (train→eval pollution) — checks eval_batch is not
     contaminated by train_batch's in-place gradient writeback;
     directly tests hypothesis A.
4. Why CUDA-gated rather than universal?
   `CudaTransformerTrainer::new` requires a CUDA runtime. The
   tests run only when the operator (or a CUDA CI lane) explicitly
   passes `--features cuda`. Default CI sees only the
   `#[cfg(test)]` mod stub, so no breakage.
5. What does this NOT cover?
   - H1B (stream sync) — not directly tested; would need a
     deliberate kernel-failure injection.
   - H1C (held-out label corruption) — not tested; would need
     inspection of actual production held_out tokens for
     pathological patterns.
   - H1 at production scale — needs an integration test with a
     real Qwen model + real tokens.

Test details
============
falsify_eval_batch_h1_sanity_bound:
- tiny config (vocab=1000), random init
- synthetic batch (4 × 16 tokens, LCG-deterministic)
- eval_batch returns loss ≈ ln(1000) = 6.91
- asserts loss ∈ [0.5, 1.5×ln(vocab)] = [0.5, 10.4]
- PASSED on RTX 4090

falsify_eval_batch_h1_train_pollution:
- same tiny config + random init
- two distinct synthetic batches: train_batch_data + eval_batch_data
- sequence: eval_batch(eval_data) → train_batch(train_data) →
  eval_batch(eval_data)
- asserts |loss_b - loss_a| / loss_a < 0.95 (ordinary training
  drift passes; the production observation, a 1500× collapse or
  ~99.93% relative drop, would fail)
- PASSED on RTX 4090

Hypothesis status update
========================
| Sub-hypothesis | Pre-this-PR | Post-this-PR |
|---|---|---|
| H1A (logits_buf train→eval pollution) | OPEN, suspected | **FALSIFIED at unit level** |
| H1B (stream synchronization) | OPEN | OPEN (not tested) |
| H1C (held-out label corruption) | OPEN | OPEN (not tested) |
| H1 at production scale | OPEN | OPEN (needs integration test) |

The H1A falsification narrows the hypothesis space. Next-cycle
falsifiers should target H1B (stream sync), H1C (held-out content),
or full-scale integration with a smaller-but-real Qwen checkpoint.

Quality gates
=============
- pv validate: no contract change in this PR
- cargo test -p aprender-train --features cuda --lib
  falsify_eval_batch_h1: 2/2 PASS on RTX 4090
- cargo test -p aprender-train --lib (default features): tests
  gated out, no CI breakage
- rustfmt --check: clean
- cargo clippy -p aprender-train --lib -- -D warnings: clean

SHIP-TWO impact
===============
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% (H1 still open at production scale)
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE; HONEST 5g.3 verdict still gated
  on H1 resolution at production scale

Out-of-scope follow-ups (each its own falsifier-discharge cascade)
==================================================================
- H1 at production scale: integration test with a smaller-but-real
  Qwen checkpoint + real Python tokens.
- H1B stream-sync probe: deliberate kernel-failure injection +
  loss_partials-buffer state inspection.
- H1C held-out content audit: dump the first 16 batches of the
  5g.1 corpus and check for pathological patterns (low entropy,
  repeated tokens).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
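For reference, a minimal sketch of the shape of the second test,
with the caveat that the `Trainer` internals below are stand-ins,
not the real pretrain_real_cuda.rs API:

```rust
// Hypothetical shape of the train->eval pollution falsifier: the
// same eval batch must score (almost) the same loss before and
// after one train step.
struct Trainer { drift: f32 }

impl Trainer {
    fn random_init() -> Self { Trainer { drift: 0.0 } }
    fn train_batch(&mut self, _tokens: &[u32]) {
        self.drift += 0.01; // one optimizer step nudges the weights
    }
    fn eval_batch(&self, _tokens: &[u32]) -> f32 {
        (1000f32).ln() - self.drift // stays near ln(vocab)
    }
}

#[test]
fn falsify_eval_batch_train_pollution() {
    let mut t = Trainer::random_init();
    let eval_data: Vec<u32> = (0..64u32).map(|i| i % 1000).collect();
    let train_data: Vec<u32> = (0..64u32).map(|i| (i + 500) % 1000).collect();

    let loss_a = t.eval_batch(&eval_data);
    t.train_batch(&train_data);
    let loss_b = t.eval_batch(&eval_data);

    // A 1500x collapse (~99.93% relative drop) trips this bound;
    // the small drift from one step does not.
    assert!((loss_b - loss_a).abs() / loss_a < 0.95,
            "train->eval pollution: {loss_a} -> {loss_b}");
}
```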