Conversation
…al root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004)
H4 cascade bisection: BUG IS IN CUDA PATH.
EMPIRICAL FINDING
CPU `aprender::Transformer::forward` on a populated Qwen 0.5B
model (fresh APR, BF16-correct dtype) produces SENSIBLE logits:
populated: 290/290 tensors
logits: n=151936 nan=0 inf=0
min=-15.03 max=11.72 mean=-3.33 std=2.65
peak-to-mean ratio = 5.68
argmax = 9370 (specific token, not flat)
This means:
- Populate path: GREEN (all 290 Qwen tensors loaded)
- CPU forward: GREEN (clean logits, sensible distribution)
- lm_head tied-embedding fall-through: GREEN (matmul produces
proper logit distribution despite lm_head=None)
H4 ROOT CAUSE LOCALIZATION (post this PR):
| Component | Pre-this-PR | Post-this-PR |
|-----------|-------------|--------------|
| BF16 dtype tag | OPEN | FIXED #1 (PR #1601) |
| Populate (290/290) | OPEN | FALSIFIED — works ✓ |
| CPU forward | OPEN | FALSIFIED — works ✓ |
| Tied embedding | OPEN | FALSIFIED — works ✓ |
| **CUDA path** | OPEN | **CONFIRMED LIVE BUG** |
Empirical contrast:
CPU forward: argmax=9370 with confident peak (peak-to-mean=5.68)
CUDA eval_batch: val_loss > ln(vocab) = sub-random predictions
Same weights, same arch, different backend → CUDA forward path
distorts the result. Three CUDA-side sub-hypotheses for the next
session:
H4D.1 — `CudaTransformerTrainer::with_model` upload distorts
weights during H2D transfer
H4D.2 — `gpu_forward` CUDA kernels (cuBLAS GEMM, RoPE, fused
attention, RMSNorm) produce wrong outputs despite correct
inputs
H4D.3 — `fused_cross_entropy_cuda` reads from a wrong buffer
location (off-by-stride in logits_buf)
Five-Whys
1. Why does val_loss=18.55 > ln(vocab)=17.21 with fresh APR?
Because the CUDA forward path produces sub-random logits even
though CPU forward on the same weights produces sensible ones.
2. Why does CUDA differ from CPU? Because the bug is in one of:
GPU upload, GPU kernels, or eval_batch's cross_entropy buffer
handling. CPU path is end-to-end clean.
3. Why didn't existing falsifiers catch this? Per `feedback_test_methodology_can_fake_bugs.md`,
the CUDA path was validated by convergence on synthetic data
(§44/§45) and from-scratch (§50.4 cascade) — both blind to
forward-pass parity vs CPU reference.
4. Why ship the CPU bisect instead of fixing CUDA directly?
Because pinpointing the bug at the BACKEND boundary (CPU vs
CUDA) is the cheapest narrowing. Without this, the next agent
would have to re-derive that the CPU side works.
5. Why does this matter for ship %? With H4 narrowed to CUDA,
the next falsifier-discharge cascade (PMAT-CODE-PRETRAIN-CUDA-FORWARD-001)
has a clear scope: CPU↔CUDA forward parity test, dump per-layer
hidden states, identify divergence point.
What this PR ships
`falsify_h4_cpu_forward_qwen_logits_sensible` — host-gated test
that loads Qwen 0.5B (fresh APR preferred), populates a polymorphic
Transformer, forward-passes a single token, and asserts:
- logits are finite (no NaN/Inf)
- logits std > 0.01 (not constant)
- peak-to-mean > 1.5 (not uniform)
- argmax in [0, vocab_size) (proper shape)
Empirical run: PASSES on RTX 4090 host with fresh APR.
Quality gates
- cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASS
- rustfmt --check: clean
- cargo clippy -p aprender-train --lib -- -D warnings: clean
SHIP-TWO impact
- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57% — but H4 is now FULLY LOCALIZED
to the CUDA path. The CPU path is provably correct. Next-cycle
bisection has a tight scope (3 sub-hypotheses, all CUDA-specific).
- This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-004 (task #23)
Out-of-scope follow-ups
PMAT-CODE-PRETRAIN-CUDA-FORWARD-001:
- Author CPU↔CUDA forward parity falsifier on populated Qwen
- Bisect H4D.1 (upload), H4D.2 (kernels), H4D.3 (xent buffer)
- Fix root cause; flip MODEL-2 ship % 57% → ≥58%
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
H4 LOCALIZED TO CUDA PATH. CPU
aprender::Transformer::forwardon populated Qwen 0.5B produces SENSIBLE logits (clean argmax=9370, peak-to-mean=5.68). The bug is in CUDA upload or GPU kernels — not in populate, CPU forward, or tied-embedding fall-through.Empirical bisection result
CPU forward on populated Qwen 0.5B (fresh APR, BF16-correct):
CUDA
eval_batchon same weights:val_loss > ln(vocab)(sub-random).Same weights, same arch, different backend → CUDA path is the bug.
H4 component status
Three CUDA-side sub-hypotheses (next-cycle work)
CudaTransformerTrainer::with_modelupload distorts weights during H2Dgpu_forwardCUDA kernels (cuBLAS GEMM / RoPE / RMSNorm / fused attention) produce wrong outputsfused_cross_entropy_cudareads wrong buffer location (off-by-stride in logits_buf)Each is testable via CPU↔CUDA forward parity on populated Qwen.
Five-Whys
Test plan
cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASSrustfmt --check: cleancargo clippy -p aprender-train --lib -- -D warnings: cleanSHIP-TWO impact
Out-of-scope follow-ups
PMAT-CODE-PRETRAIN-CUDA-FORWARD-001:
Files
crates/aprender-train/src/train/pretrain_real.rs(+110, falsify_h4_cpu_forward_qwen_logits_sensible)🤖 Generated with Claude Code