Conversation
…ETRAIN-INIT-LOAD-003) Bisects the §61 val_loss > ln(vocab) anomaly. Empirical findings on lambda-vector RTX 4090: H4 ROOT CAUSE #1: BF16 dtype mislabel ====================================== The OLD `qwen2.5-coder-0.5b-instruct-fp16.apr` (May-4 import) tags its tensors with dtype=F16 in the APR v2 header — but the SOURCE HF safetensors `model.safetensors` uses dtype=BF16. When the loader sees dtype=F16, it dequantizes via `f16_to_f32`, producing values that diverge from the BF16-correct decode. Element-0 cross-check on `model.norm.weight`: Safetensors source (BF16-decoded): 7.5625, 8.0, 7.21875, ... Old APR (loaded as F16): 7.0625, 7.125, 7.0, ... Fresh APR (loaded as BF16): 7.5625, 8.0, 7.21875, ... ← matches source Element-0 cross-check on `model.layers.0.self_attn.q_proj.bias`: Safetensors (BF16): 0.0674, -0.0859, 0.1104, -0.0605, ... Old APR (F16): (different, distorted) Fresh APR (BF16): 0.0674, -0.0859, 0.1104, -0.0605, ... ← matches source Fix: re-import Qwen safetensors via current `apr import`. The current `StreamingWriter::add_raw_f16_tensor` correctly preserves BF16 (line 100-104 of streaming_writer.rs). The old APR was created with a buggy import path that mis-tagged BF16 as F16. H4 ROOT CAUSE #2: STILL OPEN ============================= Even with correct BF16-decoded weights (fresh APR), val_loss at step 1 is **18.55** — still above ln(vocab)=17.21 (uniform-over-vocab baseline). The dtype fix moved the dial slightly (was 19.80) but did not resolve the sub-random predictions issue. Remaining hypotheses for the residual gap: H4B (layout): some tensor's row/col-major orientation may differ between Qwen export and aprender::Transformer expectations H4D (forward path): cuBLAS / CudaBlock forward may produce wrong logits despite correct weight values Other: tied embedding fall-through path (`lm_head: None` → embed_tokens reuse) may have a sign or scale issue This PR ships the diagnostic infrastructure that PROVED root cause #1 and provides the foundation for bisecting the remaining gap. What this PR ships =================== `falsify_h4_init_stats_qwen_embed_norm_sensible` — a host-gated diagnostic test that: - Loads the Qwen 0.5B init APR (prefers fresh, falls back to legacy) - Reports tensor stats (mean, std, min, max) for embed_tokens, final norm, per-layer norms, q/k/v projections, mlp gates - Asserts sensible bounds (embed std ∈ [0.005, 0.5], norm in [0.01, 100], etc.) - Dumps element-0 values for cross-comparison with safetensors source Industrial validation example output: embed_tokens.weight: mean=0.00014, std=0.0152, range[-0.196, 0.128] — sensible HF LLaMA init scale model.norm.weight: mean=7.46, std=0.84, range[-2.28, 17.38] — Qwen-typical (final norm scaled up) q_proj.bias L0: mean=0.03, std=7.88, range[-65.5, 128] — Qwen-typical (large attention biases) Five-Whys ========== 1. Why was the OLD Qwen APR tagged F16? Created by a buggy import path that didn't pass `is_bf16` flag through to the writer. Fixed in current apr-cli but the artifact is preserved on disk. 2. Why does the fresh APR not fully fix val_loss? The dtype fix makes loaded values match safetensors, but val_loss=18.55 still exceeds ln(vocab)=17.21 — meaning forward path or some other tensor is still producing sub-random predictions. 3. Why didn't existing falsifiers catch the dtype mislabel? No falsifier asserted "loaded values match safetensors source element-by-element". The PMAT-187 NaN/Inf/explosive-mean check passes because BF16-as-F16 distortion produces values that are neither NaN nor unusually large. 4. Why ship the diagnostic before the full H4 fix? The diagnostic itself proves H4 root cause #1 and provides the bisection foundation for #2. Per `feedback_falsifier_first_cascade_pattern.md`, 1 PR ≈ 1 falsifier discharge. The dtype-mislabel discharge is real progress. 5. Why does the operator need to know? They have an old Qwen APR on disk that mis-decodes silently. With this PR's diagnostic they can verify before training; without it, the silent error wastes ~17 hours of GPU time per cycle (per §60 evidence). Quality gates (all green) ========================== - cargo test -p aprender-train --lib falsify_h4_init_stats: PASS - cargo test -p aprender-train --lib: 7585+ tests PASS - cargo clippy -p aprender-train --lib -- -D warnings: clean (--tests has 4 PRE-EXISTING errors on main; not introduced by this PR) - rustfmt --check: clean SHIP-TWO impact ================ - MODEL-1 ship %: unchanged at 91% - MODEL-2 ship %: unchanged at 57% — H4 root cause #1 found and fix available (use fresh APR), but val_loss still >ln(vocab). The next-cycle bisection (H4B or H4D) is now well-targeted. - §60 H1C cascade: FULLY CLOSED per #1598 - §61 evidence: 5g.1-v2 corpus is 7.42 bits entropy / 0% unk - This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-003 (task #22) Out-of-scope follow-ups ======================== PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade): - Bisect H4B (layout): forward-pass element-wise compare against HF Qwen2 reference at each layer - Bisect H4D (forward path): instrument cuBLAS GEMM outputs against a CPU reference matmul - Fix root cause; flip MODEL-2 ship % 57% → ≥58% Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4 tasks
noahgift
added a commit
that referenced
this pull request
May 10, 2026
…al root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004) (#1602) H4 cascade bisection: BUG IS IN CUDA PATH. EMPIRICAL FINDING CPU `aprender::Transformer::forward` on a populated Qwen 0.5B model (fresh APR, BF16-correct dtype) produces SENSIBLE logits: populated: 290/290 tensors logits: n=151936 nan=0 inf=0 min=-15.03 max=11.72 mean=-3.33 std=2.65 peak-to-mean ratio = 5.68 argmax = 9370 (specific token, not flat) This means: - Populate path: GREEN (all 290 Qwen tensors loaded) - CPU forward: GREEN (clean logits, sensible distribution) - lm_head tied-embedding fall-through: GREEN (matmul produces proper logit distribution despite lm_head=None) H4 ROOT CAUSE LOCALIZATION (post this PR): | Component | Pre-this-PR | Post-this-PR | |-----------|-------------|--------------| | BF16 dtype tag | OPEN | FIXED #1 (PR #1601) | | Populate (290/290) | OPEN | FALSIFIED — works ✓ | | CPU forward | OPEN | FALSIFIED — works ✓ | | Tied embedding | OPEN | FALSIFIED — works ✓ | | **CUDA path** | OPEN | **CONFIRMED LIVE BUG** | Empirical contrast: CPU forward: argmax=9370 with confident peak (peak-to-mean=5.68) CUDA eval_batch: val_loss > ln(vocab) = sub-random predictions Same weights, same arch, different backend → CUDA forward path distorts the result. Three CUDA-side sub-hypotheses for the next session: H4D.1 — `CudaTransformerTrainer::with_model` upload distorts weights during H2D transfer H4D.2 — `gpu_forward` CUDA kernels (cuBLAS GEMM, RoPE, fused attention, RMSNorm) produce wrong outputs despite correct inputs H4D.3 — `fused_cross_entropy_cuda` reads from a wrong buffer location (off-by-stride in logits_buf) Five-Whys 1. Why does val_loss=18.55 > ln(vocab)=17.21 with fresh APR? Because the CUDA forward path produces sub-random logits even though CPU forward on the same weights produces sensible ones. 2. Why does CUDA differ from CPU? Because the bug is in one of: GPU upload, GPU kernels, or eval_batch's cross_entropy buffer handling. CPU path is end-to-end clean. 3. Why didn't existing falsifiers catch this? Per `feedback_test_methodology_can_fake_bugs.md`, the CUDA path was validated by convergence on synthetic data (§44/§45) and from-scratch (§50.4 cascade) — both blind to forward-pass parity vs CPU reference. 4. Why ship the CPU bisect instead of fixing CUDA directly? Because pinpointing the bug at the BACKEND boundary (CPU vs CUDA) is the cheapest narrowing. Without this, the next agent would have to re-derive that the CPU side works. 5. Why does this matter for ship %? With H4 narrowed to CUDA, the next falsifier-discharge cascade (PMAT-CODE-PRETRAIN-CUDA-FORWARD-001) has a clear scope: CPU↔CUDA forward parity test, dump per-layer hidden states, identify divergence point. What this PR ships `falsify_h4_cpu_forward_qwen_logits_sensible` — host-gated test that loads Qwen 0.5B (fresh APR preferred), populates a polymorphic Transformer, forward-passes a single token, and asserts: - logits are finite (no NaN/Inf) - logits std > 0.01 (not constant) - peak-to-mean > 1.5 (not uniform) - argmax in [0, vocab_size) (proper shape) Empirical run: PASSES on RTX 4090 host with fresh APR. Quality gates - cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASS - rustfmt --check: clean - cargo clippy -p aprender-train --lib -- -D warnings: clean SHIP-TWO impact - MODEL-1 ship %: unchanged at 91% - MODEL-2 ship %: unchanged at 57% — but H4 is now FULLY LOCALIZED to the CUDA path. The CPU path is provably correct. Next-cycle bisection has a tight scope (3 sub-hypotheses, all CUDA-specific). - This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-004 (task #23) Out-of-scope follow-ups PMAT-CODE-PRETRAIN-CUDA-FORWARD-001: - Author CPU↔CUDA forward parity falsifier on populated Qwen - Bisect H4D.1 (upload), H4D.2 (kernels), H4D.3 (xent buffer) - Fix root cause; flip MODEL-2 ship % 57% → ≥58% Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
H4 cascade bisect: root cause #1 FOUND and fixed (BF16 dtype mislabel in old Qwen APR), root cause #2 STILL OPEN (val_loss=18.55 still > ln(vocab)=17.21 with fresh APR).
This PR ships the diagnostic falsifier infrastructure that pinned root cause #1 via element-by-element cross-check against the HF safetensors source.
H4 Root Cause #1: BF16 dtype mislabel
The OLD
qwen2.5-coder-0.5b-instruct-fp16.apr(May-4 import) tags tensors as dtype=F16 in the APR v2 header. The SOURCE HF safetensors uses dtype=BF16. The loader sees F16 and dequantizes viaf16_to_f32, producing distorted values.Element-0 cross-check on
model.norm.weight(n=896):7.5625, 8.0, 7.21875, 7.3125, 7.46875, 7.3757.0625, 7.125, 7.0, 7.0625, 6.75, 6.875(wrong)7.5625, 8.0, 7.21875, 7.3125, 7.46875, 7.375✓ matchesFix: re-import via current
apr import. The currentStreamingWriter::add_raw_f16_tensorcorrectly preserves BF16 (line 100-104). The old APR was created with a buggy import path that mis-tagged BF16 as F16.H4 Root Cause #2: STILL OPEN
Even with correct BF16 weights (fresh APR), val_loss at step 1 is 18.55 — still above ln(vocab) = 17.21 (uniform-over-vocab). The dtype fix moved the dial slightly (was 19.80 with old APR) but didn't resolve the sub-random predictions.
Remaining hypotheses for the residual gap:
lm_head: None→ embed_tokens reuse) may have sign/scale issueWhat this PR ships
falsify_h4_init_stats_qwen_embed_norm_sensible— a host-gated diagnostic test that:Industrial validation example:
Five-Whys
is_bf16flag through to the writer. Fixed in current apr-cli; old artifact preserved on disk.feedback_falsifier_first_cascade_pattern.md, 1 PR ≈ 1 falsifier discharge.Test plan
cargo test -p aprender-train --lib falsify_h4_init_stats: PASScargo test -p aprender-train --lib: 7585+ tests PASScargo clippy -p aprender-train --lib -- -D warnings: clean (lib-only; --tests has 4 PRE-EXISTING errors on main, not introduced by this PR)rustfmt --check: cleanSHIP-TWO impact
Out-of-scope follow-ups
PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade):
Files
crates/aprender-train/src/train/pretrain_real.rs(+189/-77, falsify_h4_init_stats_qwen_embed_norm_sensible test)🤖 Generated with Claude Code