feat(aprender-train): H4 init-load diagnostic — finds BF16 dtype mislabel root cause #1 (PMAT-CODE-PRETRAIN-INIT-LOAD-003) by noahgift · Pull Request #1601 · paiml/aprender

noahgift · 2026-05-10T07:45:42Z

TL;DR

H4 cascade bisect: root cause #1 FOUND and fixed (BF16 dtype mislabel in old Qwen APR), root cause #2 STILL OPEN (val_loss=18.55 still > ln(vocab)=17.21 with fresh APR).

This PR ships the diagnostic falsifier infrastructure that pinned root cause #1 via element-by-element cross-check against the HF safetensors source.

H4 Root Cause #1: BF16 dtype mislabel

The OLD qwen2.5-coder-0.5b-instruct-fp16.apr (May-4 import) tags tensors as dtype=F16 in the APR v2 header. The SOURCE HF safetensors uses dtype=BF16. The loader sees F16 and dequantizes via f16_to_f32, producing distorted values.

Element-0 cross-check on model.norm.weight (n=896):

Source	Values [0..6]
Safetensors (BF16-decoded)	`7.5625, 8.0, 7.21875, 7.3125, 7.46875, 7.375`
Old APR (loaded as F16)	`7.0625, 7.125, 7.0, 7.0625, 6.75, 6.875` (wrong)
Fresh APR (loaded as BF16)	`7.5625, 8.0, 7.21875, 7.3125, 7.46875, 7.375` ✓ matches

Fix: re-import via current apr import. The current StreamingWriter::add_raw_f16_tensor correctly preserves BF16 (line 100-104). The old APR was created with a buggy import path that mis-tagged BF16 as F16.

H4 Root Cause #2: STILL OPEN

Even with correct BF16 weights (fresh APR), val_loss at step 1 is 18.55 — still above ln(vocab) = 17.21 (uniform-over-vocab). The dtype fix moved the dial slightly (was 19.80 with old APR) but didn't resolve the sub-random predictions.

Remaining hypotheses for the residual gap:

H4B layout: tensor row/col-major mismatch between Qwen export and aprender expectations
H4D forward path: cuBLAS / CudaBlock may produce wrong logits despite correct weights
Other: tied embedding fall-through (lm_head: None → embed_tokens reuse) may have sign/scale issue

What this PR ships

falsify_h4_init_stats_qwen_embed_norm_sensible — a host-gated diagnostic test that:

Loads Qwen 0.5B init APR (prefers fresh, falls back to legacy)
Reports tensor stats (mean, std, min, max) for key tensors
Dumps element-0 values for cross-comparison with safetensors source
Asserts sensible bounds (embed std ∈ [0.005, 0.5], norm ∈ [0.01, 100])

Industrial validation example:

embed_tokens.weight: mean=0.00014, std=0.0152, range[-0.196, 0.128] — HF LLaMA init scale ✓
model.norm.weight:   mean=7.46,    std=0.84,   range[-2.28, 17.38]  — Qwen-typical ✓
q_proj.bias L0:      mean=0.03,    std=7.88,   range[-65.5, 128]    — Qwen-typical ✓

Five-Whys

Why was the OLD Qwen APR tagged F16? Created by a buggy import path that didn't pass is_bf16 flag through to the writer. Fixed in current apr-cli; old artifact preserved on disk.
Why does the fresh APR not fully fix val_loss? The dtype fix makes loaded values match safetensors, but val_loss=18.55 still exceeds ln(vocab)=17.21 — forward path or some other tensor is still producing sub-random predictions.
Why didn't existing falsifiers catch the dtype mislabel? No falsifier asserted "loaded values match safetensors source element-by-element". PMAT-187 NaN/Inf/explosive-mean check passes because BF16-as-F16 distortion produces values that are neither NaN nor unusually large.
Why ship the diagnostic before the full H4 fix? The diagnostic itself proves root cause Feature Request: Decision Tree & Random Forest for Classification Tasks #1 AND provides the bisection foundation for Feature Request: Cross-Validation Utilities #2. Per feedback_falsifier_first_cascade_pattern.md, 1 PR ≈ 1 falsifier discharge.
Why does the operator need to know? They have an old Qwen APR that mis-decodes silently. With this diagnostic they can verify before training; without it, the silent error wastes ~17 hours of GPU time per cycle.

Test plan

cargo test -p aprender-train --lib falsify_h4_init_stats: PASS
cargo test -p aprender-train --lib: 7585+ tests PASS
cargo clippy -p aprender-train --lib -- -D warnings: clean (lib-only; --tests has 4 PRE-EXISTING errors on main, not introduced by this PR)
rustfmt --check: clean
LIVE diagnostic on RTX 4090: confirmed dtype mismatch + element-0 divergence

SHIP-TWO impact

MODEL-1 ship %: unchanged at 91%
MODEL-2 ship %: unchanged at 57% — H4 root cause Feature Request: Decision Tree & Random Forest for Classification Tasks #1 found and fix available (use fresh APR), but val_loss still > ln(vocab). The next-cycle bisection (H4B or H4D) is now well-targeted.
§60 H1C cascade: FULLY CLOSED per fix(apr-cli): upfront vocab-format detection unblocks Qwen encoding (PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001) #1598
§61 evidence: 5g.1-v2 corpus is 7.42 bits entropy / 0% unk

Out-of-scope follow-ups

PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade):

Bisect H4B (layout): forward-pass element-wise compare against HF Qwen2 reference at each layer
Bisect H4D (forward path): instrument cuBLAS GEMM outputs against a CPU reference matmul
Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Files

crates/aprender-train/src/train/pretrain_real.rs (+189/-77, falsify_h4_init_stats_qwen_embed_norm_sensible test)

🤖 Generated with Claude Code

…ETRAIN-INIT-LOAD-003) Bisects the §61 val_loss > ln(vocab) anomaly. Empirical findings on lambda-vector RTX 4090: H4 ROOT CAUSE #1: BF16 dtype mislabel ====================================== The OLD `qwen2.5-coder-0.5b-instruct-fp16.apr` (May-4 import) tags its tensors with dtype=F16 in the APR v2 header — but the SOURCE HF safetensors `model.safetensors` uses dtype=BF16. When the loader sees dtype=F16, it dequantizes via `f16_to_f32`, producing values that diverge from the BF16-correct decode. Element-0 cross-check on `model.norm.weight`: Safetensors source (BF16-decoded): 7.5625, 8.0, 7.21875, ... Old APR (loaded as F16): 7.0625, 7.125, 7.0, ... Fresh APR (loaded as BF16): 7.5625, 8.0, 7.21875, ... ← matches source Element-0 cross-check on `model.layers.0.self_attn.q_proj.bias`: Safetensors (BF16): 0.0674, -0.0859, 0.1104, -0.0605, ... Old APR (F16): (different, distorted) Fresh APR (BF16): 0.0674, -0.0859, 0.1104, -0.0605, ... ← matches source Fix: re-import Qwen safetensors via current `apr import`. The current `StreamingWriter::add_raw_f16_tensor` correctly preserves BF16 (line 100-104 of streaming_writer.rs). The old APR was created with a buggy import path that mis-tagged BF16 as F16. H4 ROOT CAUSE #2: STILL OPEN ============================= Even with correct BF16-decoded weights (fresh APR), val_loss at step 1 is **18.55** — still above ln(vocab)=17.21 (uniform-over-vocab baseline). The dtype fix moved the dial slightly (was 19.80) but did not resolve the sub-random predictions issue. Remaining hypotheses for the residual gap: H4B (layout): some tensor's row/col-major orientation may differ between Qwen export and aprender::Transformer expectations H4D (forward path): cuBLAS / CudaBlock forward may produce wrong logits despite correct weight values Other: tied embedding fall-through path (`lm_head: None` → embed_tokens reuse) may have a sign or scale issue This PR ships the diagnostic infrastructure that PROVED root cause #1 and provides the foundation for bisecting the remaining gap. What this PR ships =================== `falsify_h4_init_stats_qwen_embed_norm_sensible` — a host-gated diagnostic test that: - Loads the Qwen 0.5B init APR (prefers fresh, falls back to legacy) - Reports tensor stats (mean, std, min, max) for embed_tokens, final norm, per-layer norms, q/k/v projections, mlp gates - Asserts sensible bounds (embed std ∈ [0.005, 0.5], norm in [0.01, 100], etc.) - Dumps element-0 values for cross-comparison with safetensors source Industrial validation example output: embed_tokens.weight: mean=0.00014, std=0.0152, range[-0.196, 0.128] — sensible HF LLaMA init scale model.norm.weight: mean=7.46, std=0.84, range[-2.28, 17.38] — Qwen-typical (final norm scaled up) q_proj.bias L0: mean=0.03, std=7.88, range[-65.5, 128] — Qwen-typical (large attention biases) Five-Whys ========== 1. Why was the OLD Qwen APR tagged F16? Created by a buggy import path that didn't pass `is_bf16` flag through to the writer. Fixed in current apr-cli but the artifact is preserved on disk. 2. Why does the fresh APR not fully fix val_loss? The dtype fix makes loaded values match safetensors, but val_loss=18.55 still exceeds ln(vocab)=17.21 — meaning forward path or some other tensor is still producing sub-random predictions. 3. Why didn't existing falsifiers catch the dtype mislabel? No falsifier asserted "loaded values match safetensors source element-by-element". The PMAT-187 NaN/Inf/explosive-mean check passes because BF16-as-F16 distortion produces values that are neither NaN nor unusually large. 4. Why ship the diagnostic before the full H4 fix? The diagnostic itself proves H4 root cause #1 and provides the bisection foundation for #2. Per `feedback_falsifier_first_cascade_pattern.md`, 1 PR ≈ 1 falsifier discharge. The dtype-mislabel discharge is real progress. 5. Why does the operator need to know? They have an old Qwen APR on disk that mis-decodes silently. With this PR's diagnostic they can verify before training; without it, the silent error wastes ~17 hours of GPU time per cycle (per §60 evidence). Quality gates (all green) ========================== - cargo test -p aprender-train --lib falsify_h4_init_stats: PASS - cargo test -p aprender-train --lib: 7585+ tests PASS - cargo clippy -p aprender-train --lib -- -D warnings: clean (--tests has 4 PRE-EXISTING errors on main; not introduced by this PR) - rustfmt --check: clean SHIP-TWO impact ================ - MODEL-1 ship %: unchanged at 91% - MODEL-2 ship %: unchanged at 57% — H4 root cause #1 found and fix available (use fresh APR), but val_loss still >ln(vocab). The next-cycle bisection (H4B or H4D) is now well-targeted. - §60 H1C cascade: FULLY CLOSED per #1598 - §61 evidence: 5g.1-v2 corpus is 7.42 bits entropy / 0% unk - This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-003 (task #22) Out-of-scope follow-ups ======================== PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade): - Bisect H4B (layout): forward-pass element-wise compare against HF Qwen2 reference at each layer - Bisect H4D (forward path): instrument cuBLAS GEMM outputs against a CPU reference matmul - Fix root cause; flip MODEL-2 ship % 57% → ≥58% Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…al root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004) (#1602) H4 cascade bisection: BUG IS IN CUDA PATH. EMPIRICAL FINDING CPU `aprender::Transformer::forward` on a populated Qwen 0.5B model (fresh APR, BF16-correct dtype) produces SENSIBLE logits: populated: 290/290 tensors logits: n=151936 nan=0 inf=0 min=-15.03 max=11.72 mean=-3.33 std=2.65 peak-to-mean ratio = 5.68 argmax = 9370 (specific token, not flat) This means: - Populate path: GREEN (all 290 Qwen tensors loaded) - CPU forward: GREEN (clean logits, sensible distribution) - lm_head tied-embedding fall-through: GREEN (matmul produces proper logit distribution despite lm_head=None) H4 ROOT CAUSE LOCALIZATION (post this PR): | Component | Pre-this-PR | Post-this-PR | |-----------|-------------|--------------| | BF16 dtype tag | OPEN | FIXED #1 (PR #1601) | | Populate (290/290) | OPEN | FALSIFIED — works ✓ | | CPU forward | OPEN | FALSIFIED — works ✓ | | Tied embedding | OPEN | FALSIFIED — works ✓ | | **CUDA path** | OPEN | **CONFIRMED LIVE BUG** | Empirical contrast: CPU forward: argmax=9370 with confident peak (peak-to-mean=5.68) CUDA eval_batch: val_loss > ln(vocab) = sub-random predictions Same weights, same arch, different backend → CUDA forward path distorts the result. Three CUDA-side sub-hypotheses for the next session: H4D.1 — `CudaTransformerTrainer::with_model` upload distorts weights during H2D transfer H4D.2 — `gpu_forward` CUDA kernels (cuBLAS GEMM, RoPE, fused attention, RMSNorm) produce wrong outputs despite correct inputs H4D.3 — `fused_cross_entropy_cuda` reads from a wrong buffer location (off-by-stride in logits_buf) Five-Whys 1. Why does val_loss=18.55 > ln(vocab)=17.21 with fresh APR? Because the CUDA forward path produces sub-random logits even though CPU forward on the same weights produces sensible ones. 2. Why does CUDA differ from CPU? Because the bug is in one of: GPU upload, GPU kernels, or eval_batch's cross_entropy buffer handling. CPU path is end-to-end clean. 3. Why didn't existing falsifiers catch this? Per `feedback_test_methodology_can_fake_bugs.md`, the CUDA path was validated by convergence on synthetic data (§44/§45) and from-scratch (§50.4 cascade) — both blind to forward-pass parity vs CPU reference. 4. Why ship the CPU bisect instead of fixing CUDA directly? Because pinpointing the bug at the BACKEND boundary (CPU vs CUDA) is the cheapest narrowing. Without this, the next agent would have to re-derive that the CPU side works. 5. Why does this matter for ship %? With H4 narrowed to CUDA, the next falsifier-discharge cascade (PMAT-CODE-PRETRAIN-CUDA-FORWARD-001) has a clear scope: CPU↔CUDA forward parity test, dump per-layer hidden states, identify divergence point. What this PR ships `falsify_h4_cpu_forward_qwen_logits_sensible` — host-gated test that loads Qwen 0.5B (fresh APR preferred), populates a polymorphic Transformer, forward-passes a single token, and asserts: - logits are finite (no NaN/Inf) - logits std > 0.01 (not constant) - peak-to-mean > 1.5 (not uniform) - argmax in [0, vocab_size) (proper shape) Empirical run: PASSES on RTX 4090 host with fresh APR. Quality gates - cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASS - rustfmt --check: clean - cargo clippy -p aprender-train --lib -- -D warnings: clean SHIP-TWO impact - MODEL-1 ship %: unchanged at 91% - MODEL-2 ship %: unchanged at 57% — but H4 is now FULLY LOCALIZED to the CUDA path. The CPU path is provably correct. Next-cycle bisection has a tight scope (3 sub-hypotheses, all CUDA-specific). - This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-004 (task #23) Out-of-scope follow-ups PMAT-CODE-PRETRAIN-CUDA-FORWARD-001: - Author CPU↔CUDA forward parity falsifier on populated Qwen - Bisect H4D.1 (upload), H4D.2 (kernels), H4D.3 (xent buffer) - Fix root cause; flip MODEL-2 ship % 57% → ≥58% Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 10, 2026 07:45

noahgift mentioned this pull request May 10, 2026

test(aprender-train): H4 CPU forward bisect — CUDA path is the residual root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004) #1602

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(aprender-train): H4 init-load diagnostic — finds BF16 dtype mislabel root cause #1 (PMAT-CODE-PRETRAIN-INIT-LOAD-003)#1601

feat(aprender-train): H4 init-load diagnostic — finds BF16 dtype mislabel root cause #1 (PMAT-CODE-PRETRAIN-INIT-LOAD-003)#1601
noahgift wants to merge 1 commit intomainfrom
feat/h4-bisect-init-load

noahgift commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 10, 2026

TL;DR

H4 Root Cause #1: BF16 dtype mislabel

H4 Root Cause #2: STILL OPEN

What this PR ships

Five-Whys

Test plan

SHIP-TWO impact

Out-of-scope follow-ups

Files

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant