
feat(aprender-train): H4 init-load diagnostic — finds BF16 dtype mislabel root cause #1 (PMAT-CODE-PRETRAIN-INIT-LOAD-003)#1601

Open
noahgift wants to merge 1 commit into main from feat/h4-bisect-init-load

Conversation

@noahgift
Contributor

TL;DR

H4 cascade bisect: root cause #1 FOUND and fixed (BF16 dtype mislabel in old Qwen APR); root cause #2 STILL OPEN (val_loss=18.55 still > log2(vocab)=17.21 with fresh APR).

This PR ships the diagnostic falsifier infrastructure that pinned root cause #1 via element-by-element cross-check against the HF safetensors source.

H4 Root Cause #1: BF16 dtype mislabel

The OLD qwen2.5-coder-0.5b-instruct-fp16.apr (May-4 import) tags tensors as dtype=F16 in the APR v2 header. The SOURCE HF safetensors uses dtype=BF16. The loader sees F16 and dequantizes via f16_to_f32, producing distorted values.
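Why a dtype mislabel distorts values at all: F16 and BF16 are both 16-bit, but split the bits differently (F16: 5 exponent / 10 mantissa bits; BF16: 8 / 7), so the same bit pattern decodes to different numbers. A minimal sketch using the `half` crate — illustrative only; the exact distorted values in the old APR depend on what the buggy import path actually wrote:

```rust
// Cargo.toml: half = "2" (assumed; any recent version with f16/bf16 works)
use half::{bf16, f16};

fn main() {
    // bfloat16 encoding of 7.5625 (the first model.norm.weight element).
    let bits = bf16::from_f32(7.5625).to_bits(); // 0x40F2

    let as_bf16 = bf16::from_bits(bits).to_f32(); // correct decode: 7.5625
    let as_f16 = f16::from_bits(bits).to_f32();   // mislabeled decode: ~2.47

    println!("bits=0x{bits:04x} bf16={as_bf16} f16={as_f16}");
    assert_ne!(as_bf16, as_f16); // same bits, different dtype tag, different value
}
```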

Element-0 cross-check on model.norm.weight (n=896):

| Source | Values [0..6] |
|--------|---------------|
| Safetensors (BF16-decoded) | 7.5625, 8.0, 7.21875, 7.3125, 7.46875, 7.375 |
| Old APR (loaded as F16) | 7.0625, 7.125, 7.0, 7.0625, 6.75, 6.875 (wrong) |
| Fresh APR (loaded as BF16) | 7.5625, 8.0, 7.21875, 7.3125, 7.46875, 7.375 ✓ matches |

Fix: re-import via the current apr import. The current StreamingWriter::add_raw_f16_tensor correctly preserves BF16 (lines 100-104 of streaming_writer.rs). The old APR was created with a buggy import path that mis-tagged BF16 as F16.

H4 Root Cause #2: STILL OPEN

Even with correct BF16 weights (fresh APR), val_loss at step 1 is 18.55 — still above log2(vocab) = 17.21, the bits-per-token loss of a uniform-over-vocab predictor (log2 151936 ≈ 17.21). The dtype fix moved the dial slightly (was 19.80 with the old APR) but didn't resolve the sub-random predictions.

Remaining hypotheses for the residual gap:

  • H4B layout: tensor row/col-major mismatch between Qwen export and aprender expectations
  • H4D forward path: cuBLAS / CudaBlock may produce wrong logits despite correct weights
  • Other: tied embedding fall-through (lm_head: None → embed_tokens reuse) may have a sign/scale issue (sketched below)
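
On the tied-embedding item above — a minimal sketch of the fall-through under suspicion (shapes and names are hypothetical, not aprender's actual Transformer API):

```rust
// With lm_head = None, logits come from the embedding matrix reused
// as the output projection: logits[v] = dot(hidden, embed_row_v).
fn tied_lm_logits(hidden: &[f32], embed: &[f32], vocab: usize, dim: usize) -> Vec<f32> {
    assert_eq!(hidden.len(), dim);
    assert_eq!(embed.len(), vocab * dim);
    (0..vocab)
        .map(|v| {
            // A sign or scale bug in this reuse path would flatten or
            // invert the logit distribution even with correct weights.
            (0..dim).map(|d| hidden[d] * embed[v * dim + d]).sum::<f32>()
        })
        .collect()
}
```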

What this PR ships

falsify_h4_init_stats_qwen_embed_norm_sensible — a host-gated diagnostic test that:

  • Loads Qwen 0.5B init APR (prefers fresh, falls back to legacy)
  • Reports tensor stats (mean, std, min, max) for key tensors
  • Dumps element-0 values for cross-comparison with safetensors source
  • Asserts sensible bounds (embed std ∈ [0.005, 0.5], norm ∈ [0.01, 100]) — see the sketch below
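
A minimal sketch of the bound checks (helper names are hypothetical; the real assertions live in falsify_h4_init_stats_qwen_embed_norm_sensible and may bound different statistics):

```rust
// Compute mean and std of a tensor buffer.
fn mean_std(t: &[f32]) -> (f32, f32) {
    let n = t.len() as f32;
    let mean = t.iter().sum::<f32>() / n;
    let var = t.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    (mean, var.sqrt())
}

// Assert the stats fall in the "sensible init" windows quoted above.
fn assert_sensible(embed: &[f32], norm: &[f32]) {
    let (_, embed_std) = mean_std(embed);
    assert!((0.005f32..=0.5).contains(&embed_std), "embed std {embed_std} outside [0.005, 0.5]");
    let (norm_mean, _) = mean_std(norm);
    assert!((0.01f32..=100.0).contains(&norm_mean), "norm mean {norm_mean} outside [0.01, 100]");
}
```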

Industrial validation example:

embed_tokens.weight: mean=0.00014, std=0.0152, range[-0.196, 0.128] — HF LLaMA init scale ✓
model.norm.weight:   mean=7.46,    std=0.84,   range[-2.28, 17.38]  — Qwen-typical ✓
q_proj.bias L0:      mean=0.03,    std=7.88,   range[-65.5, 128]    — Qwen-typical ✓

Five-Whys

  1. Why was the OLD Qwen APR tagged F16? Created by a buggy import path that didn't pass the is_bf16 flag through to the writer. Fixed in the current apr-cli; the old artifact is preserved on disk.
  2. Why does the fresh APR not fully fix val_loss? The dtype fix makes loaded values match safetensors, but val_loss=18.55 still exceeds log2(vocab)=17.21 — the forward path or some other tensor is still producing sub-random predictions.
  3. Why didn't existing falsifiers catch the dtype mislabel? No falsifier asserted "loaded values match safetensors source element-by-element". The PMAT-187 NaN/Inf/explosive-mean check passes because BF16-as-F16 distortion produces values that are neither NaN nor unusually large (see the sketch after this list).
  4. Why ship the diagnostic before the full H4 fix? The diagnostic itself proves H4 root cause #1 AND provides the bisection foundation for root cause #2. Per feedback_falsifier_first_cascade_pattern.md, 1 PR ≈ 1 falsifier discharge.
  5. Why does the operator need to know? They have an old Qwen APR that mis-decodes silently. With this diagnostic they can verify before training; without it, the silent error wastes ~17 hours of GPU time per cycle.
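
On Why #3 — a minimal sketch (hypothetical helper, not PMAT-187's actual code) of why a NaN/Inf/explosive-mean gate is blind to the mislabel: the distorted values are finite and ordinary-sized.

```rust
// Gate passes iff all values are finite and the mean is unremarkable.
fn passes_nan_inf_mean_gate(t: &[f32], max_abs_mean: f32) -> bool {
    let finite = t.iter().all(|x| x.is_finite());
    let mean = t.iter().sum::<f32>() / t.len() as f32;
    finite && mean.abs() < max_abs_mean
}

fn main() {
    // Old-APR decode of model.norm.weight[0..6] — wrong, yet unremarkable.
    let distorted: [f32; 6] = [7.0625, 7.125, 7.0, 7.0625, 6.75, 6.875];
    assert!(passes_nan_inf_mean_gate(&distorted, 100.0)); // gate is blind to the bug
}
```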

Test plan

  • cargo test -p aprender-train --lib falsify_h4_init_stats: PASS
  • cargo test -p aprender-train --lib: 7585+ tests PASS
  • cargo clippy -p aprender-train --lib -- -D warnings: clean (lib-only; --tests has 4 PRE-EXISTING errors on main, not introduced by this PR)
  • rustfmt --check: clean
  • LIVE diagnostic on RTX 4090: confirmed dtype mismatch + element-0 divergence

SHIP-TWO impact

  • MODEL-1 ship %: unchanged at 91%
  • MODEL-2 ship %: unchanged at 57% — root cause #1 found and fix available (use fresh APR), but val_loss still > log2(vocab); the next-cycle bisection (H4B or H4D) is now well-targeted
  • §60 H1C cascade: FULLY CLOSED per #1598
  • This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-003 (task #22)

Out-of-scope follow-ups

PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade):

  • Bisect H4B (layout): forward-pass element-wise compare against HF Qwen2 reference at each layer
  • Bisect H4D (forward path): instrument cuBLAS GEMM outputs against a CPU reference matmul
  • Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Files

  • crates/aprender-train/src/train/pretrain_real.rs (+189/-77, falsify_h4_init_stats_qwen_embed_norm_sensible test)

🤖 Generated with Claude Code

…ETRAIN-INIT-LOAD-003)

Bisects the §61 val_loss > log2(vocab) anomaly. Empirical findings on
lambda-vector RTX 4090:

H4 ROOT CAUSE #1: BF16 dtype mislabel
======================================

The OLD `qwen2.5-coder-0.5b-instruct-fp16.apr` (May-4 import) tags
its tensors with dtype=F16 in the APR v2 header — but the SOURCE
HF safetensors `model.safetensors` uses dtype=BF16. When the loader
sees dtype=F16, it dequantizes via `f16_to_f32`, producing values
that diverge from the BF16-correct decode.

Element-0 cross-check on `model.norm.weight`:
  Safetensors source (BF16-decoded): 7.5625, 8.0, 7.21875, ...
  Old APR (loaded as F16):           7.0625, 7.125, 7.0, ...
  Fresh APR (loaded as BF16):        7.5625, 8.0, 7.21875, ...  ← matches source

Element-0 cross-check on `model.layers.0.self_attn.q_proj.bias`:
  Safetensors (BF16):  0.0674, -0.0859, 0.1104, -0.0605, ...
  Old APR (F16):       (different, distorted)
  Fresh APR (BF16):    0.0674, -0.0859, 0.1104, -0.0605, ...  ← matches source

Fix: re-import Qwen safetensors via the current `apr import`. The current
`StreamingWriter::add_raw_f16_tensor` correctly preserves BF16 (lines
100-104 of streaming_writer.rs). The old APR was created with a buggy
import path that mis-tagged BF16 as F16.

H4 ROOT CAUSE #2: STILL OPEN
=============================

Even with correct BF16-decoded weights (fresh APR), val_loss at step 1
is **18.55** — still above log2(vocab)=17.21 (the uniform-over-vocab
baseline). The dtype fix moved the dial slightly (was 19.80) but did
not resolve the sub-random predictions.

Remaining hypotheses for the residual gap:
  H4B (layout): some tensor's row/col-major orientation may differ
       between Qwen export and aprender::Transformer expectations
       (sketched after this list)
  H4D (forward path): cuBLAS / CudaBlock forward may produce
       wrong logits despite correct weight values
  Other: tied embedding fall-through path (`lm_head: None` →
       embed_tokens reuse) may have a sign or scale issue
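
A minimal illustration of the H4B layout hypothesis (illustrative helper, not aprender code): the same weight buffer read row-major vs column-major computes W·x vs Wᵀ·x, corrupting logits while every individual element still looks sensible.

```rust
// Matrix-vector product where only the indexing convention differs.
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize, row_major: bool) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            (0..cols)
                .map(|c| {
                    // Row-major: w[r][c] at r*cols + c; column-major: at c*rows + r.
                    let idx = if row_major { r * cols + c } else { c * rows + r };
                    w[idx] * x[c]
                })
                .sum::<f32>()
        })
        .collect()
}
```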

This PR ships the diagnostic infrastructure that PROVED root cause #1
and provides the foundation for bisecting the remaining gap.

What this PR ships
===================

`falsify_h4_init_stats_qwen_embed_norm_sensible` — a host-gated
diagnostic test that:
  - Loads the Qwen 0.5B init APR (prefers fresh, falls back to legacy)
  - Reports tensor stats (mean, std, min, max) for embed_tokens,
    final norm, per-layer norms, q/k/v projections, mlp gates
  - Asserts sensible bounds (embed std ∈ [0.005, 0.5], norm in
    [0.01, 100], etc.)
  - Dumps element-0 values for cross-comparison with safetensors
    source

Industrial validation example output:
  embed_tokens.weight: mean=0.00014, std=0.0152, range[-0.196, 0.128]
                       — sensible HF LLaMA init scale
  model.norm.weight:   mean=7.46, std=0.84, range[-2.28, 17.38]
                       — Qwen-typical (final norm scaled up)
  q_proj.bias L0:      mean=0.03, std=7.88, range[-65.5, 128]
                       — Qwen-typical (large attention biases)

Five-Whys
==========

1. Why was the OLD Qwen APR tagged F16? Created by a buggy import
   path that didn't pass the `is_bf16` flag through to the writer.
   Fixed in the current apr-cli, but the old artifact is preserved on disk.
2. Why does the fresh APR not fully fix val_loss? The dtype fix
   makes loaded values match safetensors, but val_loss=18.55 still
   exceeds log2(vocab)=17.21 — meaning the forward path or some other
   tensor is still producing sub-random predictions.
3. Why didn't existing falsifiers catch the dtype mislabel? No
   falsifier asserted "loaded values match safetensors source
   element-by-element". The PMAT-187 NaN/Inf/explosive-mean check
   passes because BF16-as-F16 distortion produces values that are
   neither NaN nor unusually large.
4. Why ship the diagnostic before the full H4 fix? The diagnostic
   itself proves H4 root cause #1 and provides the bisection
   foundation for #2. Per `feedback_falsifier_first_cascade_pattern.md`,
   1 PR ≈ 1 falsifier discharge. The dtype-mislabel discharge is
   real progress.
5. Why does the operator need to know? They have an old Qwen APR
   on disk that mis-decodes silently. With this PR's diagnostic
   they can verify before training; without it, the silent error
   wastes ~17 hours of GPU time per cycle (per §60 evidence).

Quality gates (all green)
==========================

- cargo test -p aprender-train --lib falsify_h4_init_stats: PASS
- cargo test -p aprender-train --lib: 7585+ tests PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
  (--tests has 4 PRE-EXISTING errors on main; not introduced by this PR)
- rustfmt --check: clean

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57% — H4 root cause #1 found and fix
  available (use fresh APR), but val_loss still > log2(vocab). The
  next-cycle bisection (H4B or H4D) is now well-targeted.
- §60 H1C cascade: FULLY CLOSED per #1598
- §61 evidence: 5g.1-v2 corpus is 7.42 bits entropy / 0% unk
- This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-003 (task #22)

Out-of-scope follow-ups
========================

PMAT-CODE-PRETRAIN-INIT-LOAD-004 (H4 residual cascade):
  - Bisect H4B (layout): forward-pass element-wise compare against
    HF Qwen2 reference at each layer
  - Bisect H4D (forward path): instrument cuBLAS GEMM outputs
    against a CPU reference matmul
  - Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 10, 2026 07:45
noahgift added a commit that referenced this pull request May 10, 2026
…al root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004) (#1602)

H4 cascade bisection: BUG IS IN CUDA PATH.

EMPIRICAL FINDING

CPU `aprender::Transformer::forward` on a populated Qwen 0.5B
model (fresh APR, BF16-correct dtype) produces SENSIBLE logits:

  populated: 290/290 tensors
  logits: n=151936 nan=0 inf=0
          min=-15.03 max=11.72 mean=-3.33 std=2.65
          peak-to-mean ratio = 5.68
          argmax = 9370 (specific token, not flat)

This means:
  - Populate path: GREEN (all 290 Qwen tensors loaded)
  - CPU forward: GREEN (clean logits, sensible distribution)
  - lm_head tied-embedding fall-through: GREEN (matmul produces
    proper logit distribution despite lm_head=None)
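
The quoted peak-to-mean ratio is consistent with (max − mean) / std: (11.72 − (−3.33)) / 2.65 ≈ 5.68. A minimal sketch of that metric under this assumed definition (not necessarily the test's actual code):

```rust
// How far the top logit sits above the bulk, in standard deviations.
fn peak_to_mean(logits: &[f32]) -> f32 {
    let n = logits.len() as f32;
    let mean = logits.iter().sum::<f32>() / n;
    let std = (logits.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n).sqrt();
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    (max - mean) / std // ≈5.68 for the CPU run quoted above
}
```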

H4 ROOT CAUSE LOCALIZATION (post this PR):

| Component | Pre-this-PR | Post-this-PR |
|-----------|-------------|--------------|
| BF16 dtype tag | OPEN | FIXED #1 (PR #1601) |
| Populate (290/290) | OPEN | FALSIFIED — works ✓ |
| CPU forward | OPEN | FALSIFIED — works ✓ |
| Tied embedding | OPEN | FALSIFIED — works ✓ |
| **CUDA path** | OPEN | **CONFIRMED LIVE BUG** |

Empirical contrast:
  CPU forward: argmax=9370 with confident peak (peak-to-mean=5.68)
  CUDA eval_batch: val_loss > log2(vocab) = sub-random predictions

Same weights, same arch, different backend → CUDA forward path
distorts the result. Three CUDA-side sub-hypotheses for the next
session:
  H4D.1 — `CudaTransformerTrainer::with_model` upload distorts
          weights during H2D transfer
  H4D.2 — `gpu_forward` CUDA kernels (cuBLAS GEMM, RoPE, fused
          attention, RMSNorm) produce wrong outputs despite correct
          inputs
  H4D.3 — `fused_cross_entropy_cuda` reads from a wrong buffer
          location (off-by-stride in logits_buf)
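
A minimal sketch of the parity check that would discriminate among these sub-hypotheses (hypothetical helper and tolerance; the real falsifier is the PMAT-CODE-PRETRAIN-CUDA-FORWARD-001 follow-up):

```rust
// Element-wise relative comparison of CPU and CUDA logits on the
// same weights and input; the first diverging index localizes the bug.
fn assert_forward_parity(cpu: &[f32], gpu: &[f32], rel_tol: f32) {
    assert_eq!(cpu.len(), gpu.len(), "logit count mismatch");
    for (i, (c, g)) in cpu.iter().zip(gpu.iter()).enumerate() {
        let rel = (c - g).abs() / c.abs().max(1e-6);
        assert!(rel <= rel_tol, "logit {i} diverges: cpu={c} gpu={g} rel={rel}");
    }
}
```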

Five-Whys

1. Why does val_loss=18.55 > log2(vocab)=17.21 with fresh APR?
   Because the CUDA forward path produces sub-random logits even
   though CPU forward on the same weights produces sensible ones.
2. Why does CUDA differ from CPU? Because the bug is in one of:
   GPU upload, GPU kernels, or eval_batch's cross_entropy buffer
   handling. CPU path is end-to-end clean.
3. Why didn't existing falsifiers catch this? Per `feedback_test_methodology_can_fake_bugs.md`,
   the CUDA path was validated by convergence on synthetic data
   (§44/§45) and from-scratch (§50.4 cascade) — both blind to
   forward-pass parity vs CPU reference.
4. Why ship the CPU bisect instead of fixing CUDA directly?
   Because pinpointing the bug at the BACKEND boundary (CPU vs
   CUDA) is the cheapest narrowing. Without this, the next agent
   would have to re-derive that the CPU side works.
5. Why does this matter for ship %? With H4 narrowed to CUDA,
   the next falsifier-discharge cascade (PMAT-CODE-PRETRAIN-CUDA-FORWARD-001)
   has a clear scope: CPU↔CUDA forward parity test, dump per-layer
   hidden states, identify divergence point.

What this PR ships

`falsify_h4_cpu_forward_qwen_logits_sensible` — host-gated test
that loads Qwen 0.5B (fresh APR preferred), populates a polymorphic
Transformer, forward-passes a single token, and asserts:
  - logits are finite (no NaN/Inf)
  - logits std > 0.01 (not constant)
  - peak-to-mean > 1.5 (not uniform)
  - argmax in [0, vocab_size) (proper shape)

Empirical run: PASSES on RTX 4090 host with fresh APR.

Quality gates

- cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASS
- rustfmt --check: clean
- cargo clippy -p aprender-train --lib -- -D warnings: clean

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57% — but H4 is now FULLY LOCALIZED
  to the CUDA path. The CPU path is provably correct. Next-cycle
  bisection has a tight scope (3 sub-hypotheses, all CUDA-specific).
- This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-004 (task #23)

Out-of-scope follow-ups

PMAT-CODE-PRETRAIN-CUDA-FORWARD-001:
  - Author CPU↔CUDA forward parity falsifier on populated Qwen
  - Bisect H4D.1 (upload), H4D.2 (kernels), H4D.3 (xent buffer)
  - Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>