Skip to content

test(aprender-train): H4 CPU forward bisect — CUDA path is the residual root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004)#1602

Merged
noahgift merged 1 commit intomainfrom
feat/h4-bisect-cpu-forward-2
May 10, 2026
Merged

test(aprender-train): H4 CPU forward bisect — CUDA path is the residual root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004)#1602
noahgift merged 1 commit intomainfrom
feat/h4-bisect-cpu-forward-2

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

TL;DR

H4 LOCALIZED TO CUDA PATH. CPU aprender::Transformer::forward on populated Qwen 0.5B produces SENSIBLE logits (clean argmax=9370, peak-to-mean=5.68). The bug is in CUDA upload or GPU kernels — not in populate, CPU forward, or tied-embedding fall-through.

Empirical bisection result

CPU forward on populated Qwen 0.5B (fresh APR, BF16-correct):

populated: 290/290 tensors
logits: n=151936 nan=0 inf=0
        min=-15.03 max=11.72 mean=-3.33 std=2.65
        peak-to-mean = 5.68
        argmax = 9370 (specific, not flat)

CUDA eval_batch on same weights: val_loss > ln(vocab) (sub-random).

Same weights, same arch, different backend → CUDA path is the bug.

H4 component status

Component Pre-this-PR Post-this-PR
BF16 dtype tag OPEN FIXED #1 (PR #1601)
Populate (290/290) OPEN FALSIFIED — works ✓
CPU forward OPEN FALSIFIED — works ✓
Tied embedding fall-through OPEN FALSIFIED — works ✓
CUDA path OPEN CONFIRMED LIVE BUG

Three CUDA-side sub-hypotheses (next-cycle work)

  • H4D.1: CudaTransformerTrainer::with_model upload distorts weights during H2D
  • H4D.2: gpu_forward CUDA kernels (cuBLAS GEMM / RoPE / RMSNorm / fused attention) produce wrong outputs
  • H4D.3: fused_cross_entropy_cuda reads wrong buffer location (off-by-stride in logits_buf)

Each is testable via CPU↔CUDA forward parity on populated Qwen.

Five-Whys

  1. Why val_loss=18.55 > ln(vocab)=17.21? CUDA forward produces sub-random logits despite CPU forward producing sensible ones on same weights.
  2. Why CUDA differs from CPU? Bug in upload, kernels, or xent buffer.
  3. Why didn't falsifiers catch this? CUDA path was validated by convergence on synthetic data + from-scratch — both blind to forward-pass parity vs CPU.
  4. Why ship CPU bisect, not CUDA fix? Pinpointing at the backend boundary is cheapest narrowing. Without this, next agent re-derives.
  5. Why does this matter? Next falsifier cascade has a tight scope (3 sub-hypotheses, all CUDA-specific).

Test plan

  • cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASS
  • rustfmt --check: clean
  • cargo clippy -p aprender-train --lib -- -D warnings: clean
  • LIVE on RTX 4090 with fresh Qwen APR: peak-to-mean=5.68, argmax=9370

SHIP-TWO impact

  • MODEL-1 ship %: unchanged at 91%
  • MODEL-2 ship %: unchanged at 57% — but H4 is now FULLY LOCALIZED to CUDA path. CPU is provably correct. Next-cycle bisection has a tight scope.

Out-of-scope follow-ups

PMAT-CODE-PRETRAIN-CUDA-FORWARD-001:

  • CPU↔CUDA forward parity falsifier on populated Qwen
  • Bisect H4D.1 (upload), H4D.2 (kernels), H4D.3 (xent buffer)
  • Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Files

  • crates/aprender-train/src/train/pretrain_real.rs (+110, falsify_h4_cpu_forward_qwen_logits_sensible)

🤖 Generated with Claude Code

…al root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004)

H4 cascade bisection: BUG IS IN CUDA PATH.

EMPIRICAL FINDING

CPU `aprender::Transformer::forward` on a populated Qwen 0.5B
model (fresh APR, BF16-correct dtype) produces SENSIBLE logits:

  populated: 290/290 tensors
  logits: n=151936 nan=0 inf=0
          min=-15.03 max=11.72 mean=-3.33 std=2.65
          peak-to-mean ratio = 5.68
          argmax = 9370 (specific token, not flat)

This means:
  - Populate path: GREEN (all 290 Qwen tensors loaded)
  - CPU forward: GREEN (clean logits, sensible distribution)
  - lm_head tied-embedding fall-through: GREEN (matmul produces
    proper logit distribution despite lm_head=None)

H4 ROOT CAUSE LOCALIZATION (post this PR):

| Component | Pre-this-PR | Post-this-PR |
|-----------|-------------|--------------|
| BF16 dtype tag | OPEN | FIXED #1 (PR #1601) |
| Populate (290/290) | OPEN | FALSIFIED — works ✓ |
| CPU forward | OPEN | FALSIFIED — works ✓ |
| Tied embedding | OPEN | FALSIFIED — works ✓ |
| **CUDA path** | OPEN | **CONFIRMED LIVE BUG** |

Empirical contrast:
  CPU forward: argmax=9370 with confident peak (peak-to-mean=5.68)
  CUDA eval_batch: val_loss > ln(vocab) = sub-random predictions

Same weights, same arch, different backend → CUDA forward path
distorts the result. Three CUDA-side sub-hypotheses for the next
session:
  H4D.1 — `CudaTransformerTrainer::with_model` upload distorts
          weights during H2D transfer
  H4D.2 — `gpu_forward` CUDA kernels (cuBLAS GEMM, RoPE, fused
          attention, RMSNorm) produce wrong outputs despite correct
          inputs
  H4D.3 — `fused_cross_entropy_cuda` reads from a wrong buffer
          location (off-by-stride in logits_buf)

Five-Whys

1. Why does val_loss=18.55 > ln(vocab)=17.21 with fresh APR?
   Because the CUDA forward path produces sub-random logits even
   though CPU forward on the same weights produces sensible ones.
2. Why does CUDA differ from CPU? Because the bug is in one of:
   GPU upload, GPU kernels, or eval_batch's cross_entropy buffer
   handling. CPU path is end-to-end clean.
3. Why didn't existing falsifiers catch this? Per `feedback_test_methodology_can_fake_bugs.md`,
   the CUDA path was validated by convergence on synthetic data
   (§44/§45) and from-scratch (§50.4 cascade) — both blind to
   forward-pass parity vs CPU reference.
4. Why ship the CPU bisect instead of fixing CUDA directly?
   Because pinpointing the bug at the BACKEND boundary (CPU vs
   CUDA) is the cheapest narrowing. Without this, the next agent
   would have to re-derive that the CPU side works.
5. Why does this matter for ship %? With H4 narrowed to CUDA,
   the next falsifier-discharge cascade (PMAT-CODE-PRETRAIN-CUDA-FORWARD-001)
   has a clear scope: CPU↔CUDA forward parity test, dump per-layer
   hidden states, identify divergence point.

What this PR ships

`falsify_h4_cpu_forward_qwen_logits_sensible` — host-gated test
that loads Qwen 0.5B (fresh APR preferred), populates a polymorphic
Transformer, forward-passes a single token, and asserts:
  - logits are finite (no NaN/Inf)
  - logits std > 0.01 (not constant)
  - peak-to-mean > 1.5 (not uniform)
  - argmax in [0, vocab_size) (proper shape)

Empirical run: PASSES on RTX 4090 host with fresh APR.

Quality gates

- cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASS
- rustfmt --check: clean
- cargo clippy -p aprender-train --lib -- -D warnings: clean

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57% — but H4 is now FULLY LOCALIZED
  to the CUDA path. The CPU path is provably correct. Next-cycle
  bisection has a tight scope (3 sub-hypotheses, all CUDA-specific).
- This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-004 (task #23)

Out-of-scope follow-ups

PMAT-CODE-PRETRAIN-CUDA-FORWARD-001:
  - Author CPU↔CUDA forward parity falsifier on populated Qwen
  - Bisect H4D.1 (upload), H4D.2 (kernels), H4D.3 (xent buffer)
  - Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 10, 2026 08:03
@noahgift noahgift merged commit 86ad83b into main May 10, 2026
11 checks passed
@noahgift noahgift deleted the feat/h4-bisect-cpu-forward-2 branch May 10, 2026 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Decision Tree & Random Forest for Classification Tasks

1 participant