fix(aprender-train): include theta in CUDA RoPE PTX cache key by noahgift · Pull Request #1607 · paiml/aprender

noahgift · 2026-05-10T10:06:31Z

Summary

Cascade follow-up to PR #1604 (Q/K/V bias dispatch) and PR #1606 (RMSNorm eps cache key). Same defect class as #1606: a kernel parameter baked into PTX at emit-time was omitted from the cache key.

Five-Whys

1. Why could CUDA RoPE produce wrong outputs across model loads? A PTX cache key collision.
2. Why? RopeNeoxKernel, BatchedRopeKernel, and BatchedRopeBackwardKernel capture self.theta into build_ptx (as a mov.f32 immediate), so the emitted PTX is theta-specific.
3. Why does that matter? Cache keys had the form batched_rope_fwd_{num_heads}_{head_dim}, with theta omitted.
4. Why is that bad? If Llama (theta=10000) is loaded before Qwen (theta=1000000) in the same process, the second call hits the first model's cached PTX and uses the wrong frequency base.
5. Why is this not catastrophic for SHIP-TWO-001? Qwen-only workflows are self-consistent, so this is a hygiene fix.
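
The collision described above can be reproduced with a toy cache. This is only a sketch: `ModuleCache`, `get_or_compile`, and the shape values 14 and 64 are hypothetical stand-ins, and the cached value is just the theta that would have been baked into the PTX, not a real module.

```rust
use std::collections::HashMap;

/// Toy stand-in for a PTX module cache. The stored value records which
/// theta was baked into the "compiled" module. All names are hypothetical.
struct ModuleCache {
    modules: HashMap<String, f32>,
}

impl ModuleCache {
    fn new() -> Self {
        Self { modules: HashMap::new() }
    }

    /// Returns the theta baked into the module stored under `key`,
    /// "compiling" with `theta` only on a cache miss.
    fn get_or_compile(&mut self, key: &str, theta: f32) -> f32 {
        *self.modules.entry(key.to_string()).or_insert(theta)
    }
}

/// Pre-fix key shape quoted in the PR text: theta is not part of the key.
fn key_without_theta(num_heads: u32, head_dim: u32) -> String {
    format!("batched_rope_fwd_{num_heads}_{head_dim}")
}

/// Loads "Llama" and then "Qwen" with the same shape and returns the theta
/// that the second (Qwen) call actually receives from the cache.
fn theta_seen_by_second_model() -> f32 {
    let mut cache = ModuleCache::new();
    let key = key_without_theta(14, 64);
    let _llama = cache.get_or_compile(&key, 10_000.0);
    cache.get_or_compile(&key, 1_000_000.0)
}

fn main() {
    // The second call asked for theta = 1e6 but is served the 1e4 module.
    println!("second model runs with theta = {}", theta_seen_by_second_model());
}
```

Because the key carries only shape information, the second lookup is a hit and the requested theta is never consulted.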

Provable Contract

New: contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml (ACTIVE_ALGORITHM_LEVEL).

Two ship-blocking falsifiers:

FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-001: distinct thetas produce distinct outputs (>1e-3 max-abs).
FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-002: source audit — every cache key includes _th{theta_bits:08x}.
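
As an illustration of what the second falsifier asserts, the check below tests whether a key string ends in `_th` followed by eight lowercase hex digits. It is a hypothetical runtime check on key strings; the contract's actual audit inspects source, and `has_theta_suffix` is not a name from the repository.

```rust
/// Returns true when `key` ends with `_th` followed by exactly eight
/// lowercase hex digits, the suffix shape named in the contract.
/// Hypothetical helper for illustration only.
fn has_theta_suffix(key: &str) -> bool {
    match key.rfind("_th") {
        Some(pos) => {
            let hex = &key[pos + 3..];
            hex.len() == 8
                && hex
                    .chars()
                    .all(|c| c.is_ascii_digit() || ('a'..='f').contains(&c))
        }
        None => false,
    }
}

fn main() {
    // A post-fix style key passes; a pre-fix style key does not.
    for key in ["rope_neox_fwd_14_64_th461c4000", "batched_rope_fwd_14_64"] {
        println!("{key}: {}", has_theta_suffix(key));
    }
}
```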

Implementation

Cache keys now include theta_bits at all 3 RoPE wrappers + the pre-warm:

rope_neox_fwd_{nh}_{hd}_th{theta_bits:08x}
batched_rope_fwd_{nh}_{hd}_{seq_len}_th{theta_bits:08x}
batched_rope_bwd_{nh}_{hd}_{seq_len}_th{theta_bits:08x}
pre-warm in cache.rs aligned with runtime
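
A minimal sketch of how keys of this shape could be built, assuming `theta_bits` is the raw IEEE-754 bit pattern obtained with `f32::to_bits`. The function names are hypothetical and the shape values are illustrative; only the key formats come from the list above.

```rust
/// Formats the theta suffix from the raw IEEE-754 bits of `theta`, so two
/// thetas share a key only when they are bit-identical.
fn theta_suffix(theta: f32) -> String {
    format!("th{:08x}", theta.to_bits())
}

/// Key for the single-sequence NeoX forward wrapper (hypothetical helper).
fn rope_neox_fwd_key(nh: u32, hd: u32, theta: f32) -> String {
    format!("rope_neox_fwd_{nh}_{hd}_{}", theta_suffix(theta))
}

/// Key for the batched forward wrapper (hypothetical helper).
fn batched_rope_fwd_key(nh: u32, hd: u32, seq_len: u32, theta: f32) -> String {
    format!("batched_rope_fwd_{nh}_{hd}_{seq_len}_{}", theta_suffix(theta))
}

/// Key for the batched backward wrapper (hypothetical helper).
fn batched_rope_bwd_key(nh: u32, hd: u32, seq_len: u32, theta: f32) -> String {
    format!("batched_rope_bwd_{nh}_{hd}_{seq_len}_{}", theta_suffix(theta))
}

fn main() {
    // Same shape, different theta: the keys no longer collide.
    println!("{}", rope_neox_fwd_key(14, 64, 10_000.0));
    println!("{}", batched_rope_fwd_key(14, 64, 512, 10_000.0));
    println!("{}", batched_rope_fwd_key(14, 64, 512, 1_000_000.0));
    println!("{}", batched_rope_bwd_key(14, 64, 512, 1_000_000.0));
}
```

Keying on the bit pattern rather than a decimal rendering avoids any dependence on float formatting, and a pre-warm path that calls the same helper as the runtime path produces matching keys by construction.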

Test Plan

Falsifier falsify_cuda_rope_theta_cache_key_distinct_thetas_yield_distinct_outputs GREEN locally (lambda-vector RTX 4090) — Llama theta vs Qwen theta produce distinct outputs as expected
pv validate contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml — 0 errors
cargo fmt --all -- --check — clean
cargo check -p aprender-train --features cuda — 0 errors

Stacking

Builds on fix/cuda-rmsnorm-eps-parity (PR #1606). After both PRs merge, this branch rebases cleanly.

Ship-% Movement

NONE — this is a latent-bug hygiene fix. Qwen-only training was already self-consistent at theta=1e6. Guards against multi-model test contamination and future Llama variants.

🤖 Generated with Claude Code

Cross-pollination spec evaluating helix-db patterns for adoption in aprender. Nine candidates (HELIX-IDEA-001..009) covering persistent HNSW, inventory-based MCP handler registration, compile-time DSL macro pattern, multi-target deployment, hybrid retrieval (BM25 + dense), reranking pipeline (RRF/MMR/cross-encoder), snapshot/backup, schema migration macro, and constant-time API-key auth for apr serve. Each proposal scoped with effort, target crate, non-goals, open questions, and acceptance signals. Section 1.3 grounds the spec in verified facts about aprender's current state; section 6 logs one falsified-and-corrected claim from the initial draft (MCP handler discovery is hardcoded, not contracts-mediated). Section 3 enumerates rejected candidates (LMDB swap, HelixQL the language, embedding-provider abstraction, browser dashboard, vendor-specific metrics) with explicit reasoning. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-CODE-PRETRAIN-CUDA-FORWARD-001)

H4D root-cause discharge for SHIP-TWO-001 §61. Pre-fix `apr pretrain --device cuda` on populated Qwen 0.5B produced val_loss=18.55 at step 1 (*above* `ln(vocab)=17.21`), i.e. the model was anti-aligned vs uniform. PR #1602 had narrowed: CPU forward on the SAME populated weights produces sensible logits (peak-to-mean=5.68, argmax=9370). The bug lives strictly on the CUDA side.

Five-Whys:
1. Why val_loss > ln(vocab)? Logits anti-aligned with held-out tokens.
2. Why anti-aligned? Attention scores miss the bias offset post-projection.
3. Why is the offset missing? `CudaTransformerBlock::forward` calls `gemm_forward(norm1_out, w_q, q)` with no bias-add (lines 719-747).
4. Why no bias-add? `CudaTransformerBlock` struct has NO `b_q`/`b_k`/`b_v` fields (lines 103-135): the Llama-only design (use_bias=false) leaked into the upload + forward path.
5. Why was this not caught earlier? The CPU `Transformer::forward` (attention.rs:388-395) DOES honor `Option<Tensor>` biases; populate step 5f.4 stores them on the CPU model; `with_model` D2H→H2D copy silently drops the optional fields when re-uploading to the GPU.

Fix:
- Add `b_q_replicated`/`b_k_replicated`/`b_v_replicated: Option<GpuBuffer<f32>>` to `CudaTransformerBlock` (replicated across `max_seq_len` rows so `cuda_add_inplace` performs broadcast).
- Extend `CudaTransformerBlock::new` signature with three `Option<&[f32]>` bias args; skip allocation when None (Llama path unchanged, regression-free).
- Apply `cuda_add_inplace(&mut q_buf, b_q_replicated, seq_len*q_dim, stream)` immediately after each Q/K/V `gemm_forward` when `b_*.is_some()`.
- Thread biases through `CudaTransformerTrainer::with_model` in `cuda_trainer.rs::upload_blocks` (fp32 path extracts `layer.self_attn.b_q.as_ref().map(...)` → `CudaTransformerBlock::new`).
- Pass `None, None, None` at the two legacy callsites (`finetune/classify_pipeline/gpu.rs`, `finetune/instruct_pipeline/cuda_init.rs`) to preserve the existing-pipeline contract.

Provable contract: `contracts/apr-pretrain-cuda-forward-parity-v1.yaml` (NEW). Three falsifiers (FALSIFY-CUDA-FORWARD-PARITY-001/002/003), all ship-blocking.

RED-then-GREEN proven empirically:
- RED (pre-fix on main): val_loss=13.50 > 0.7×ln(vocab)=8.35 → FALSIFIED
- GREEN (this PR): val_loss=0.0 on synthetic batch → DISCHARGED

Live evidence on real Python corpus / lambda-vector RTX 4090:
- Pre-fix: val_loss=18.55 (sub-random, anti-aligned)
- Post-fix: val_loss=17.22 (uniform-over-vocab regime)

The remaining gap (uniform → converged 1.5–3.0) is a separate cascade not in this PR's scope; this PR discharges the H4D dispatch defect only.

Falsifier test: `falsify_cuda_forward_parity_qwen_val_loss_below_ln_vocab` in `pretrain_real_cuda.rs::tests`. Host-gated on `/mnt/nvme-raid0/models/qwen2.5-coder-0.5b-fresh.apr` (auto-skips elsewhere). Locally GREEN: 1 passed; 0 failed.

Regression: `cargo test -p aprender-train --lib --features cuda`: 7681/7681 PASS pre/post.

Refs:
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (NEW)
- contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.8.0 (POPULATE-COVERAGE-001)
- evidence/section-61-5g-1-re-encode-2026-05-10/README.md
- crates/aprender-core/src/transformer/attention.rs:388-395 (CPU side honors biases)

Closes PMAT-CODE-PRETRAIN-CUDA-FORWARD-001 H4D bisect (struct + dispatch gap). Follow-up cascade for residual uniform→converged divergence (RoPE? attn softmax? FFN?) gets its own ticket.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
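
The replicate-then-add idea in the Fix list can be sketched on the CPU. This is an illustration only: `replicate_bias`, `add_inplace`, and `apply_optional_bias` are hypothetical helpers operating on host slices, not the `GpuBuffer` or `cuda_add_inplace` API from the repository.

```rust
/// Tiles `bias` (one row) across `rows` rows so that a flat element-wise
/// add behaves like a row-broadcast bias add. Hypothetical helper.
fn replicate_bias(bias: &[f32], rows: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(bias.len() * rows);
    for _ in 0..rows {
        out.extend_from_slice(bias);
    }
    out
}

/// CPU stand-in for an in-place element-wise add over the first `n` elements.
fn add_inplace(dst: &mut [f32], src: &[f32], n: usize) {
    for (d, s) in dst[..n].iter_mut().zip(&src[..n]) {
        *d += *s;
    }
}

/// Applies an optional replicated bias to a `seq_len x dim` projection.
/// `None` (the no-bias, Llama-style case) leaves the buffer untouched.
fn apply_optional_bias(q: &mut [f32], bias: Option<&[f32]>, seq_len: usize, dim: usize) {
    if let Some(b) = bias {
        add_inplace(q, b, seq_len * dim);
    }
}

fn main() {
    let max_seq_len = 4;
    let (seq_len, dim) = (3, 2);
    let replicated = replicate_bias(&[1.0, 2.0], max_seq_len);
    let mut q = vec![0.5_f32; seq_len * dim];
    apply_optional_bias(&mut q, Some(replicated.as_slice()), seq_len, dim);
    println!("{q:?}");
}
```

Replicating to `max_seq_len` rows once at upload time means any shorter `seq_len` can reuse the same buffer by adding only the first `seq_len * dim` elements.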

…ODE-CUDA-FORWARD-RESIDUAL-001)

Cascade follow-up to PMAT-CODE-PRETRAIN-CUDA-FORWARD-001 (PR #1604). After landing the H4D Q/K/V-bias dispatch fix, val_loss moved 18.55 → 17.22 on populated Qwen2.5-Coder-0.5B but stayed above ln(vocab) = 11.93: the model was producing sub-uniform predictions. Bisection target: next stage of layer-0 forward where CPU and CUDA disagree.

Five-Whys:
1. Why val_loss > ln(vocab) post-bias-fix? CUDA forward still drifts from CPU on populated weights.
2. Why drift? Some per-layer numerical operation produces different results on CPU vs CUDA.
3. Why? Inspect each layer-0 stage. RMSNorm is the very first; check it.
4. Why might RMSNorm differ? `aprender-train::rms_norm_forward` constructs `BatchedVectorizedRmsNormKernel::new(hidden_size, batch_size)` without `.with_epsilon(eps)`.
5. Why is that wrong? trueno-gpu's `BatchedVectorizedRmsNormKernel::new` hardcodes `epsilon: 1e-5` (the Llama default). Qwen2 / Qwen2.5 specify `rms_norm_eps: 1e-6` per HF config.json (and per `TransformerConfig::qwen2_0_5b()` in `config.rs:178`). The CPU `RMSNorm::new(hidden_size, eps)` honours the config; the CUDA path silently substitutes 1e-5. With ~4e-4 mean_sq on Qwen post-embedding hidden states, the 9e-6 eps gap contributes ~2.25% relative drift to the rsqrt denominator on every call, in every layer, across all 49 RMSNorms per forward pass.

Fix:
- Add `rms_norm_forward_with_eps(.., eps: f32, ..)` (eps-aware variant) to `cuda_forward/normalization.rs`. Constructs the kernel via `BatchedVectorizedRmsNormKernel::new(...).with_epsilon(eps)` and includes `eps_bits` in the PTX cache key (different eps → different cached module; without this, a stale 1e-5 module would silently shadow the new 1e-6 compilation).
- Keep legacy `rms_norm_forward` as a thin wrapper that calls `..._with_eps(.., 1e-5, ..)` for backwards compatibility (Llama default), so non-production callsites stay unaffected.
- Switch all 4 production callsites to the new variant:
  * `cuda_block.rs::CudaTransformerBlock::forward` (pre-attn norm, line 761)
  * `cuda_block.rs::CudaTransformerBlock::forward` (post-attn norm, line 842)
  * `cuda_block.rs::CudaNf4TransformerBlock::forward` (inference path pre-attn norm, line 3111)
  * `cuda_trainer.rs::CudaTransformerTrainer::eval_batch` (final RMSNorm before lm_head, line 1208)
  Each passes `self.config.rms_norm_eps` (or `self.config.model_config.rms_norm_eps` for the trainer).

Provable contract: `contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml` (NEW, ACTIVE_ALGORITHM_LEVEL). Three ship-blocking falsifiers:
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-001: pointwise CPU↔CUDA parity within 1e-4 abs at Qwen eps=1e-6 on Qwen-magnitude inputs.
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-002: signature exposes `eps: f32` and threads via `.with_epsilon(eps)`.
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-003: every production callsite passes `config.rms_norm_eps` rather than relying on the legacy default.

Falsifier test: `falsify_cuda_rmsnorm_eps_parity_qwen_1e_minus_6` (in `cuda_forward/normalization.rs::tests`). Synthetic 4×896 batch with Qwen-magnitude activations (std~0.02) and unit-perturbed gamma; asserts `max(|y_cpu - y_gpu|) < 1e-4` at `eps=1e-6`.

Empirical RED→GREEN: GREEN locally on lambda-vector RTX 4090, with max abs diff well within bound. Pre-fix the legacy `rms_norm_forward` (eps=1e-5) cannot meet a 1e-6-reference bound by construction; this contract documents the divergence quantitatively.

Regression: full `cargo test -p aprender-train --features cuda --lib --release` exits success (modulo the known transient `workspace-test trueno SIGSEGV-on-cleanup` flake and 2 pre-existing `should_panic` mismatches in `autograd::ops::matmul::tests`, neither caused by this change).

Refs:
- contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml (NEW)
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (parent, PR #1604)
- crates/aprender-train/src/transformer/config.rs:178 (Qwen2 eps=1e-6)
- ../trueno/trueno-gpu/src/kernels/layernorm/batched.rs:30 (BatchedVectorizedRmsNormKernel hardcodes 1e-5)

Closes one residual contributor in the uniform→converged cascade (task #25 PMAT-CODE-CUDA-FORWARD-RESIDUAL-001). Live val_loss check on populated Qwen 0.5B + 5g.1-v2 corpus deferred to a follow-up evidence run after this PR + #1604 both merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
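
The eps arithmetic in the fifth why can be checked with a small CPU reference. This is a sketch assuming the usual RMSNorm form y_i = x_i * gamma_i / sqrt(mean(x^2) + eps); the function names are hypothetical and the input is a constant 0.02 vector chosen so that mean_sq is about 4e-4, matching the figure quoted above.

```rust
/// CPU reference RMSNorm: y_i = x_i * gamma_i / sqrt(mean(x^2) + eps).
fn rms_norm(x: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gamma).map(|(v, g)| v * inv_rms * g).collect()
}

/// Change in the radicand (mean_sq + eps) caused by swapping eps values,
/// expressed relative to mean_sq.
fn radicand_gap(mean_sq: f64, eps_a: f64, eps_b: f64) -> f64 {
    (eps_a - eps_b).abs() / mean_sq
}

/// Largest element-wise absolute difference between two equal-length slices.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(p, q)| (p - q).abs()).fold(0.0, f32::max)
}

fn main() {
    // One 896-wide row of Qwen-magnitude activations (illustrative input).
    let x = vec![0.02_f32; 896];
    let gamma = vec![1.0_f32; 896];
    let qwen = rms_norm(&x, &gamma, 1e-6);
    let legacy = rms_norm(&x, &gamma, 1e-5);
    println!("radicand gap: {:.4}", radicand_gap(4e-4, 1e-5, 1e-6));
    println!("max abs diff: {:e}", max_abs_diff(&qwen, &legacy));
}
```

On inputs of this magnitude the two eps values give outputs that differ by far more than the 1e-4 parity bound, which is why a legacy 1e-5 kernel cannot pass a test written against a 1e-6 reference.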

…ODE-CUDA-FORWARD-RESIDUAL-002)

Cascade follow-up to PR #1604 (Q/K/V bias dispatch) and PR #1606 (RMSNorm eps cache key). Same defect class as #1606: a kernel parameter that is BAKED INTO PTX at emit-time was OMITTED from the PTX cache key.

Five-Whys:
1. Why might CUDA RoPE produce wrong outputs across model loads? Cache key collision.
2. Why? `RopeNeoxKernel`, `BatchedRopeKernel`, and `BatchedRopeBackwardKernel` capture `self.theta` into the `build_ptx` closure (`mov.f32 imm`). PTX is theta-specific.
3. Why does that matter at the cache layer? Cache keys were `batched_rope_fwd_{num_heads}_{head_dim}`, with theta omitted.
4. Why is that bad? In any process that loads two models with different `rope_theta` (e.g., Llama theta=10000 followed by Qwen theta=1000000), the second call hits the FIRST model's cached PTX and silently uses the wrong frequency base.
5. Why isn't this catastrophic for SHIP-TWO-001 today? Qwen-only workflows are self-consistent (first Qwen call populates the cache with Qwen theta). It's a latent correctness defect and a hygiene fix; ships separately because the bug class is real.

Fix:
- `rope_neox_forward`: cache key `rope_neox_fwd_{num_heads}_{head_dim}_th{theta_bits:08x}`
- `batched_rope_neox_forward`: cache key `batched_rope_fwd_{num_heads}_{head_dim}_{seq_len}_th{theta_bits:08x}`
- `batched_rope_neox_backward`: cache key `batched_rope_bwd_{num_heads}_{head_dim}_{seq_len}_th{theta_bits:08x}`
- `pre_warm_backward_kernels_in_forward_cache`: pre-warm key aligned with runtime so the warm is not orphaned.

Provable contract: `contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml` (NEW, ACTIVE_ALGORITHM_LEVEL). Two ship-blocking falsifiers:
- FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-001: distinct theta values produce distinct outputs (>1e-3 max-abs diff).
- FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-002: source audit; every RoPE wrapper cache key + the pre-warm key includes `_th{theta_bits:08x}`.

Falsifier test: `falsify_cuda_rope_theta_cache_key_distinct_thetas_yield_distinct_outputs` (in `cuda_forward/normalization.rs::tests`). Calls `batched_rope_neox_forward` twice with the same shape but theta=10000 then theta=1000000; asserts max abs diff > 1e-3. GREEN locally on lambda-vector RTX 4090.
- Pre-fix RED: cache served the first PTX module to the second call, outputs byte-identical → assertion fails.
- Post-fix GREEN: distinct thetas resolve to distinct cache slots, outputs differ at expected magnitude.

Ship % movement: NONE (Qwen-only pretrain unaffected; this is a hygiene fix that prevents Llama→Qwen test contamination and guards future multi-model workflows).

Cascade momentum: 3rd falsifier in 24h on the same residual.

Refs:
- contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml (NEW)
- contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml (sibling, PR #1606)
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (parent, PR #1604)
- ../trueno/trueno-gpu/src/kernels/elementwise/rope/standard.rs:27 (theta baked into PTX via build_ptx closure)

Closes a defect class flagged during task #26 PMAT-CODE-CUDA-FORWARD-RESIDUAL-002 audit. The actual val_loss recheck on populated Qwen 0.5B + 5g.1-v2 corpus remains task #26's primary deliverable; deferred until #1604 + #1606 + this PR all merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
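
The shape of the first falsifier can be illustrated with a CPU reference. This sketch assumes the common NeoX half-split pairing (element i rotated with element i + head_dim/2, angle pos * theta^(-2i/head_dim)); the GPU kernel's exact layout may differ, and `rope_neox` and `max_abs_diff` are hypothetical names rather than the repository's `batched_rope_neox_forward`.

```rust
/// CPU reference for NeoX-style RoPE on one head at one position.
/// Element i is paired with element i + head_dim/2 and the pair is rotated
/// by the angle pos * theta^(-2i/head_dim). Assumed convention, see lead-in.
fn rope_neox(x: &[f32], pos: usize, theta: f32) -> Vec<f32> {
    let head_dim = x.len();
    let half = head_dim / 2;
    let mut out = vec![0.0_f32; head_dim];
    for i in 0..half {
        let freq = theta.powf(-2.0 * i as f32 / head_dim as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        out[i] = x[i] * cos - x[i + half] * sin;
        out[i + half] = x[i] * sin + x[i + half] * cos;
    }
    out
}

/// Largest element-wise absolute difference between two equal-length slices.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(p, q)| (p - q).abs()).fold(0.0, f32::max)
}

fn main() {
    // Same input and position, Llama-style vs Qwen-style theta.
    let x = vec![1.0_f32; 64];
    let llama = rope_neox(&x, 5, 10_000.0);
    let qwen = rope_neox(&x, 5, 1_000_000.0);
    println!("max abs diff = {}", max_abs_diff(&llama, &qwen));
}
```

A correct implementation gives clearly different outputs for the two thetas at any nonzero position, so byte-identical outputs from two calls are a strong signal that the second call was served a stale module.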

…ML generation gap (PMAT-CODE-SHIP-TWO-SECTION-61) (#1610)

Records the empirical findings from this session's LIVE-discharge cascade attempt off §60. Two-track outcome:

DIRECT PROMPT (SHIP-002): GREEN. `apr run /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr --prompt "def fib(n):" --max-tokens 128` produces clean fib() Python (`ast.parse` 0 syntax errors, 68 nodes, 1 FunctionDef "fib"). LIVE discharged via PR #1609 (`qwen2-e2e-verification-v1.yaml` v1.10.0 → v1.12.0).

CHATML PROMPT (SHIP-006/008): BLOCKED. Same canonical 7B teacher fails `apr qa golden_output` gate with "gibberish (fragment '\\ns\\ns' repeats 3+ times)" under ChatML wrapper `<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n`. Same model + same engine + different prompt format → different output regime.

The §60 closure proved per-layer FORWARD parity within Q4K tolerance (layer-3 ratio 1.245× ∈ [0.5, 2.0] on canonical 7B). It did NOT prove GENERATION parity under arbitrary prompt distributions. §61 separates these two invariants and surfaces the asymmetry as a NEW finding.

Five-Whys for the §61 amendment:
1. Why is §61 needed? §60 closed forward parity but SHIP-006/008 LIVE-discharge attempts failed empirically.
2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is binding criterion only at the activation-stats level; arg-max sampling under cumulative drift is not directly bounded.
3. Why does prompt format matter? Direct prompts ("def fib(n):") put model in high-confidence next-token regime where small drift doesn't flip arg-max. ChatML prompts (instruction-following, chain-of-thought initialization) put model in low-margin regime where drift CAN flip arg-max.
4. Why record this in spec rather than just fix? The bug is multi-PR scope (special-token handling vs cumulative drift bisection needed). PRED-61-A/B set up the next falsifiable diagnostic step.
5. Why now (durable spec rather than evidence-only)? Each day the spec doesn't reflect the §60 → §61 separation, future sessions may misinterpret §60 closure as full SHIP-007-class discharge.

§61.5 falsifiable predictions:
- PRED-61-A: GGUF + ChatML on canonical 7B → clean output? If GREEN, bug is APR-side in chat-template handling.
- PRED-61-B: APR + direct continuation prompt "What is 2+2? The answer is " (no ChatML wrapper) → clean output? If GREEN, bug is special-token handling NOT cumulative drift.

If both PRED-61-A and PRED-61-B are GREEN, the bug is bounded to "APR + ChatML special-token path": multi-PR scope but tractable.

Changes (1 file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.05.0 → v3.06.0; new banner summarizing §61 (one paragraph, 1 of 5 §17.5 PARTIALs LIVE, SHIP-002 evidence, SHIP-006/008 BLOCKED, PRED-61-A/B set up).
  - New §61 section above §58 (newest-first ordering): 7 sub-sections (61.1 separation table, 61.2 direct-prompt evidence, 61.3 ChatML-prompt evidence, 61.4 §60→§61 separation rationale, 61.5 falsifiable next investigation step, 61.6 ship-% movement, 61.7 what §61 is NOT).

Validation:
- Spec section format consistent with §58 (newest-first, dated, sub-sections numbered §61.X).
- All 6 cascade PRs from this session referenced explicitly (#1604, #1606, #1607, #1608, #1609, this PR).
- Ship-% movement quantified: MODEL-1 91% → 92% (1 of 5 PARTIALs).
- Methodological alignment: zero eprintln!, zero bash workarounds; all evidence captured via existing apr CLI primitives.

Refs:
- evidence/ship-002-discharge-2026-05-10/ (LIVE evidence directory)
- contracts/qwen2-e2e-verification-v1.yaml v1.12.0 (SHIP-002 DISCHARGED)
- contracts/apr-vs-gguf-forward-parity-v1.yaml v1.2.0 (parent PR #1608)
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_test_methodology_can_fake_bugs.md
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #29 PMAT-CODE-SHIP-TWO-SECTION-61.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift and others added 4 commits May 10, 2026 10:11

noahgift enabled auto-merge (squash) May 10, 2026 10:06

noahgift mentioned this pull request May 10, 2026

docs(spec): SHIP-TWO-001 §61 — post-§60 LIVE-discharge cascade + ChatML generation gap #1610

Merged

4 tasks

Merge branch 'main' into fix/cuda-rope-theta-cache-key

925e05d

noahgift wants to merge 5 commits into main from fix/cuda-rope-theta-cache-key
