fix(aprender-train): include theta in CUDA RoPE PTX cache key #1607
Open
Cross-pollination spec evaluating helix-db patterns for adoption in aprender. Nine candidates (HELIX-IDEA-001..009) covering persistent HNSW, inventory-based MCP handler registration, compile-time DSL macro pattern, multi-target deployment, hybrid retrieval (BM25 + dense), reranking pipeline (RRF/MMR/cross-encoder), snapshot/backup, schema migration macro, and constant-time API-key auth for apr serve. Each proposal scoped with effort, target crate, non-goals, open questions, and acceptance signals.

Section 1.3 grounds the spec in verified facts about aprender's current state; section 6 logs one falsified-and-corrected claim from the initial draft (MCP handler discovery is hardcoded, not contracts-mediated). Section 3 enumerates rejected candidates (LMDB swap, HelixQL the language, embedding-provider abstraction, browser dashboard, vendor-specific metrics) with explicit reasoning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-CODE-PRETRAIN-CUDA-FORWARD-001)

H4D root-cause discharge for SHIP-TWO-001 §61. Pre-fix `apr pretrain --device cuda` on populated Qwen 0.5B produced val_loss=18.55 at step 1 (*above* `ln(vocab)=17.21`), i.e. the model was anti-aligned vs uniform. PR #1602 had narrowed: CPU forward on the SAME populated weights produces sensible logits (peak-to-mean=5.68, argmax=9370). The bug lives strictly on the CUDA side.

Five-Whys:
1. Why val_loss > ln(vocab)? Logits anti-aligned with held-out tokens.
2. Why anti-aligned? Attention scores miss the bias offset post-projection.
3. Why is the offset missing? `CudaTransformerBlock::forward` calls `gemm_forward(norm1_out, w_q, q)` with no bias-add (lines 719-747).
4. Why no bias-add? `CudaTransformerBlock` struct has NO `b_q`/`b_k`/`b_v` fields (lines 103-135); the Llama-only design (use_bias=false) leaked into the upload + forward path.
5. Why was this not caught earlier? The CPU `Transformer::forward` (attention.rs:388-395) DOES honor `Option<Tensor>` biases; populate step 5f.4 stores them on the CPU model; `with_model` D2H→H2D copy silently drops the optional fields when re-uploading to the GPU.

Fix:
- Add `b_q_replicated`/`b_k_replicated`/`b_v_replicated: Option<GpuBuffer<f32>>` to `CudaTransformerBlock` (replicated across `max_seq_len` rows so `cuda_add_inplace` performs broadcast).
- Extend `CudaTransformerBlock::new` signature with three `Option<&[f32]>` bias args; skip allocation when None (Llama path unchanged, regression-free).
- Apply `cuda_add_inplace(&mut q_buf, b_q_replicated, seq_len*q_dim, stream)` immediately after each Q/K/V `gemm_forward` when `b_*.is_some()`.
- Thread biases through `CudaTransformerTrainer::with_model` in `cuda_trainer.rs::upload_blocks` (fp32 path extracts `layer.self_attn.b_q.as_ref().map(...)` → `CudaTransformerBlock::new`).
- Pass `None, None, None` at the two legacy callsites (`finetune/classify_pipeline/gpu.rs`, `finetune/instruct_pipeline/cuda_init.rs`) to preserve the existing-pipeline contract.

Provable contract: `contracts/apr-pretrain-cuda-forward-parity-v1.yaml` (NEW). Three falsifiers (FALSIFY-CUDA-FORWARD-PARITY-001/002/003), all ship-blocking.

RED-then-GREEN proven empirically:
- RED (pre-fix on main): val_loss=13.50 > 0.7×ln(vocab)=8.35 → FALSIFIED
- GREEN (this PR): val_loss=0.0 on synthetic batch → DISCHARGED

Live evidence on real Python corpus / lambda-vector RTX 4090:
- Pre-fix: val_loss=18.55 (sub-random, anti-aligned)
- Post-fix: val_loss=17.22 (uniform-over-vocab regime)

The remaining gap (uniform → converged 1.5–3.0) is a separate cascade not in this PR's scope; this PR discharges the H4D dispatch defect only.

Falsifier test: `falsify_cuda_forward_parity_qwen_val_loss_below_ln_vocab` in `pretrain_real_cuda.rs::tests`. Host-gated on `/mnt/nvme-raid0/models/qwen2.5-coder-0.5b-fresh.apr` (auto-skips elsewhere). Locally GREEN: 1 passed; 0 failed.

Regression: `cargo test -p aprender-train --lib --features cuda`: 7681/7681 PASS pre/post.

Refs:
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (NEW)
- contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.8.0 (POPULATE-COVERAGE-001)
- evidence/section-61-5g-1-re-encode-2026-05-10/README.md
- crates/aprender-core/src/transformer/attention.rs:388-395 (CPU side honors biases)

Closes PMAT-CODE-PRETRAIN-CUDA-FORWARD-001 H4D bisect (struct + dispatch gap). Follow-up cascade for residual uniform→converged divergence (RoPE? attn softmax? FFN?) gets its own ticket.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
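The replicate-then-add approach described in this commit can be illustrated on the CPU. The following is a minimal sketch only: `replicate_bias`, `add_inplace`, and `apply_optional_bias` are hypothetical stand-ins written for illustration, not the `CudaTransformerBlock` or `cuda_add_inplace` code, and the shapes are toy values.

```rust
/// Copy a per-feature bias `rows` times so that a flat elementwise add
/// over a row-major [rows, dim] buffer behaves like a broadcast add.
fn replicate_bias(bias: &[f32], rows: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(bias.len() * rows);
    for _ in 0..rows {
        out.extend_from_slice(bias);
    }
    out
}

/// CPU stand-in for an in-place elementwise add over the first `n` elements.
fn add_inplace(dst: &mut [f32], src: &[f32], n: usize) {
    for (d, s) in dst[..n].iter_mut().zip(&src[..n]) {
        *d += *s;
    }
}

/// Add the bias only when the model has one (hypothetical helper).
/// `proj` is the row-major [seq_len, dim] output of the projection GEMM.
fn apply_optional_bias(proj: &mut [f32], bias: Option<&[f32]>, seq_len: usize, dim: usize) {
    if let Some(b) = bias {
        add_inplace(proj, b, seq_len * dim);
    }
}

fn main() {
    let (max_seq_len, seq_len, dim) = (4, 2, 3);
    let bias = [0.5_f32, -1.0, 2.0];
    // Replicate once for the longest sequence; shorter sequences use a prefix.
    let replicated = replicate_bias(&bias, max_seq_len);

    // Model with Q/K/V biases: every row receives the bias.
    let mut q = vec![1.0_f32; seq_len * dim];
    apply_optional_bias(&mut q, Some(replicated.as_slice()), seq_len, dim);
    println!("with bias:    {:?}", q);

    // Model without biases: the buffer is left untouched.
    let mut q_no_bias = vec![1.0_f32; seq_len * dim];
    apply_optional_bias(&mut q_no_bias, None, seq_len, dim);
    println!("without bias: {:?}", q_no_bias);
}
```

Replicating up to `max_seq_len` trades a little memory for the ability to reuse a plain elementwise add, which is the trade-off the Fix list describes.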
…ODE-CUDA-FORWARD-RESIDUAL-001)

Cascade follow-up to PMAT-CODE-PRETRAIN-CUDA-FORWARD-001 (PR #1604). After landing the H4D Q/K/V-bias dispatch fix, val_loss moved 18.55 → 17.22 on populated Qwen2.5-Coder-0.5B but stayed above ln(vocab) = 11.93; the model was producing sub-uniform predictions. Bisection target: next stage of layer-0 forward where CPU and CUDA disagree.

Five-Whys:
1. Why val_loss > ln(vocab) post-bias-fix? CUDA forward still drifts from CPU on populated weights.
2. Why drift? Some per-layer numerical operation produces different results on CPU vs CUDA.
3. Why? Inspect each layer-0 stage. RMSNorm is the very first; check it.
4. Why might RMSNorm differ? `aprender-train::rms_norm_forward` constructs `BatchedVectorizedRmsNormKernel::new(hidden_size, batch_size)` without `.with_epsilon(eps)`.
5. Why is that wrong? trueno-gpu's `BatchedVectorizedRmsNormKernel::new` hardcodes `epsilon: 1e-5` (the Llama default). Qwen2 / Qwen2.5 specify `rms_norm_eps: 1e-6` per HF config.json (and per `TransformerConfig::qwen2_0_5b()` in `config.rs:178`). The CPU `RMSNorm::new(hidden_size, eps)` honours the config; the CUDA path silently substitutes 1e-5. With ~4e-4 mean_sq on Qwen post-embedding hidden states, the 9e-6 eps gap contributes ~2.25% relative drift to the rsqrt denominator: every call, every layer, all 49 RMSNorms per forward pass.

Fix:
- Add `rms_norm_forward_with_eps(.., eps: f32, ..)` (eps-aware variant) to `cuda_forward/normalization.rs`. Constructs the kernel via `BatchedVectorizedRmsNormKernel::new(...).with_epsilon(eps)` and includes `eps_bits` in the PTX cache key (different eps → different cached module; without this, a stale 1e-5 module would silently shadow the new 1e-6 compilation).
- Keep legacy `rms_norm_forward` as a thin wrapper that calls `..._with_eps(.., 1e-5, ..)` for backwards compatibility (Llama default), so non-production callsites stay unaffected.
- Switch all 4 production callsites to the new variant:
  * `cuda_block.rs::CudaTransformerBlock::forward` (pre-attn norm, line 761)
  * `cuda_block.rs::CudaTransformerBlock::forward` (post-attn norm, line 842)
  * `cuda_block.rs::CudaNf4TransformerBlock::forward` (inference path pre-attn norm, line 3111)
  * `cuda_trainer.rs::CudaTransformerTrainer::eval_batch` (final RMSNorm before lm_head, line 1208)

  Each passes `self.config.rms_norm_eps` (or `self.config.model_config.rms_norm_eps` for the trainer).

Provable contract: `contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml` (NEW, ACTIVE_ALGORITHM_LEVEL). Three ship-blocking falsifiers:
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-001: pointwise CPU↔CUDA parity within 1e-4 abs at Qwen eps=1e-6 on Qwen-magnitude inputs.
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-002: signature exposes `eps: f32` and threads via `.with_epsilon(eps)`.
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-003: every production callsite passes `config.rms_norm_eps` rather than relying on the legacy default.

Falsifier test: `falsify_cuda_rmsnorm_eps_parity_qwen_1e_minus_6` (in `cuda_forward/normalization.rs::tests`). Synthetic 4×896 batch with Qwen-magnitude activations (std~0.02) and unit-perturbed gamma; asserts `max(|y_cpu - y_gpu|) < 1e-4` at `eps=1e-6`.

Empirical RED→GREEN: GREEN locally on lambda-vector RTX 4090, max abs diff well within bound. Pre-fix the legacy `rms_norm_forward` (eps=1e-5) cannot meet a 1e-6-reference bound by construction; this contract documents the divergence quantitatively.

Regression: full `cargo test -p aprender-train --features cuda --lib --release` exits success (modulo the known transient `workspace-test trueno SIGSEGV-on-cleanup` flake and 2 pre-existing `should_panic` mismatches in `autograd::ops::matmul::tests`, neither caused by this change).

Refs:
- contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml (NEW)
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (parent, PR #1604)
- crates/aprender-train/src/transformer/config.rs:178 (Qwen2 eps=1e-6)
- ../trueno/trueno-gpu/src/kernels/layernorm/batched.rs:30 (BatchedVectorizedRmsNormKernel hardcodes 1e-5)

Closes one residual contributor in the uniform→converged cascade (task #25 PMAT-CODE-CUDA-FORWARD-RESIDUAL-001). Live val_loss check on populated Qwen 0.5B + 5g.1-v2 corpus deferred to a follow-up evidence run after this PR + #1604 both merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
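The size of the epsilon effect can be checked with a small CPU reference. This is a sketch under stated assumptions: `rms_norm` and `max_abs_diff` are hypothetical helpers, not the trueno-gpu kernel or the repository's falsifier test, and the input is a synthetic row with every element at magnitude 0.02 (so mean_sq is about 4e-4, the figure quoted above).

```rust
/// Reference RMSNorm over one row: y_i = x_i * gamma_i / sqrt(mean(x^2) + eps).
fn rms_norm(x: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gamma).map(|(v, g)| v * inv_rms * g).collect()
}

/// Largest absolute elementwise difference between two equally sized rows.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(p, q)| (p - q).abs()).fold(0.0, f32::max)
}

fn main() {
    let hidden = 896;
    // Synthetic Qwen-magnitude row: alternating +/-0.02.
    let x: Vec<f32> = (0..hidden)
        .map(|i| if i % 2 == 0 { 0.02 } else { -0.02 })
        .collect();
    let gamma = vec![1.0_f32; hidden];

    let y_config = rms_norm(&x, &gamma, 1e-6); // eps from the model config
    let y_legacy = rms_norm(&x, &gamma, 1e-5); // hardcoded legacy default

    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / hidden as f32;
    // Relative change of the term under the square root.
    let denom_shift = (mean_sq + 1e-5) / (mean_sq + 1e-6) - 1.0;

    println!("relative shift of (mean_sq + eps): {:.4}", denom_shift);
    println!("max |y_config - y_legacy|:         {:.5}", max_abs_diff(&y_config, &y_legacy));
}
```

On this input the term under the square root shifts by roughly two percent and the output by about half of that, far above the 1e-4 absolute bound, which is consistent with the statement that the legacy default cannot meet a 1e-6-reference bound by construction.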
…ODE-CUDA-FORWARD-RESIDUAL-002)

Cascade follow-up to PR #1604 (Q/K/V bias dispatch) and PR #1606 (RMSNorm eps cache key). Same defect class as #1606: a kernel parameter that is BAKED INTO PTX at emit-time was OMITTED from the PTX cache key.

Five-Whys:
1. Why might CUDA RoPE produce wrong outputs across model loads?
2. Why? `RopeNeoxKernel`, `BatchedRopeKernel`, and `BatchedRopeBackwardKernel` capture `self.theta` into the `build_ptx` closure (`mov.f32 imm`). PTX is theta-specific.
3. Why does that matter at the cache layer? Cache keys were `batched_rope_fwd_{num_heads}_{head_dim}`, with theta omitted.
4. Why is that bad? In any process that loads two models with different `rope_theta` (e.g., Llama theta=10000 followed by Qwen theta=1000000), the second call hits the FIRST model's cached PTX and silently uses the wrong frequency base.
5. Why isn't this catastrophic for SHIP-TWO-001 today? Qwen-only workflows are self-consistent (first Qwen call populates the cache with Qwen theta). It's a latent correctness defect and a hygiene fix; ships separately because the bug class is real.

Fix:
- `rope_neox_forward`: cache key `rope_neox_fwd_{num_heads}_{head_dim}_th{theta_bits:08x}`
- `batched_rope_neox_forward`: cache key `batched_rope_fwd_{num_heads}_{head_dim}_{seq_len}_th{theta_bits:08x}`
- `batched_rope_neox_backward`: cache key `batched_rope_bwd_{num_heads}_{head_dim}_{seq_len}_th{theta_bits:08x}`
- `pre_warm_backward_kernels_in_forward_cache`: pre-warm key aligned with runtime so the warm is not orphaned.

Provable contract: `contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml` (NEW, ACTIVE_ALGORITHM_LEVEL). Two ship-blocking falsifiers:
- FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-001: distinct theta values produce distinct outputs (>1e-3 max-abs diff).
- FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-002: source audit; every RoPE wrapper cache key + the pre-warm key includes `_th{theta_bits:08x}`.

Falsifier test: `falsify_cuda_rope_theta_cache_key_distinct_thetas_yield_distinct_outputs` (in `cuda_forward/normalization.rs::tests`). Calls `batched_rope_neox_forward` twice with the same shape but theta=10000 then theta=1000000; asserts max abs diff > 1e-3. GREEN locally on lambda-vector RTX 4090.
- Pre-fix RED: cache served the first PTX module to the second call, outputs byte-identical → assertion fails.
- Post-fix GREEN: distinct thetas resolve to distinct cache slots, outputs differ at expected magnitude.

Ship % movement: NONE (Qwen-only pretrain unaffected; this is a hygiene fix that prevents Llama→Qwen test contamination and guards future multi-model workflows). Cascade momentum: 3rd falsifier in 24h on the same residual.

Refs:
- contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml (NEW)
- contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml (sibling, PR #1606)
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (parent, PR #1604)
- ../trueno/trueno-gpu/src/kernels/elementwise/rope/standard.rs:27 (theta baked into PTX via build_ptx closure)

Closes a defect class flagged during task #26 PMAT-CODE-CUDA-FORWARD-RESIDUAL-002 audit. The actual val_loss recheck on populated Qwen 0.5B + 5g.1-v2 corpus remains task #26's primary deliverable; deferred until #1604 + #1606 + this PR all merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
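Why two theta values should differ by more than 1e-3 can be seen from a CPU reference of the rotation. The sketch below assumes the common NeoX pairing (element `i` rotated with element `i + d/2`, frequency `theta^(-2i/d)`); `rope_neox` is a hypothetical reference written for illustration, not the trueno-gpu kernel, and the head size and position are arbitrary.

```rust
/// NeoX-style RoPE on one head vector at position `pos`.
/// Pair (i, i + d/2) is rotated by angle pos * theta^(-2i/d).
fn rope_neox(x: &[f32], pos: usize, theta: f32) -> Vec<f32> {
    let d = x.len();
    let half = d / 2;
    let mut out = vec![0.0_f32; d];
    for i in 0..half {
        let freq = theta.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        out[i] = x[i] * cos - x[i + half] * sin;
        out[i + half] = x[i] * sin + x[i + half] * cos;
    }
    out
}

/// Largest absolute elementwise difference between two equally sized vectors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(p, q)| (p - q).abs()).fold(0.0, f32::max)
}

fn main() {
    let head_dim = 64;
    let x = vec![1.0_f32; head_dim];
    let pos = 5;

    let y_llama = rope_neox(&x, pos, 10_000.0);
    let y_qwen = rope_neox(&x, pos, 1_000_000.0);

    // Pair 0 has frequency 1 for any theta, so it cannot distinguish the two;
    // the higher pairs rotate by different angles and do.
    println!("pair 0 diff:  {:e}", (y_llama[0] - y_qwen[0]).abs());
    println!("max abs diff: {:e}", max_abs_diff(&y_llama, &y_qwen));
}
```

If a cache served the theta=10000 module to the theta=1000000 call, both outputs would be identical, which is the RED condition the falsifier describes.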
noahgift added a commit that referenced this pull request on May 10, 2026
…ML generation gap (PMAT-CODE-SHIP-TWO-SECTION-61) (#1610)

Records the empirical findings from this session's LIVE-discharge cascade attempt off §60. Two-track outcome:

DIRECT PROMPT (SHIP-002): GREEN. `apr run /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr --prompt "def fib(n):" --max-tokens 128` produces clean fib() Python (`ast.parse` 0 syntax errors, 68 nodes, 1 FunctionDef "fib"). LIVE discharged via PR #1609 (`qwen2-e2e-verification-v1.yaml` v1.10.0 → v1.12.0).

CHATML PROMPT (SHIP-006/008): BLOCKED. Same canonical 7B teacher fails `apr qa golden_output` gate with "gibberish (fragment '\\ns\\ns' repeats 3+ times)" under ChatML wrapper `<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n`. Same model + same engine + different prompt format → different output regime.

The §60 closure proved per-layer FORWARD parity within Q4K tolerance (layer-3 ratio 1.245× ∈ [0.5, 2.0] on canonical 7B). It did NOT prove GENERATION parity under arbitrary prompt distributions. §61 separates these two invariants and surfaces the asymmetry as a NEW finding.

Five-Whys for the §61 amendment:
1. Why is §61 needed? §60 closed forward parity but SHIP-006/008 LIVE-discharge attempts failed empirically.
2. Why didn't ship-% auto-flip 91% → 96%? Forward parity is binding criterion only at the activation-stats level; arg-max sampling under cumulative drift is not directly bounded.
3. Why does prompt format matter? Direct prompts ("def fib(n):") put model in high-confidence next-token regime where small drift doesn't flip arg-max. ChatML prompts (instruction-following, chain-of-thought initialization) put model in low-margin regime where drift CAN flip arg-max.
4. Why record this in spec rather than just fix? The bug is multi-PR scope (special-token handling vs cumulative drift bisection needed). PRED-61-A/B set up the next falsifiable diagnostic step.
5. Why now (durable spec rather than evidence-only)? Each day the spec doesn't reflect the §60 → §61 separation, future sessions may misinterpret §60 closure as full SHIP-007-class discharge.

§61.5 falsifiable predictions:
- PRED-61-A: GGUF + ChatML on canonical 7B → clean output? If GREEN, bug is APR-side in chat-template handling.
- PRED-61-B: APR + direct continuation prompt "What is 2+2? The answer is " (no ChatML wrapper) → clean output? If GREEN, bug is special-token handling NOT cumulative drift.

If both PRED-61-A and PRED-61-B are GREEN, the bug is bounded to "APR + ChatML special-token path": multi-PR scope but tractable.

Changes (1 file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.05.0 → v3.06.0; new banner summarizing §61 (one paragraph, 1 of 5 §17.5 PARTIALs LIVE, SHIP-002 evidence, SHIP-006/008 BLOCKED, PRED-61-A/B set up).
  - New §61 section above §58 (newest-first ordering): 7 sub-sections (61.1 separation table, 61.2 direct-prompt evidence, 61.3 ChatML-prompt evidence, 61.4 §60→§61 separation rationale, 61.5 falsifiable next investigation step, 61.6 ship-% movement, 61.7 what §61 is NOT).

Validation:
- Spec section format consistent with §58 (newest-first, dated, sub-sections numbered §61.X).
- All 6 cascade PRs from this session referenced explicitly (#1604, #1606, #1607, #1608, #1609, this PR).
- Ship-% movement quantified: MODEL-1 91% → 92% (1 of 5 PARTIALs).
- Methodological alignment: zero eprintln!, zero bash workarounds; all evidence captured via existing apr CLI primitives.

Refs:
- evidence/ship-002-discharge-2026-05-10/ (LIVE evidence directory)
- contracts/qwen2-e2e-verification-v1.yaml v1.12.0 (SHIP-002 DISCHARGED)
- contracts/apr-vs-gguf-forward-parity-v1.yaml v1.2.0 (parent PR #1608)
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_test_methodology_can_fake_bugs.md
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #29 PMAT-CODE-SHIP-TWO-SECTION-61.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
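The high-confidence versus low-margin argument in the Five-Whys above can be made concrete with a toy example. This is only an illustration of the hypothesis, not evidence for it: the logits and the drift vector are invented numbers, and `argmax`, `top2_margin`, and `apply_drift` are hypothetical helpers.

```rust
/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, v) in logits.iter().enumerate() {
        if *v > logits[best] {
            best = i;
        }
    }
    best
}

/// Gap between the largest and second-largest logit (needs at least 2 entries).
fn top2_margin(logits: &[f32]) -> f32 {
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a));
    sorted[0] - sorted[1]
}

/// Add a small per-logit perturbation, standing in for numerical drift.
fn apply_drift(logits: &[f32], drift: &[f32]) -> Vec<f32> {
    logits.iter().zip(drift).map(|(l, d)| l + d).collect()
}

fn main() {
    // Invented numbers: same drift applied to two different regimes.
    let drift = [-0.03_f32, 0.03, 0.0];
    let high_confidence = [8.0_f32, 1.0, 0.5]; // large top-2 margin
    let low_margin = [2.0_f32, 1.98, 0.5]; // top two nearly tied

    for (name, logits) in [("high-confidence", high_confidence), ("low-margin", low_margin)] {
        let drifted = apply_drift(&logits, &drift);
        println!(
            "{name}: margin={:.2}, argmax {} -> {}",
            top2_margin(&logits),
            argmax(&logits),
            argmax(&drifted)
        );
    }
}
```

The same perturbation leaves the first token choice unchanged and flips the second, which is the asymmetry the §61 amendment separates from per-layer forward parity.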
Summary
Cascade follow-up to PR #1604 (Q/K/V bias dispatch) and PR #1606 (RMSNorm eps cache key). Same defect class as #1606: a kernel parameter baked into PTX at emit-time was omitted from the cache key.
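The defect class can be reproduced with a toy cache. The sketch below is an assumption-laden illustration: `PtxCache`, `emit_rope_ptx`, and the two key helpers are hypothetical stand-ins rather than the aprender-train cache API, and the shape values are arbitrary. Only the key formats are taken from this PR.

```rust
use std::collections::HashMap;

/// Toy stand-in for a PTX module cache: string key -> emitted PTX text.
struct PtxCache {
    modules: HashMap<String, String>,
}

impl PtxCache {
    fn new() -> Self {
        Self { modules: HashMap::new() }
    }

    /// Return the cached module for `key`, emitting it only on a miss.
    fn get_or_emit(&mut self, key: &str, emit: impl FnOnce() -> String) -> String {
        if let Some(ptx) = self.modules.get(key) {
            return ptx.clone();
        }
        let ptx = emit();
        self.modules.insert(key.to_string(), ptx.clone());
        ptx
    }
}

/// Stand-in for kernel emission: theta ends up as an immediate in the text.
fn emit_rope_ptx(theta: f32) -> String {
    format!("mov.f32 %f_theta, 0f{:08X};", theta.to_bits())
}

/// Old key shape: theta is not part of the key.
fn key_without_theta(num_heads: u32, head_dim: u32) -> String {
    format!("batched_rope_fwd_{num_heads}_{head_dim}")
}

/// New key shape: the bit pattern of theta is appended.
fn key_with_theta(num_heads: u32, head_dim: u32, seq_len: u32, theta: f32) -> String {
    let theta_bits = theta.to_bits();
    format!("batched_rope_fwd_{num_heads}_{head_dim}_{seq_len}_th{theta_bits:08x}")
}

fn main() {
    let (llama_theta, qwen_theta) = (10_000.0_f32, 1_000_000.0_f32);

    // Theta omitted: the second model is served the first model's module.
    let mut stale = PtxCache::new();
    let key = key_without_theta(14, 64);
    let first = stale.get_or_emit(&key, || emit_rope_ptx(llama_theta));
    let second = stale.get_or_emit(&key, || emit_rope_ptx(qwen_theta));
    println!("theta omitted, second call reuses first module: {}", first == second);

    // Theta in the key: each theta resolves to its own cache slot.
    let mut fixed = PtxCache::new();
    let a = fixed.get_or_emit(&key_with_theta(14, 64, 128, llama_theta), || emit_rope_ptx(llama_theta));
    let b = fixed.get_or_emit(&key_with_theta(14, 64, 128, qwen_theta), || emit_rope_ptx(qwen_theta));
    println!("theta in key, modules differ: {}", a != b);
}
```

Keying on `to_bits()` rather than a formatted decimal avoids rounding two nearby float values to the same string, at the cost of a less readable key.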
Five-Whys
- `RopeNeoxKernel`, `BatchedRopeKernel`, `BatchedRopeBackwardKernel` capture `self.theta` into `build_ptx` (`mov.f32 imm`).
- Cache keys were `batched_rope_fwd_{num_heads}_{head_dim}`: theta omitted.

Provable Contract
New: `contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml` (ACTIVE_ALGORITHM_LEVEL).
Two ship-blocking falsifiers:
- FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-001: distinct theta values produce distinct outputs (>1e-3 max-abs diff).
- FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-002: source audit; every RoPE wrapper cache key and the pre-warm key includes `_th{theta_bits:08x}`.

Implementation
Cache keys now include theta_bits at all 3 RoPE wrappers + the pre-warm:
- `rope_neox_fwd_{nh}_{hd}_th{theta_bits:08x}`
- `batched_rope_fwd_{nh}_{hd}_{seq_len}_th{theta_bits:08x}`
- `batched_rope_bwd_{nh}_{hd}_{seq_len}_th{theta_bits:08x}`
- pre-warm key in `cache.rs` aligned with runtime

Test Plan
- `falsify_cuda_rope_theta_cache_key_distinct_thetas_yield_distinct_outputs` GREEN locally (lambda-vector RTX 4090): Llama theta vs Qwen theta produce distinct outputs as expected
- `pv validate contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml`: 0 errors
- `cargo fmt --all -- --check`: clean
- `cargo check -p aprender-train --features cuda`: 0 errors

Stacking
Builds on `fix/cuda-rmsnorm-eps-parity` (PR #1606). After both merge, cleanly rebases.

Ship-% Movement
NONE — this is a latent-bug hygiene fix. Qwen-only training was already self-consistent at theta=1e6. Guards against multi-model test contamination and future Llama variants.
🤖 Generated with Claude Code