feat(format): dimension-independent-kernels-v1 + distributed-training-v1 4-gate PARTIAL discharge #1399
Closed
noahgift wants to merge 2 commits into
Conversation
…-v1 4-gate PARTIAL discharge

Bundles two unrelated 2-gate sister contracts:

dimension-independent-kernels-v1 (FALSIFY-DIMENSION_INDEPENDENT_KERNELS_V1_001..002):
- DIM-001: dim-independent output ≈ specialized output within 1e-5
- DIM-002: kernel binary loaded once, M/K/N passed at launch

distributed-training-v1 (FALSIFY-DISTRIBUTED_TRAINING_V1_001..002):
- DIST-001: every rank's params bit-equal to rank 0 after sync
- DIST-002: distributed loss ≈ single-worker loss within 1e-4

## Five Whys

1. Why bundle these two contracts? Both are peripheral and span the GPU-kernel-parameterization + distributed-training coverage band; one verdict module captures both without provenance-pin overhead.
2. Why does this block ship? Coverage % cannot move while these peripheral contracts are unbound at PARTIAL_ALGORITHM_LEVEL.
3. Why bit-exact (`to_bits()`) for DIST-001 rather than f32-tolerant? The contract says "params identical across workers." The distributed-training all-reduce is a deterministic operation, so any drift between ranks means a sync bug, not float rounding. ULP-strict comparison catches the regression class "all-reduce silently dropped a gradient on one rank."
4. Why the looser 1e-4 for DIST-002 vs 1e-5 for DIM-001? DIM-001 compares two pure GEMM outputs (one path, one numeric ordering). DIST-002 compares full training loss across worker counts, with different reduction orders and different batch boundaries. The wider tolerance absorbs reduction-order drift while still catching real divergence.
5. Why fail-on-zero-launches for DIM-002? A vacuous pass when `launch_count == 0` would mask "the kernel was never dispatched at all"; that is a different bug than "kernel was recompiled per launch," but equally a regression in the dispatch path. Failing on zero forces the call site to actually exercise the kernel before claiming the no-recompile gate is satisfied.

Adds 20 unit tests including a 7-bucket loss-delta sweep on DIST-002.
Realistic-healthy walks the canonical 4-rank training state; pre-fix walks 4 simultaneous regressions. No runtime % shift; algorithm-level coverage advances by 4 gates.
dedcc35 to abfd9d0
auto-merge was automatically disabled (May 12, 2026 09:20)
Pull request was closed
Summary
Bundles two unrelated 2-gate sister contracts:
- dimension-independent-kernels-v1 (FALSIFY-DIM-001..002): output equivalence vs specialized kernel, no per-launch recompile
- distributed-training-v1 (FALSIFY-DIST-001..002): gradient sync, loss equivalence

20 unit tests, including a 7-bucket loss-delta sweep on DIST-002.
Algorithm-level coverage advances by 4 gates; runtime ship % unchanged.
Gates bound

| Gate | Check |
| --- | --- |
| DIM-001 | output within 1e-5 of specialized kernel |
| DIM-002 | `load_count == 1`, `launch_count > 0` |
| DIST-001 | params bit-equal to rank 0 |
| DIST-002 | loss within 1e-4 of single-worker |

Five Whys
See commit message — captures bit-exact for DIST-001, looser tolerance for DIST-002 vs DIM-001, and fail-on-zero-launches for DIM-002.
Test plan
`cargo test -p aprender-core --lib dim_dist`: 20 passed

🤖 Generated with Claude Code