feat(format): dimension-independent-kernels-v1 + distributed-training-v1 4-gate PARTIAL discharge #1399
Closed
noahgift wants to merge 2 commits into
Conversation
…-v1 4-gate PARTIAL discharge

Bundles two unrelated 2-gate sister contracts:

dimension-independent-kernels-v1 (FALSIFY-DIMENSION_INDEPENDENT_KERNELS_V1_001..002):
- DIM-001: dim-independent output ≈ specialized output within 1e-5
- DIM-002: kernel binary loaded once, M/K/N passed at launch

distributed-training-v1 (FALSIFY-DISTRIBUTED_TRAINING_V1_001..002):
- DIST-001: every rank's params bit-equal to rank 0 after sync
- DIST-002: distributed loss ≈ single-worker loss within 1e-4

## Five Whys

1. Why bundle these two contracts? Both are peripheral and span the GPU-kernel-parameterization + distributed-training coverage band; one verdict module captures both without provenance-pin overhead.
2. Why does this block ship? Coverage % cannot move while these peripheral contracts are unbound at PARTIAL_ALGORITHM_LEVEL.
3. Why bit-exact (`to_bits()`) for DIST-001 rather than f32-tolerant? The contract says "params identical across workers." The distributed-training all-reduce is a deterministic operation, so any drift between ranks means a sync bug, not float rounding. ULP-strict comparison catches the regression class "all-reduce silently dropped a gradient on one rank."
4. Why the looser 1e-4 for DIST-002 vs 1e-5 for DIM-001? DIM-001 compares two pure GEMM outputs (one path, one numeric ordering). DIST-002 compares full training loss across worker counts, with different reduction orders and different batch boundaries. The wider tolerance absorbs reduction-order drift while still catching real divergence.
5. Why fail-on-zero-launches for DIM-002? A vacuous pass when `launch_count == 0` would mask "the kernel was never dispatched at all"; that is a different bug than "kernel was recompiled per launch," but equally a regression in the dispatch path. Failing on zero forces the call site to actually exercise the kernel before claiming the no-recompile gate is satisfied.

Adds 20 unit tests including a 7-bucket loss-delta sweep on DIST-002.
Realistic-healthy walks the canonical 4-rank training state; pre-fix walks 4 simultaneous regressions. No runtime % shift; algorithm-level coverage advances by 4 gates.
dedcc35 to abfd9d0
auto-merge was automatically disabled (May 12, 2026 09:20)
Pull request was closed
Summary
Bundles two unrelated 2-gate sister contracts:
- dimension-independent-kernels-v1 (FALSIFY-DIM-001..002): output equivalence vs specialized kernel, no per-launch recompile
- distributed-training-v1 (FALSIFY-DIST-001..002): gradient sync, loss equivalence

20 unit tests, including a 7-bucket loss-delta sweep on DIST-002.
Algorithm-level coverage advances by 4 gates; runtime ship % unchanged.
Gates bound

| Gate | Check |
| --- | --- |
| DIM-001 | output within 1e-5 of specialized kernel |
| DIM-002 | `load_count == 1`, `launch_count > 0` |
| DIST-001 | params bit-equal to rank 0 |
| DIST-002 | loss within 1e-4 of single-worker |

Five Whys
See commit message — captures bit-exact for DIST-001, looser tolerance for DIST-002 vs DIM-001, and fail-on-zero-launches for DIM-002.
Test plan
`cargo test -p aprender-core --lib dim_dist`: 20 passed

🤖 Generated with Claude Code