
Stage 1: Huginn/MoEUT/Parcae harmony knobs (default-off) #1

Open

supernavyl wants to merge 1 commit into fixes/scan-remediation from stage1/harmonize


@supernavyl (Owner)

Summary

Stage 1 of the harmony/sophistication evolution described in the 2026-04-22 /research brief (EXHAUSTIVE, 71% confidence). Six default-off config flags wire in the highest-leverage Huginn / MoEUT / Parcae primitives without touching existing checkpoint compatibility. All 59 tests pass, including the 23 new ones.

Stacked on top of kyegomez#41 (fixes/scan-remediation). Keep that merge order; rebase this branch onto main once kyegomez#41 merges.

What changed

Config knobs (MythosConfig, default-off)

| Flag | Default | Purpose | Source |
|------|---------|---------|--------|
| norm_pattern | "pre" | "sandwich" adds post-sublayer RMSNorms | Ding 2021, Huginn |
| use_qk_norm | False | RMSNorm on Q and K before the attention matmul | Henry 2020, ViT-22B |
| bptt_truncate_k | 0 | Truncated BPTT through the last k recurrent loops | Huginn (k=8) |
| recurrent_kv_stride | 0 | KV-cache sharing across loops at i mod stride | Huginn (stride=16) |
| loop_sample_mean / loop_sample_sigma | 0.0 / 0.5 | Log-normal-Poisson per-step n_loops sampler | Huginn (mean=32) |
| convergence_threshold | 0.0 | Per-position hidden-state-delta early halt (KL proxy) | Huginn (5e-4) |

Each flag has matching __post_init__ validation; a bad value fails at config construction, not mid-step.
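
A minimal sketch of how these knobs and their guards could sit on MythosConfig. The field names and defaults come from the table above; the exact guard conditions, error messages, and the rest of the config are assumptions:

```python
from dataclasses import dataclass

@dataclass
class MythosConfig:
    # Stage 1 harmony knobs (all default-off; the real config has many more fields)
    norm_pattern: str = "pre"            # "pre" | "sandwich"
    use_qk_norm: bool = False            # RMSNorm on Q/K before the matmul
    bptt_truncate_k: int = 0             # 0 = full BPTT through the loop
    recurrent_kv_stride: int = 0         # 0 = no KV sharing across loops
    loop_sample_mean: float = 0.0        # 0.0 = per-step sampler disabled
    loop_sample_sigma: float = 0.5
    convergence_threshold: float = 0.0   # 0.0 = no early halt

    def __post_init__(self):
        # Bad values fail here, at construction, never mid-step.
        if self.norm_pattern not in ("pre", "sandwich"):
            raise ValueError(f"norm_pattern must be 'pre' or 'sandwich', got {self.norm_pattern!r}")
        if self.bptt_truncate_k < 0 or self.recurrent_kv_stride < 0:
            raise ValueError("bptt_truncate_k and recurrent_kv_stride must be >= 0")
        if self.loop_sample_mean < 0.0 or self.loop_sample_sigma <= 0.0:
            raise ValueError("loop_sample_mean must be >= 0 and loop_sample_sigma > 0")
        if self.convergence_threshold < 0.0:
            raise ValueError("convergence_threshold must be >= 0")
```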

Architecture wiring

  • TransformerBlock: post_attn_norm / post_ffn_norm become RMSNorm(dim) under norm_pattern="sandwich", otherwise nn.Identity(). Bit-exact to pre-PR behavior when off (first sketch after this list).
  • GQAttention / MLAttention: optional q_norm_qk / k_norm_qk over head_dim; applied after RoPE (GQA) and after the nope/rope concat (MLA) so rotation magnitudes are preserved. A new cache_reuse parameter lets non-canonical strided loops read shared K/V without rewriting the cache entry (second sketch after this list).
  • RecurrentBlock (third sketch after this list):
    • Early iterations (t < n_loops - k) run under torch.no_grad(); h is detached entering the grad region so backward graphs attach at exactly k loops regardless of forward depth.
    • cache_key is remapped to recurrent_loop_{t % stride} and cache_reuse=True is set for t ≥ stride.
    • Convergence halt is expressed as p = where(converged, 1.0, p), funneling into the existing ACT remainder path — no separate halt mechanism.
    • ACT halted.all() early break is disabled when bptt_truncate_k > 0 so the grad region always runs.
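
First sketch: the sandwich-norm wiring. The post-norm placement follows the bullet above; the pre-norms and the attn/ffn sublayers are placeholders, and nn.RMSNorm assumes PyTorch >= 2.4:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, cfg, dim, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn       # existing sublayers, unchanged
        self.attn_norm = nn.RMSNorm(dim)      # existing pre-norms
        self.ffn_norm = nn.RMSNorm(dim)
        sandwich = cfg.norm_pattern == "sandwich"
        # nn.Identity() under "pre", so the forward stays bit-exact to pre-PR
        self.post_attn_norm = nn.RMSNorm(dim) if sandwich else nn.Identity()
        self.post_ffn_norm = nn.RMSNorm(dim) if sandwich else nn.Identity()

    def forward(self, x):
        x = x + self.post_attn_norm(self.attn(self.attn_norm(x)))
        x = x + self.post_ffn_norm(self.ffn(self.ffn_norm(x)))
        return x
```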
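
Second sketch: the QK-norm ordering. Only the normalize-after-RoPE placement is taken from the bullet; the tensor shapes and the rope helper are assumptions:

```python
import torch
import torch.nn as nn

head_dim = 64
q_norm_qk = nn.RMSNorm(head_dim)   # attached only when cfg.use_qk_norm
k_norm_qk = nn.RMSNorm(head_dim)

def attend(q, k, v, rope):
    # q, k, v: (batch, n_heads, seq, head_dim)
    q, k = rope(q), rope(k)
    # normalizing after RoPE preserves the rotation magnitudes
    q, k = q_norm_qk(q), k_norm_qk(k)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```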
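
Third sketch: the recurrent loop's control flow (no_grad/detach boundary, strided cache keys, where-based halt). block() and the ACT scalar p are placeholders, the unstrided cache-key name is a guess, and the ACT early break that this PR disables under truncation is omitted:

```python
import torch

def recurrent_forward(block, h, p, n_loops, cfg):
    k, stride = cfg.bptt_truncate_k, cfg.recurrent_kv_stride
    grad_start = max(n_loops - k, 0) if k > 0 else 0
    for t in range(n_loops):
        # strided KV sharing: alias the cache key, reuse K/V once warmed up
        key = f"recurrent_loop_{t % stride}" if stride > 0 else f"recurrent_loop_{t}"
        reuse = stride > 0 and t >= stride
        if t < grad_start:
            with torch.no_grad():           # early loops build no graph
                h = block(h, cache_key=key, cache_reuse=reuse)
            continue
        if k > 0 and t == grad_start:
            h = h.detach()                  # graph attaches at exactly k loops
        h_new = block(h, cache_key=key, cache_reuse=reuse)
        if cfg.convergence_threshold > 0.0:
            # per-position hidden-state delta as a KL proxy; force the ACT
            # scalar to 1 where converged so the remainder path halts there
            delta = (h_new - h).norm(dim=-1)
            p = torch.where(delta < cfg.convergence_threshold, torch.ones_like(p), p)
        h = h_new
    return h, p
```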

Training script (training/3b_fine_web_edu.py)

  • Per-step n_loops sampling with a rank-0 draw + dist.broadcast, so grad-accumulation microbatches within the same step agree on depth (sketch after this list).
  • Floored at 1, capped at cfg.max_loop_iters to stay within the LoRA per-loop scale table.
  • Disabled when loop_sample_mean == 0.0 (preserves current behavior).
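
A minimal sketch of the sampler, assuming the standard log-normal mean correction (mu = log(mean) - sigma^2/2, so the Poisson rate averages to loop_sample_mean); the helper name and the exact Huginn parametrization are assumptions:

```python
import math
import torch
import torch.distributed as dist

def sample_n_loops(cfg, device):
    if cfg.loop_sample_mean == 0.0:
        return cfg.max_loop_iters               # sampler disabled: current behavior
    n = torch.zeros(1, dtype=torch.long, device=device)
    if not dist.is_initialized() or dist.get_rank() == 0:
        # log-normal rate with expectation loop_sample_mean, then a Poisson
        # draw around it: a heavy-tailed distribution over loop depth
        mu = math.log(cfg.loop_sample_mean) - 0.5 * cfg.loop_sample_sigma ** 2
        rate = torch.distributions.LogNormal(mu, cfg.loop_sample_sigma).sample()
        n[0] = int(torch.poisson(rate))
    if dist.is_initialized():
        dist.broadcast(n, src=0)                # all microbatches agree on depth
    return int(n.clamp(1, cfg.max_loop_iters))  # floor at 1, cap at the LoRA table
```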

Test plan

  • pytest tests/ — 59/59 pass, including 23 new cases in tests/test_stage1_harmony.py
  • Default-off invariants + __post_init__ guards
  • Sandwich-norm attaches modules and changes forward output (see the example after this list)
  • QK-norm attaches on both GQA and MLA, forward remains finite
  • Truncated-BPTT changes recurrent gradient norm vs. k=0 (proves detach is active)
  • KV-stride collapses unique recurrent cache_keys to exactly stride entries
  • Convergence-threshold halts aggressively without NaN
  • Log-normal-Poisson sampler stays within bounds and tracks target mean
  • All-flags-on combined smoke test (forward + backward + recurrent-grad presence)
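
For flavor, a hypothetical shape of the sandwich-norm case; build_model is a stand-in for however the test suite constructs a model from a MythosConfig:

```python
import torch

def test_sandwich_norm_changes_forward():
    base = build_model(MythosConfig())                      # all flags off
    sand = build_model(MythosConfig(norm_pattern="sandwich"))
    sand.load_state_dict(base.state_dict(), strict=False)   # share base weights
    x = torch.randint(0, 100, (2, 16))
    out_base, out_sand = base(x), sand(x)
    assert torch.isfinite(out_sand).all()
    assert not torch.allclose(out_base, out_sand)
```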

What this does NOT do

Not in this PR (deferred to Stage 2 / Stage 3 per the research brief):

  • MoR per-iteration router
  • RMoE GRU across layer routers
  • Muon optimizer swap
  • MoEUT layer grouping
  • DEQ / implicit backprop (rejected, 88% confidence)

Confidence

Per the research brief: 85% confidence the individual knobs improve stability and parameter-efficiency at R ≥ 16. Untested at ≥ 350M scale — a Stage 2 bench run is required before claiming the Huginn-3.5B headline numbers hold on this architecture.

🤖 Generated with Claude Code

Commit message

All knobs default-off; existing checkpoints and configs are unaffected.

Config additions on MythosConfig (with __post_init__ validation):
- norm_pattern ("pre"|"sandwich") — post-sublayer RMSNorm per Ding 2021
- use_qk_norm — RMSNorm on Q and K before attention matmul (Henry 2020)
- bptt_truncate_k — truncated BPTT over the recurrent loop (Huginn k=8)
- recurrent_kv_stride — share KV at (i mod stride) (Huginn, 30% mem cut)
- loop_sample_mean / loop_sample_sigma — log-normal-Poisson n_loops sampler
- convergence_threshold — hidden-state-delta early halt (KL-proxy, 5e-4)

Implementation touches:
- TransformerBlock: optional post_attn_norm / post_ffn_norm gated by cfg
- GQAttention / MLAttention: optional head_dim RMSNorm on Q/K + cache_reuse
  param that reads shared KV at stride-aliased cache_keys without rewriting
- RecurrentBlock: no_grad wrapping of early iterations with detach at the
  grad-region boundary, cache_key remapping under stride, Huginn-style
  convergence-forced halt via torch.where on the ACT scalar
- Early-break disabled when bptt_truncate_k>0 so gradient always flows
  through the grad-region iterations
- training/3b_fine_web_edu.py: per-step n_loops sampler (rank-0 draw +
  broadcast under DDP), threaded into model(x, n_loops=...)

Tests (tests/test_stage1_harmony.py, 23 cases):
- defaults-off invariants, __post_init__ guard coverage
- sandwich-norm adds modules and changes forward output
- qk-norm attaches modules on GQA and MLA, forward remains finite
- bptt-truncation changes recurrent gradient norm vs. k=0
- kv_stride collapses unique recurrent cache_keys to `stride` entries
- convergence-threshold=1 halts aggressively without NaN
- log-normal-Poisson sampler stays within bounds and tracks target mean
- all-flags-on smoke test: forward + backward + recurrent-grad presence

Builds on PR kyegomez#41. Research brief (/research, EXHAUSTIVE, 71% confidence,
2026-04-22) recommended these six knobs as the highest-leverage Stage 1
integration pass before any Stage 2 bench-at-350M work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>