
Stage 1: Huginn/MoEUT/Parcae harmony knobs (default-off) #1

Open

supernavyl wants to merge 1 commit into fixes/scan-remediation from stage1/harmonize


@supernavyl (Owner)

Summary

Stage 1 of the harmony/sophistication evolution described in the 2026-04-22 /research brief (EXHAUSTIVE, 71% confidence). Six default-off config flags wire in the highest-leverage Huginn / MoEUT / Parcae primitives without touching existing checkpoint compatibility. All 59 tests pass, including the 23 new ones.

Stacked on top of kyegomez#41 (fixes/scan-remediation). Keep that merge order; rebase this branch onto main once kyegomez#41 merges.

What changed

Config knobs (MythosConfig, default-off)

| Flag | Default | Purpose | Source |
|------|---------|---------|--------|
| norm_pattern | "pre" | "sandwich" adds post-sublayer RMSNorms | Ding 2021, Huginn |
| use_qk_norm | False | RMSNorm on Q and K before the attention matmul | Henry 2020, ViT-22B |
| bptt_truncate_k | 0 | Truncated BPTT through the last k recurrent loops | Huginn (k=8) |
| recurrent_kv_stride | 0 | KV-cache sharing across loops at i mod stride | Huginn (stride=16) |
| loop_sample_mean / loop_sample_sigma | 0.0 / 0.5 | Log-normal-Poisson per-step n_loops sampler | Huginn (mean=32) |
| convergence_threshold | 0.0 | Per-position hidden-state-delta early halt (KL proxy) | Huginn (5e-4) |

Each flag has matching __post_init__ validation; a bad value fails at config construction, not mid-step.
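
A minimal sketch of how these knobs and their guards could sit on MythosConfig. The field names and defaults come from the table above; the exact guard conditions, error messages, and the rest of the config are assumptions:

```python
from dataclasses import dataclass

@dataclass
class MythosConfig:
    # Stage 1 harmony knobs (all default-off; the real config has many more fields)
    norm_pattern: str = "pre"            # "pre" | "sandwich"
    use_qk_norm: bool = False            # RMSNorm on Q/K before the matmul
    bptt_truncate_k: int = 0             # 0 = full BPTT through the loop
    recurrent_kv_stride: int = 0         # 0 = no KV sharing across loops
    loop_sample_mean: float = 0.0        # 0.0 = per-step sampler disabled
    loop_sample_sigma: float = 0.5
    convergence_threshold: float = 0.0   # 0.0 = no early halt

    def __post_init__(self):
        # Bad values fail here, at construction, never mid-step.
        if self.norm_pattern not in ("pre", "sandwich"):
            raise ValueError(f"norm_pattern must be 'pre' or 'sandwich', got {self.norm_pattern!r}")
        if self.bptt_truncate_k < 0 or self.recurrent_kv_stride < 0:
            raise ValueError("bptt_truncate_k and recurrent_kv_stride must be >= 0")
        if self.loop_sample_mean < 0.0 or self.loop_sample_sigma <= 0.0:
            raise ValueError("loop_sample_mean must be >= 0 and loop_sample_sigma > 0")
        if self.convergence_threshold < 0.0:
            raise ValueError("convergence_threshold must be >= 0")
```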

Architecture wiring

  • TransformerBlock: post_attn_norm / post_ffn_norm become RMSNorm(dim) under norm_pattern="sandwich", otherwise nn.Identity(). Bit-exact to pre-PR behavior when off (first sketch after this list).
  • GQAttention / MLAttention: optional q_norm_qk / k_norm_qk over head_dim; applied after RoPE (GQA) and after the nope/rope concat (MLA) so rotation magnitudes are preserved. A new cache_reuse parameter lets non-canonical strided loops read shared K/V without rewriting the cache entry (second sketch after this list).
  • RecurrentBlock (third sketch after this list):
    • Early iterations (t < n_loops - k) run under torch.no_grad(); h is detached entering the grad region so backward graphs attach at exactly k loops regardless of forward depth.
    • cache_key is remapped to recurrent_loop_{t % stride} and cache_reuse=True is set for t ≥ stride.
    • Convergence halt is expressed as p = where(converged, 1.0, p), funneling into the existing ACT remainder path — no separate halt mechanism.
    • ACT halted.all() early break is disabled when bptt_truncate_k > 0 so the grad region always runs.
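
First sketch: the sandwich-norm wiring. The post-norm placement follows the bullet above; the pre-norms and the attn/ffn sublayers are placeholders, and nn.RMSNorm assumes PyTorch >= 2.4:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, cfg, dim, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn       # existing sublayers, unchanged
        self.attn_norm = nn.RMSNorm(dim)      # existing pre-norms
        self.ffn_norm = nn.RMSNorm(dim)
        sandwich = cfg.norm_pattern == "sandwich"
        # nn.Identity() under "pre", so the forward stays bit-exact to pre-PR
        self.post_attn_norm = nn.RMSNorm(dim) if sandwich else nn.Identity()
        self.post_ffn_norm = nn.RMSNorm(dim) if sandwich else nn.Identity()

    def forward(self, x):
        x = x + self.post_attn_norm(self.attn(self.attn_norm(x)))
        x = x + self.post_ffn_norm(self.ffn(self.ffn_norm(x)))
        return x
```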
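
Second sketch: the QK-norm ordering. Only the normalize-after-RoPE placement is taken from the bullet; the tensor shapes and the rope helper are assumptions:

```python
import torch
import torch.nn as nn

head_dim = 64
q_norm_qk = nn.RMSNorm(head_dim)   # attached only when cfg.use_qk_norm
k_norm_qk = nn.RMSNorm(head_dim)

def attend(q, k, v, rope):
    # q, k, v: (batch, n_heads, seq, head_dim)
    q, k = rope(q), rope(k)
    # normalizing after RoPE preserves the rotation magnitudes
    q, k = q_norm_qk(q), k_norm_qk(k)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```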
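
Third sketch: the recurrent loop's control flow (no_grad/detach boundary, strided cache keys, where-based halt). block() and the ACT scalar p are placeholders, the unstrided cache-key name is a guess, and the ACT early break that this PR disables under truncation is omitted:

```python
import torch

def recurrent_forward(block, h, p, n_loops, cfg):
    k, stride = cfg.bptt_truncate_k, cfg.recurrent_kv_stride
    grad_start = max(n_loops - k, 0) if k > 0 else 0
    for t in range(n_loops):
        # strided KV sharing: alias the cache key, reuse K/V once warmed up
        key = f"recurrent_loop_{t % stride}" if stride > 0 else f"recurrent_loop_{t}"
        reuse = stride > 0 and t >= stride
        if t < grad_start:
            with torch.no_grad():           # early loops build no graph
                h = block(h, cache_key=key, cache_reuse=reuse)
            continue
        if k > 0 and t == grad_start:
            h = h.detach()                  # graph attaches at exactly k loops
        h_new = block(h, cache_key=key, cache_reuse=reuse)
        if cfg.convergence_threshold > 0.0:
            # per-position hidden-state delta as a KL proxy; force the ACT
            # scalar to 1 where converged so the remainder path halts there
            delta = (h_new - h).norm(dim=-1)
            p = torch.where(delta < cfg.convergence_threshold, torch.ones_like(p), p)
        h = h_new
    return h, p
```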

Training script (training/3b_fine_web_edu.py)

  • Per-step n_loops sampling with a rank-0 draw + dist.broadcast, so grad-accumulation microbatches within the same step agree on depth (sketch after this list).
  • Floored at 1, capped at cfg.max_loop_iters to stay within the LoRA per-loop scale table.
  • Disabled when loop_sample_mean == 0.0 (preserves current behavior).
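
A minimal sketch of the sampler, assuming the standard log-normal mean correction (mu = log(mean) - sigma^2/2, so the Poisson rate averages to loop_sample_mean); the helper name and the exact Huginn parametrization are assumptions:

```python
import math
import torch
import torch.distributed as dist

def sample_n_loops(cfg, device):
    if cfg.loop_sample_mean == 0.0:
        return cfg.max_loop_iters               # sampler disabled: current behavior
    n = torch.zeros(1, dtype=torch.long, device=device)
    if not dist.is_initialized() or dist.get_rank() == 0:
        # log-normal rate with expectation loop_sample_mean, then a Poisson
        # draw around it: a heavy-tailed distribution over loop depth
        mu = math.log(cfg.loop_sample_mean) - 0.5 * cfg.loop_sample_sigma ** 2
        rate = torch.distributions.LogNormal(mu, cfg.loop_sample_sigma).sample()
        n[0] = int(torch.poisson(rate))
    if dist.is_initialized():
        dist.broadcast(n, src=0)                # all microbatches agree on depth
    return int(n.clamp(1, cfg.max_loop_iters))  # floor at 1, cap at the LoRA table
```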

Test plan

  • pytest tests/ — 59/59 pass, including 23 new cases in tests/test_stage1_harmony.py
  • Default-off invariants + __post_init__ guards
  • Sandwich-norm attaches modules and changes forward output (see the example after this list)
  • QK-norm attaches on both GQA and MLA, forward remains finite
  • Truncated-BPTT changes recurrent gradient norm vs. k=0 (proves detach is active)
  • KV-stride collapses unique recurrent cache_keys to exactly stride entries
  • Convergence-threshold halts aggressively without NaN
  • Log-normal-Poisson sampler stays within bounds and tracks target mean
  • All-flags-on combined smoke test (forward + backward + recurrent-grad presence)
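
For flavor, a hypothetical shape of the sandwich-norm case; build_model is a stand-in for however the test suite constructs a model from a MythosConfig:

```python
import torch

def test_sandwich_norm_changes_forward():
    base = build_model(MythosConfig())                      # all flags off
    sand = build_model(MythosConfig(norm_pattern="sandwich"))
    sand.load_state_dict(base.state_dict(), strict=False)   # share base weights
    x = torch.randint(0, 100, (2, 16))
    out_base, out_sand = base(x), sand(x)
    assert torch.isfinite(out_sand).all()
    assert not torch.allclose(out_base, out_sand)
```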

What this does NOT do

Not in this PR (deferred to Stage 2 / Stage 3 per the research brief):

  • MoR per-iteration router
  • RMoE GRU across layer routers
  • Muon optimizer swap
  • MoEUT layer grouping
  • DEQ / implicit backprop (rejected, 88% confidence)

Confidence

Per the research brief: 85% confidence the individual knobs improve stability and parameter-efficiency at R ≥ 16. Untested at ≥ 350M scale — a Stage 2 bench run is required before claiming the Huginn-3.5B headline numbers hold on this architecture.

🤖 Generated with Claude Code

Commit message

All knobs default-off; existing checkpoints and configs are unaffected.

Config additions on MythosConfig (with __post_init__ validation):
- norm_pattern ("pre"|"sandwich") — post-sublayer RMSNorm per Ding 2021
- use_qk_norm — RMSNorm on Q and K before attention matmul (Henry 2020)
- bptt_truncate_k — truncated BPTT over the recurrent loop (Huginn k=8)
- recurrent_kv_stride — share KV at (i mod stride) (Huginn, 30% mem cut)
- loop_sample_mean / loop_sample_sigma — log-normal-Poisson n_loops sampler
- convergence_threshold — hidden-state-delta early halt (KL-proxy, 5e-4)

Implementation touches:
- TransformerBlock: optional post_attn_norm / post_ffn_norm gated by cfg
- GQAttention / MLAttention: optional head_dim RMSNorm on Q/K + cache_reuse
  param that reads shared KV at stride-aliased cache_keys without rewriting
- RecurrentBlock: no_grad wrapping of early iterations with detach at the
  grad-region boundary, cache_key remapping under stride, Huginn-style
  convergence-forced halt via torch.where on the ACT scalar
- Early-break disabled when bptt_truncate_k>0 so gradient always flows
  through the grad-region iterations
- training/3b_fine_web_edu.py: per-step n_loops sampler (rank-0 draw +
  broadcast under DDP), threaded into model(x, n_loops=...)

Tests (tests/test_stage1_harmony.py, 23 cases):
- defaults-off invariants, __post_init__ guard coverage
- sandwich-norm adds modules and changes forward output
- qk-norm attaches modules on GQA and MLA, forward remains finite
- bptt-truncation changes recurrent gradient norm vs. k=0
- kv_stride collapses unique recurrent cache_keys to `stride` entries
- convergence-threshold=1 halts aggressively without NaN
- log-normal-Poisson sampler stays within bounds and tracks target mean
- all-flags-on smoke test: forward + backward + recurrent-grad presence

Builds on PR kyegomez#41. Research brief (/research, EXHAUSTIVE, 71% confidence,
2026-04-22) recommended these six knobs as the highest-leverage Stage 1
integration pass before any Stage 2 bench-at-350M work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>