Stage 1: Huginn/MoEUT/Parcae harmony knobs (default-off) #1
Open
supernavyl wants to merge 1 commit into fixes/scan-remediation from
Conversation
All knobs default-off; existing checkpoints and configs are unaffected.
Config additions on MythosConfig (with __post_init__ validation):
- norm_pattern ("pre"|"sandwich") — post-sublayer RMSNorm per Ding 2021
- use_qk_norm — RMSNorm on Q and K before attention matmul (Henry 2020)
- bptt_truncate_k — truncated BPTT over the recurrent loop (Huginn k=8)
- recurrent_kv_stride — share KV at (i mod stride) (Huginn, 30% mem cut)
- loop_sample_mean / loop_sample_sigma — log-normal-Poisson n_loops sampler
- convergence_threshold — hidden-state-delta early halt (KL-proxy, 5e-4)
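The flag set above could be sketched as a dataclass. This is a minimal illustration, not the real `MythosConfig`: the defaults follow this description, but the exact guard bounds in `__post_init__` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class MythosConfig:
    # Hypothetical sketch of the six Stage 1 knobs; the real class carries
    # many more fields. Defaults per the PR description (all knobs off).
    norm_pattern: str = "pre"           # "pre" | "sandwich"
    use_qk_norm: bool = False           # RMSNorm on Q/K before the attention matmul
    bptt_truncate_k: int = 0            # 0 = full BPTT through the recurrent loop
    recurrent_kv_stride: int = 0        # 0 = no KV sharing across loop iterations
    loop_sample_mean: float = 0.0       # 0.0 = n_loops sampler off
    loop_sample_sigma: float = 0.5
    convergence_threshold: float = 0.0  # 0.0 = no convergence-forced halt

    def __post_init__(self) -> None:
        # Fail at construction, not mid-step (guard bounds are illustrative).
        if self.norm_pattern not in ("pre", "sandwich"):
            raise ValueError(f"norm_pattern must be 'pre' or 'sandwich', got {self.norm_pattern!r}")
        if self.bptt_truncate_k < 0:
            raise ValueError("bptt_truncate_k must be >= 0")
        if self.recurrent_kv_stride < 0:
            raise ValueError("recurrent_kv_stride must be >= 0")
        if self.loop_sample_mean < 0.0 or self.loop_sample_sigma < 0.0:
            raise ValueError("loop sampler parameters must be non-negative")
        if self.convergence_threshold < 0.0:
            raise ValueError("convergence_threshold must be >= 0")
```

Constructing `MythosConfig()` with no arguments reproduces the default-off invariant the tests rely on.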
Implementation touches:
- TransformerBlock: optional post_attn_norm / post_ffn_norm gated by cfg
- GQAttention / MLAttention: optional head_dim RMSNorm on Q/K + cache_reuse
param that reads shared KV at stride-aliased cache_keys without rewriting
- RecurrentBlock: no_grad wrapping of early iterations with detach at the
grad-region boundary, cache_key remapping under stride, Huginn-style
convergence-forced halt via torch.where on the ACT scalar
- Early-break disabled when bptt_truncate_k>0 so gradient always flows
through the grad-region iterations
- training/3b_fine_web_edu.py: per-step n_loops sampler (rank-0 draw +
broadcast under DDP), threaded into model(x, n_loops=...)
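The no_grad wrapping with a detach at the grad-region boundary might look like the following minimal sketch. The function name and signature are hypothetical, and the cache-key remapping, ACT halting, and convergence check of the real RecurrentBlock are omitted.

```python
import torch


def truncated_recurrent_forward(h, step_fn, n_loops: int, bptt_truncate_k: int):
    # Hedged sketch of truncated BPTT over a recurrent loop: the first
    # n_loops - k iterations run gradient-free, then the hidden state is
    # detached so backward attaches to exactly k iterations regardless of
    # forward depth. k = 0 means full BPTT (no truncation).
    k = bptt_truncate_k if bptt_truncate_k > 0 else n_loops
    grad_start = max(n_loops - k, 0)
    if grad_start > 0:
        with torch.no_grad():
            for _ in range(grad_start):
                h = step_fn(h)
        h = h.detach()  # boundary: graph starts here
    for _ in range(grad_start, n_loops):
        h = step_fn(h)
    return h
```

With a scalar step `h * w`, eight loops and `bptt_truncate_k=2` yield a gradient through exactly two multiplications, matching the "changes recurrent gradient norm vs. k=0" test above in spirit.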
Tests (tests/test_stage1_harmony.py, 23 cases):
- defaults-off invariants, __post_init__ guard coverage
- sandwich-norm adds modules and changes forward output
- qk-norm attaches modules on GQA and MLA, forward remains finite
- bptt-truncation changes recurrent gradient norm vs. k=0
- kv_stride collapses unique recurrent cache_keys to `stride` entries
- convergence-threshold=1 halts aggressively without NaN
- log-normal-Poisson sampler stays within bounds and tracks target mean
- all-flags-on smoke test: forward + backward + recurrent-grad presence
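The sampler bounds/mean behavior tested above can be illustrated with a standalone draw. The function name, the mean-correcting `mu`, and the Knuth inversion step are assumptions about the implementation, and the rank-0 draw + `dist.broadcast` under DDP is elided.

```python
import math
import random


def sample_n_loops(mean: float, sigma: float, max_iters: int, rng: random.Random) -> int:
    """Sketch of a log-normal-Poisson depth draw: a log-normal rate around
    `mean` feeds a Poisson sample, clamped to [1, max_iters]."""
    if mean <= 0.0:
        return max_iters  # sampler off: use the configured depth
    # mu chosen so the log-normal rate has expectation `mean`
    mu = math.log(mean) - 0.5 * sigma ** 2
    rate = rng.lognormvariate(mu, sigma)
    # Knuth's inversion method for a Poisson draw (fine at these small rates)
    threshold, k, p = math.exp(-rate), 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return max(1, min(k - 1, max_iters))
```

Averaged over many steps the draws track the target mean while every individual draw stays within the clamp, which is the invariant the sampler test checks.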
Builds on PR kyegomez#41. Research brief (/research, EXHAUSTIVE, 71% confidence,
2026-04-22) recommended these six knobs as the highest-leverage Stage 1
integration pass before any Stage 2 bench-at-350M work.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Stage 1 of the harmony/sophistication evolution described by the 2026-04-22 /research brief (EXHAUSTIVE, 71% confidence). Six default-off config flags wire in the highest-leverage Huginn / MoEUT / Parcae primitives without touching existing checkpoint compat. All 59 pre-existing + 23 new tests pass. Stacks on top of kyegomez#41 (fixes/scan-remediation). Keep in that order; rebase cleanly onto main once kyegomez#41 merges.
What changed
Config knobs (`MythosConfig`, default-off)

| Flag | Default | Effect when enabled |
|---|---|---|
| `norm_pattern` | `"pre"` | `"sandwich"` adds post-sublayer RMSNorms |
| `use_qk_norm` | `False` | RMSNorm on Q and K before the attention matmul |
| `bptt_truncate_k` | `0` | truncated BPTT over the recurrent loop |
| `recurrent_kv_stride` | `0` | share KV at `i mod stride` |
| `loop_sample_mean / _sigma` | `0.0 / 0.5` | log-normal-Poisson `n_loops` sampler |
| `convergence_threshold` | `0.0` | hidden-state-delta early halt |

Each flag has matching `__post_init__` validation; a bad value fails at config construction, not mid-step.
Architecture wiring
- `TransformerBlock`: `post_attn_norm` / `post_ffn_norm` become `RMSNorm(dim)` under `norm_pattern="sandwich"`, otherwise `nn.Identity()`. Bit-exact to pre-PR behavior when off.
- `GQAttention` / `MLAttention`: optional `q_norm_qk` / `k_norm_qk` over `head_dim`; applied after RoPE (GQA) and after the nope/rope concat (MLA) so rotation magnitudes are preserved. New `cache_reuse` parameter lets non-canonical strided loops read shared K/V without rewriting.
- `RecurrentBlock`: early iterations (`t < n_loops - k`) run under `torch.no_grad()`; `h` is detached entering the grad region so backward graphs attach at exactly k loops regardless of forward depth. `cache_key` is remapped to `recurrent_loop_{t % stride}` and `cache_reuse=True` is set for `t ≥ stride`. The convergence-forced halt is `p = where(converged, 1.0, p)`, funneling into the existing ACT remainder path; no separate halt mechanism. The `halted.all()` early break is disabled when `bptt_truncate_k > 0` so the grad region always runs.
Training script (`training/3b_fine_web_edu.py`)
- Per-step `n_loops` sampling with a rank-0 draw + `dist.broadcast` so grad-accumulation microbatches within the same step agree on depth.
- Draws are clamped to `cfg.max_loop_iters` to stay within the LoRA per-loop scale table.
- Sampler is off when `loop_sample_mean == 0.0` (preserves current behavior).
Test plan
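The `torch.where` halt can be sketched in a few lines. The delta metric shown is an assumption (the commit message describes a KL-proxy on the hidden-state delta), and the function name is hypothetical.

```python
import torch


def convergence_forced_halt(p, h_prev, h_cur, threshold: float):
    # Sketch of the convergence-forced halt: where the per-position hidden
    # state change drops below `threshold`, force the ACT halting scalar to
    # 1.0 so the existing remainder path takes over. No separate mechanism.
    delta = (h_cur - h_prev).abs().mean(dim=-1)  # change proxy per position
    converged = delta < threshold
    return torch.where(converged, torch.ones_like(p), p)
```

Positions whose hidden state has stopped moving get `p = 1.0`; everything else keeps its learned halting probability, so the ACT math downstream is untouched.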
- `pytest tests/`: 59/59 pass, including the 23 new cases in `tests/test_stage1_harmony.py`
- New cases exercise the `__post_init__` guards and the collapse of recurrent cache keys to `stride` entries, among the invariants listed in the commit message above.
What this does NOT do
Not in this PR (deferred to Stage 2 / Stage 3 per the research brief):
Confidence
Per the research brief: 85% confidence the individual knobs improve stability and parameter-efficiency at R ≥ 16. Untested at ≥ 350M scale — a Stage 2 bench run is required before claiming the Huginn-3.5B headline numbers hold on this architecture.
🤖 Generated with Claude Code