Add ablation flags (disable_act, break_recurrence, etc.) to enable depth-extrapolation training#58
Conversation
Adds five opt-in knobs on `MythosConfig` that expose the mechanisms controlling how each recurrent loop iteration differs from every other. All defaults preserve the original architecture exactly; the diff is a zero-impact addition for users who don't set any of them.

- `loop_index_embedding` (default `True`): toggle the sin/cos loop-index signal injected into `h` at each step.
- `use_per_loop_lora` (default `True`): toggle the per-loop LoRA scale.
- `disable_act` (default `False`): return the final-iteration `h` instead of the ACT-weighted sum over all iterations.
- `freeze_moe_router` (default `False`): indicate whether `MoEFFN.router` should be excluded from grad updates; the actual `requires_grad` must be set externally after model construction (kept on the config so training code has a single source of truth).
- `break_recurrence` (default `False`): replace the LTI update `h = A·h_t + B·e + trans_out` with `h = trans_out`, killing the state carry across loops while keeping the transformer block and the `e`-injection via `norm(h + e)`.

The README gets a new "Depth-extrapolation recipe" section documenting the empirical finding that `disable_act` is the only flag (out of the five above plus LTI-parameter freezing) that qualitatively changes the inference-time loop-scaling curve, and giving the specific training recipe (random `n_loops` per step + `disable_act`) that yields a monotonic PPL decrease with inference depth. Full methodology, the 13-run ablation, and loop_scaling plots are tracked in kyegomez#28.

No changes to model semantics at the default flag values; only a new control surface is exposed.
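For concreteness, here is a minimal sketch of the per-iteration update that `break_recurrence` toggles. Names like `transformer_block`, `norm`, `A`, and `B` stand in for the repo's actual modules and parameters; this illustrates the description above rather than reproducing the project's code.

```python
def loop_step(h_t, e, A, B, transformer_block, norm, break_recurrence: bool):
    """One recurrent iteration as described above; all arguments are stand-ins."""
    # The e-injection via norm(h + e) is kept in both modes.
    trans_out = transformer_block(norm(h_t + e))
    if break_recurrence:
        # No state carried across loop iterations: the next state is just trans_out.
        return trans_out
    # Default LTI update: A·h_t + B·e + trans_out (the exact parameterization of A and B
    # lives in the model; written elementwise here purely for illustration).
    return A * h_t + B * e + trans_out
```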
Validation: the recipe extrapolates to 4× training depth

Retrained results (sorted by `n_loops`)

Key signals

Caveat: the curve is monotonic-decreasing-then-saturating. It's not "more inference loops always keeps helping"; past a certain depth the gains flatten out rather than continuing to improve.

Summary: the recipe in this PR's README section extrapolates to 4× the training depth. This validates the PR on its stated goal.
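For reference, a minimal sketch of how a loop-scaling curve like this can be measured. It assumes only that the model's forward accepts an `n_loops` argument; the eval loader and the depth grid are placeholders, not the harness used for the run above.

```python
import math
import torch

@torch.no_grad()
def loop_scaling_curve(model, eval_loader, depths=(4, 8, 16, 32, 64)):
    """Perplexity at several inference depths; returns {n_loops: ppl}."""
    curve = {}
    for n_loops in depths:
        total_nll, total_tokens = 0.0, 0
        for tokens, targets in eval_loader:  # placeholder (token_ids, target_ids) batches
            logits = model(tokens, n_loops=n_loops)
            nll = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
            )
            total_nll += nll.item()
            total_tokens += targets.numel()
        curve[n_loops] = math.exp(total_nll / total_tokens)
    return curve
```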
Hi,

Severity: action required | Category: correctness

How to fix: honor `n_loops` when `disable_act` is set.
Spotted by Qodo code review - free for open-source projects.
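A minimal sketch of what honoring `n_loops` under `disable_act` could look like. The loop structure, `self.step`, the initial state, and the per-iteration ACT weight are hypothetical stand-ins, so this illustrates the intent of the fix rather than the repo's actual code.

```python
import torch

def recurrent_forward(self, e, n_loops: int):
    """Always run the requested number of iterations; disable_act only changes the return."""
    h = torch.zeros_like(e)            # hypothetical initial state
    h_out = torch.zeros_like(e)        # ACT-weighted accumulator
    for t in range(n_loops):           # honor the caller's n_loops in every mode
        h, act_w = self.step(h, e, t)  # one loop iteration plus its ACT weight (stand-in)
        h_out = h_out + act_w * h
    return h if self.config.disable_act else h_out
```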

Summary
Adds five opt-in flags on `MythosConfig` that expose the individual mechanisms controlling how each recurrent loop iteration differs from every other. Defaults preserve the original architecture exactly; this is a zero-semantics-change addition for existing users. The flags exist to make the mechanistic experiments in #28 first-class supported operations rather than external monkey-patches.

| Flag | Default | Effect |
| --- | --- | --- |
| `loop_index_embedding` | `True` | Toggles the sin/cos loop-index signal injected into `h` at each step (turning it off makes loops anonymous) |
| `use_per_loop_lora` | `True` | Toggles the per-loop `LoRAAdapter.scale` application |
| `disable_act` | `False` | Return `h` from the last loop iteration instead of the ACT-weighted sum |
| `freeze_moe_router` | `False` | Signals that the `MoEFFN.router` weight should be frozen at init |
| `break_recurrence` | `False` | Replace `h = A·h_t + B·e + trans_out` with `h = trans_out` (drop LTI state carry) |
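A minimal sketch of how the new fields could sit on the config; the flag names and defaults come from the table above, while the dataclass layout and the elided architecture fields are assumptions.

```python
from dataclasses import dataclass

@dataclass
class MythosConfig:
    # ... existing architecture fields (dims, n_loops, MoE sizes, etc.) elided ...

    # Ablation flags: defaults reproduce the original architecture exactly.
    loop_index_embedding: bool = True   # sin/cos loop-index signal added to h each step
    use_per_loop_lora: bool = True      # apply the per-loop LoRAAdapter.scale
    disable_act: bool = False           # return last-iteration h instead of ACT-weighted sum
    freeze_moe_router: bool = False     # signal only; training code freezes MoEFFN.router
    break_recurrence: bool = False      # h = trans_out instead of A·h_t + B·e + trans_out
```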
A 13-run ablation study (same 117.8M MoE+MLA architecture, same ~491M tokens of FineWeb-Edu, 4 rounds of experiments totaling ~$150 of H100 time; tracked in #28) found that `disable_act` is the only flag out of the five tested that qualitatively changes the inference-time loop-scaling curve from the V-shaped / flat-plateau behaviour of the default architecture to a monotonically decreasing-and-saturating one. Specifically, the combination of `disable_act=True` and stochastic-depth training (sampling `n_loops` uniformly per step from e.g. `{4, 6, 8, 12, 16}`) produces a PPL curve that keeps improving as inference depth grows and then saturates, the only configuration in the 13-run matrix that matches the Saunshi et al. 2025 / Parcae depth-extrapolation shape the README argues for.
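A minimal sketch of that recipe on the training side. It assumes the config object returned by `mythos_1b()` is mutable, that the forward accepts a per-step `n_loops` keyword, and it uses a toy batch in place of a real FineWeb-Edu loader.

```python
import random
import torch
import torch.nn.functional as F
from open_mythos import OpenMythos
from open_mythos.variants import mythos_1b

cfg = mythos_1b()
cfg.disable_act = True                      # return last-loop h instead of the ACT sum
model = OpenMythos(cfg)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

loop_choices = [4, 6, 8, 12, 16]            # stochastic-depth set from the ablation
train_loader = [(torch.randint(0, 1000, (2, 16)),   # toy (tokens, targets) batch;
                 torch.randint(0, 1000, (2, 16)))]  # vocab size 1000 is a placeholder

for tokens, targets in train_loader:
    n_loops = random.choice(loop_choices)   # resample the depth every optimizer step
    logits = model(tokens, n_loops=n_loops)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```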
What this PR does / doesn't do
Does:
- Add the five boolean flags to `MythosConfig`.
- Gate `loop_index_embedding(h, t, …)`, `self.lora(trans_out, t)`, `self.injection(h, e, trans_out)`, and the `return h_out` in `RecurrentBlock.forward` on the relevant flags (see the sketch after this list).

Doesn't:
main.n_loopsargument tomodel(...)per step). See Add experiments/ suite for inference-time loop scaling validation #27 for a referenceexperiments/train.pythat does this with--loop_sample_mode random_set --loop_choices.freeze_moe_routerin a way that changes the model; the flag is a config-level signal that training code can read to callmodel.recurrent.block.ffn.router.weight.requires_grad_(False)after construction. This keeps the model code pure.Test plan
python -c "from open_mythos import OpenMythos; from open_mythos.variants import mythos_1b; m = OpenMythos(mythos_1b())"— no regressions at default flags.requires_grad_(False)) forward and backward end-to-end in bf16 on H100 in the 13-run study in Empirical test of the depth-extrapolation claim: U-shape with fixed-loop training, flat plateau with random-loop training #28.disable_act=True+ stochastic-depth training produces the monotonic loop-scaling curve reported in Empirical test of the depth-extrapolation claim: U-shape with fixed-loop training, flat plateau with random-loop training #28.Related
- Add experiments/ suite for inference-time loop scaling validation #27: the reference `experiments/train.py` and its CLI (`--loop_sample_mode`, `--loop_choices`) for exercising these flags at scale.

Made with Cursor