Remediation: router-bias wiring, EOS packing, numerical stability, trainer hardening, tests #41
Open
supernavyl wants to merge 6 commits into kyegomez:main from
Conversation
…ility

MoEFFN
------
- Vectorized routed-expert dispatch via stable argsort + bincount offsets + index_add_. Replaces the O(topk * n_experts) Python loop with one matmul per expert (O(n_experts) dense ops). Preserves exact semantics - tested against a naive reference loop with 1e-5 tolerance.
- Added router_bias (non-persistent buffer) and expert_load (non-persistent buffer) to implement DeepSeek-V3 aux-loss-free load balancing. Upstream had router_bias but nothing ever updated it - balancing was silently inert.
- update_bias(speed) applies a sign-based update so per-step delta magnitude is bounded by speed regardless of how skewed the load is. Flushes expert_load after each call.
- Softmax upcast to fp32 inside fp16/bf16 autocast; clamp_min(1e-9) on the gate-weight renormalization denominator prevents division by underflow.

OpenMythos
----------
- New update_router_biases(speed=None, ddp=False) method walks every MoEFFN submodule, optionally all-reduces expert_load across ranks, and applies the bias update. Must be called AFTER optimizer.step().

GQAttention / MLAttention
-------------------------
- Softmax upcast to fp32 before cast back to attention dtype. Long-sequence bf16 softmax quantizes the tail and collapses attention toward one-hot or uniform.
- Defensive freqs_cis[:T] slice so standalone callers do not have to pre-slice before passing the full precomputed RoPE buffer.

LTIInjection
------------
- Lower clamp tightened from -20 to -10 in get_A(). At -20, exp(-exp(-20)) rounds to exactly 1.0 in float32, breaking the strict spectral-radius<1 guarantee under adversarial gradient steps. -10 gives a 4.5e-5 margin below 1.0 that comfortably survives fp32 rounding.

MythosConfig
------------
- __post_init__ now validates every hyperparameter at construction time. Bad configs fail now instead of mid-step in a pretraining run.
- Added fields: bias_update_speed, loop_rope_theta, lti_b_init, init_std with sensible defaults.
- _init_weights uses cfg.init_std; router init scaled by 0.1.
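For reference, a minimal sketch of the sign-based update described above. The names `router_bias`, `expert_load`, and `speed` come from this commit; the standalone function form, tensor shapes, and exact flushing order are assumptions, not the literal implementation.

```python
import torch

def update_bias(router_bias: torch.Tensor, expert_load: torch.Tensor, speed: float) -> None:
    # Sketch only: experts that received more tokens than the mean get their bias
    # pushed down, underused experts get it pushed up, and each per-step delta is
    # at most `speed` in magnitude regardless of how skewed the load is.
    mean_load = expert_load.float().mean()
    delta = speed * torch.sign(mean_load - expert_load.float())
    router_bias.add_(delta.to(router_bias.dtype))
    expert_load.zero_()  # flush the accumulator after each call
```

Because only the sign of the imbalance is used, a 1M-token skew and a 10-token skew produce the same bounded step, which is the invariant the router tests pin down.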
MythosTokenizer changes that the trainer and model both depend on:
- vocab_size now returns len(self.tokenizer) (base vocab + added specials), rounded up to vocab_multiple_of (default 128 for tensor-core-friendly embedding widths). HF's tokenizer.vocab_size silently excludes added specials, so a token in that excluded range caused a CUDA device-side assert deep into pretraining. Any nn.Embedding sized from the new property cannot index out of range.
- eos_token_id property with a fallback chain: eos -> bos -> all_special_ids[0] -> None. Used by the FineWeb-Edu packer to inject an explicit boundary token between concatenated documents, so the model never sees cross-document attention without a marker.
- encode() now silently rejects None and non-str inputs (returns []) and truncates at MAX_CHARS_PER_DOC = 4_000_000 before tokenizing. FineWeb-Edu has pathological outliers that stalled DataLoader workers and OOM'd the tokenizer.
- encode_with_eos() method appends the EOS id when defined. Intended for the document packer path.
- trust_remote_code=False pinned explicitly so future transformers versions cannot silently start running remote Python.
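A small sketch of the two behaviors above: rounding the full vocab up to a hardware-friendly multiple, and the EOS fallback chain. The helper names `rounded_vocab_size` and `resolve_eos_id` are hypothetical; the real logic lives on `MythosTokenizer` properties.

```python
from typing import Optional

def rounded_vocab_size(n_tokens: int, vocab_multiple_of: int = 128) -> int:
    # n_tokens should be len(tokenizer): base vocab plus added specials.
    return ((n_tokens + vocab_multiple_of - 1) // vocab_multiple_of) * vocab_multiple_of

def resolve_eos_id(tokenizer) -> Optional[int]:
    # Fallback chain from the commit message: eos -> bos -> all_special_ids[0] -> None.
    if tokenizer.eos_token_id is not None:
        return tokenizer.eos_token_id
    if tokenizer.bos_token_id is not None:
        return tokenizer.bos_token_id
    if tokenizer.all_special_ids:
        return tokenizer.all_special_ids[0]
    return None
```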
MoDA (Mixture-of-Depths Attention) is a parallel research-line architecture, not part of the canonical OpenMythos Prelude/Recurrent/Coda model. Moving it to open_mythos.experimental/ makes that boundary explicit:
- Public API at the package root stays the canonical architecture.
- Experimental components (MoDAConfig, MoDAModel, MoDAAttention, DeepSeekMoE, DeepSeekGate, DeepSeekExpert, RMSNorm, RotaryEmbedding) are importable from open_mythos.experimental with a loud docstring stating no stability guarantees.
- The commented-out smoke test block at the bottom of moda.py is deleted - it was a dead debug scaffold, not a test, and encouraging smoke tests to live as commented-out __main__ blocks teaches the wrong pattern.
Comprehensive hardening of training/3b_fine_web_edu.py for long multi-day FSDP runs where the cost of a crash at step 50k is days of wasted compute.

Correctness and numerics
------------------------
- Per-microstep NaN/Inf loss guard: non-finite micro-losses are skipped (no backward), so one bad sample cannot poison Adam moment buffers. If every microstep in the accumulation window is non-finite, the whole optimizer.step() is skipped but the step counter still ticks so the LR schedule stays monotonic.
- Non-finite grad_norm guard after clipping (ShardedGradScaler handles this for the fp16 path but we enforce it uniformly for bf16 too).
- Aux-loss-free load balancing is finally driven: after every successful optimizer.step() we call model.update_router_biases(ddp=ddp), which all-reduces expert_load across ranks and applies the DeepSeek-V3 bias update. Without this call the balancing mechanism was silently inert.
- EOS injection: tokenization uses encoding.encode_with_eos() so packed documents get a boundary token instead of flowing into each other.
- Micro-batch loss accumulated on-device; single .item() per logging window instead of every microstep.

Mixed precision
---------------
- ShardedGradScaler wired up on the fp16 path (Volta/Pascal). bf16 path runs with FSDP MixedPrecision and no scaler, which is the officially supported combination.
- Scaler state round-trips through checkpoints.

Reproducibility
---------------
- All RNGs seeded (python / numpy / torch / cuda) with per-rank offset for in-process uniqueness. Seed persists through the checkpoint so a resume on a different node draws the same data stream (given the shard is still at the same position).
- Checkpoint carries RNG state, scaler state, torch and cuda versions.

Graceful shutdown
-----------------
- SIGTERM / SIGINT handler marks a cooperative shutdown flag. Main loop polls it between microbatches, breaks cleanly, writes a final atomic checkpoint, barriers, and exits 130. A second signal falls through to default handling so a stuck rank can always be force-killed.

Logging
-------
- loguru rotating file sink: 100 MB per file, 7-day retention, gz compressed, per-rank file. Non-master ranks silence stderr to avoid interleaving chaos but still log to file for post-mortem. Main rank keeps the default stderr sink.
- Exception path captures tracebacks via logger.exception; final save runs in a finally block so a crash still writes the latest state.

Misc
----
- cfg.__post_init__() re-run after mutating vocab_size and max_seq_len so an operator who edits them at the CLI gets a clean error early.
- Directory fsync after atomic rename so the checkpoint is durable across power loss.
- persistent_workers=True on the DataLoader so workers survive between epoch boundaries instead of respawning and re-opening the stream.
- zero_grad(set_to_none=True) to decouple grad lifetime from param lifetime under FSDP.
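The ordering of the guards around optimizer.step() is the key point above. A schematic of one accumulation window, with assumed names (`micro_batches`, a forward that returns a scalar loss) standing in for the real trainer code; the fp16 ShardedGradScaler path is omitted:

```python
import torch

def accumulation_step(model, optimizer, micro_batches, ddp: bool, max_grad_norm: float = 1.0) -> None:
    # Schematic sketch, not the trainer itself.
    finite = 0
    for micro in micro_batches:
        loss = model(**micro) / len(micro_batches)   # assumed: forward returns a scalar loss
        if not torch.isfinite(loss):
            continue                                 # no backward: a bad sample cannot poison Adam moments
        loss.backward()
        finite += 1

    if finite > 0:
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        if torch.isfinite(grad_norm):
            optimizer.step()
            # Only after a successful step: all-reduce expert_load across ranks
            # and apply the sign-based bias update on every MoE layer.
            model.update_router_biases(ddp=ddp)
    optimizer.zero_grad(set_to_none=True)
```

The step counter increments outside this sketch even when the optimizer step is skipped, which is what keeps the LR schedule monotonic.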
…ents.txt
Three dep manifests existed with conflicting constraints (torch "2.11.0"
exact vs >=2.1.0 vs >=2.11.0) and missing entries (numpy / loguru used by
the trainer but declared nowhere).
pyproject.toml (Poetry, library-facing)
- torch >=2.3.0,<3.0.0 (floor set by ShardedGradScaler import path +
torch.amp.autocast device_type= signature)
- transformers >=4.40.0,<5.0.0
- datasets >=2.18.0,<4.0.0
- New [tool.poetry.group.training] group with numpy and loguru for
users who want the pretraining scripts but not a minimal inference
install.
requirements.txt (inference / library use)
- Same ranges as pyproject but in pip-compatible syntax.
training/requirements.txt (pretraining runs)
- Exact pins (torch==2.11.0, transformers==4.46.3, datasets==3.2.0,
loguru==0.7.3) for node-to-node reproducibility.
- Includes the CUDA 12.4 wheel index.
- Documents: when bumping torch here, bump pyproject too and confirm
the FSDP / autocast APIs the trainer uses still exist.
New pytest modules under tests/ that pin the behavior introduced in the
accompanying fix commits so regressions surface in CI instead of at
step 50k of pretraining.
tests/test_config_validation.py (11 tests)
- Every MythosConfig.__post_init__ guard exercised one axis at a time.
- Baseline config is a known-good fixture other tests can override.
tests/test_moe_router.py (9 tests)
- router_bias and expert_load are buffers, not Parameters.
- expert_load accumulates on training forward; not on eval forward.
- update_bias shifts bias toward underused experts and away from
overused ones.
- update_bias flushes expert_load.
- update_bias with speed=0 is a no-op on bias, still flushes load.
- Sign-based update magnitude is bounded by `speed` even for massive
imbalance (spec invariant).
- OpenMythos.update_router_biases walks every MoE layer.
- Vectorized dispatch matches a naive per-token loop at 1e-5 tolerance -
  this is the safety net for the argsort + index_add_ optimization (a
  sketch of this check follows the test list below).
tests/test_tokenizer.py (rewritten, 16 tests)
- vocab_size >= len(tokenizer) invariant.
- vocab_size rounded to multiple of 128 by default; configurable.
- encode rejects None/non-str; truncates oversized inputs at max_chars.
- encode_with_eos appends EOS when defined, plain encode otherwise.
- encode_with_eos on empty returns empty (no lone EOS emitted).
- Every emitted id is < vocab_size (embedding safety invariant).
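A sketch of what the vectorized-vs-naive equivalence check above can look like. The helper `naive_dispatch` and the `moe_layer.route()` / `moe_layer.dispatch()` entry points are hypothetical stand-ins for the real test's internals.

```python
import torch

def naive_dispatch(x, experts, topk_idx, topk_weight):
    # Reference per-token loop the fast path is checked against.
    # Assumed shapes: x is (N, d), topk_idx / topk_weight are (N, k).
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for k in range(topk_idx.shape[1]):
            e = int(topk_idx[t, k])
            out[t] += topk_weight[t, k] * experts[e](x[t])
    return out

def check_vectorized_matches_naive(moe_layer, tokens):
    # Hypothetical entry points: route() -> (topk_idx, topk_weight), dispatch() -> output.
    topk_idx, topk_weight = moe_layer.route(tokens)
    fast = moe_layer.dispatch(tokens)
    ref = naive_dispatch(tokens, moe_layer.experts, topk_idx, topk_weight)
    torch.testing.assert_close(fast, ref, atol=1e-5, rtol=1e-5)
```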
Summary
Comprehensive remediation of issues found during a multi-dimensional audit of the codebase. Six logical commits that can be reviewed independently:
- `fix(moe+attn)` — vectorized dispatch, wired router bias, numerical stability
- `fix(tokenizer)` — correct vocab sizing + EOS-aware `encode_with_eos`
- `refactor` — move `moda.py` to `open_mythos.experimental`
- `feat(training)` — harden FineWeb-Edu pretraining script
- `chore(deps)` — reconcile pyproject / requirements.txt / training/requirements.txt
- `test` — cover config validation, router bias update, vectorized dispatch

Headline fixes
1. `router_bias` was silently inert (correctness bug)

Upstream defines `MoEFFN.router_bias` as a buffer of zeros but no code path ever updated it — the advertised DeepSeek-V3 aux-loss-free load balancing was not happening. This PR adds:

- `MoEFFN.expert_load` buffer + bincount accumulation on each training forward.
- `MoEFFN.update_bias(speed)` with a sign-based update (magnitude bounded by `speed` regardless of imbalance, per DeepSeek-V3 Eq. 16). Flushes load after each call.
- `OpenMythos.update_router_biases(speed=None, ddp=False)` that walks every MoE layer and all-reduces load across FSDP ranks before updating.
- `model.update_router_biases(ddp=ddp)` is called in the trainer after every successful `optimizer.step()`.

2. No EOS between packed documents (training-quality bug)
The `FineWebEduDataset` packer used `encode()`, which does not append EOS, so concatenated documents flowed into each other with no boundary marker — the model learned spurious cross-document attention. `MythosTokenizer.encode_with_eos()` appends EOS (with the `eos -> bos -> all_special_ids[0] -> None` fallback) and the packer now uses it.

3. Vectorized MoE dispatch (performance)
The naive per-token loop was O(topk * n_experts) Python iterations per forward pass. Replaced with a stable `argsort` + `bincount` + `index_add_` pipeline — O(n_experts) dense ops. Semantic equivalence verified against a reference loop implementation at 1e-5 tolerance (see `tests/test_moe_router.py::test_vectorized_matches_naive_dispatch`).
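A minimal sketch of that pattern, assuming an `(N, d)` tensor of flattened tokens, `(N, k)` top-k routing tensors, and an `experts` ModuleList; the in-tree version differs in detail:

```python
import torch

def vectorized_dispatch(x, experts, topk_idx, topk_weight):
    # Names and shapes are assumptions: x (N, d), topk_idx / topk_weight (N, k).
    N, k = topk_idx.shape
    flat_expert = topk_idx.reshape(-1)                       # (N*k,)
    flat_weight = topk_weight.reshape(-1, 1)                 # (N*k, 1)
    flat_token = torch.arange(N, device=x.device).repeat_interleave(k)

    order = torch.argsort(flat_expert, stable=True)          # group slots by expert
    counts = torch.bincount(flat_expert, minlength=len(experts))
    offsets = torch.cumsum(counts, dim=0)

    out, start = torch.zeros_like(x), 0
    for e, end in enumerate(offsets.tolist()):               # one dense matmul per expert
        sel = order[start:end]
        if sel.numel():
            tok = flat_token[sel]
            out.index_add_(0, tok, flat_weight[sel] * experts[e](x[tok]))
        start = end
    return out
```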
4. Numerical stability

- fp32 softmax upcast in `GQAttention`, `MLAttention`, and `MoEFFN` routers. bf16/fp16 softmax quantizes the tail and collapses attention to one-hot or uniform at long context.
- `clamp_min(1e-9)` on the gate-weight renormalization denominator so a fully underflowed set of topk scores does not divide by zero.
- `log_dt + log_A` clamp lower bound raised from `-20` to `-10`. At `-20`, `exp(-exp(-20))` rounds to exactly `1.0` in fp32 and breaks the strict `ρ(A) < 1` invariant. At `-10` it saturates at `~1 - 4.5e-5`, safely below 1 in fp32.
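Both tweaks are small; a sketch, assuming plain tensor inputs and hypothetical helper names:

```python
import torch

def softmax_fp32(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Upcast to fp32 for the softmax, then cast back to the ambient dtype.
    return torch.softmax(scores.float(), dim=dim).to(scores.dtype)

def renormalize_gate(topk_weight: torch.Tensor) -> torch.Tensor:
    # Clamp the denominator so a fully underflowed set of top-k scores
    # cannot divide by zero during renormalization.
    return topk_weight / topk_weight.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```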
5. `vocab_size` trap (correctness bug)

`MythosTokenizer.vocab_size` was `self.tokenizer.vocab_size`, which for HF tokenizers is the base vocab excluding added specials. Any added special token would index past the model's embedding matrix and trigger a CUDA device-side assert deep into pretraining. The property now returns `len(self.tokenizer)`, rounded up to `vocab_multiple_of` (default `128` for tensor-core-friendly widths).
6. Config validation at construction time

`MythosConfig.__post_init__` now validates 12+ invariants (attn_type enum, dim/n_heads divisibility, GQA grouping, MLA rope even, loop_dim even, MoE sizing, ACT/dropout ranges, LoRA positivity, vocab_size/max_seq_len positivity). Bad configs fail at `MythosConfig(...)` instead of mid-step hours into a run.
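An illustrative, hypothetical dataclass showing the construction-time pattern; the real `MythosConfig` validates many more fields:

```python
from dataclasses import dataclass

@dataclass
class ExampleConfig:
    # Illustration only, not the real MythosConfig.
    dim: int = 1024
    n_heads: int = 16
    vocab_size: int = 50304
    dropout: float = 0.0

    def __post_init__(self) -> None:
        if self.dim % self.n_heads != 0:
            raise ValueError(f"dim ({self.dim}) must be divisible by n_heads ({self.n_heads})")
        if self.vocab_size <= 0:
            raise ValueError("vocab_size must be positive")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")
```

With this pattern, `ExampleConfig(dim=1000, n_heads=16)` raises immediately rather than hours into a run.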
7. Trainer hardening for long runs

- Per-microstep non-finite loss guard and non-finite `grad_norm` guard after clipping.
- `ShardedGradScaler` on the fp16 path.
- `loguru` rotating file sink (100 MB / 7-day retention, gz compressed, per-rank).
- On-device loss accumulation (single `.item()` per log window).
- Directory `fsync` after atomic rename for power-loss durability.
- `cfg.__post_init__()` re-run after mutating `vocab_size` / `max_seq_len`.
8. `moda.py` moved to `open_mythos.experimental`

MoDA (Mixture-of-Depths Attention) is a parallel research line, not part of the canonical Prelude/Recurrent/Coda model. The subpackage boundary makes stability guarantees explicit. Commented-out smoke-test block deleted (dead scaffold, not a test).
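Per the breaking-changes note below, imports move like this:

```python
# Old import path (removed):
# from open_mythos.moda import MoDAModel

# New import path for the experimental research line:
from open_mythos.experimental import MoDAModel
```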
9. Defensive `freqs_cis[:T]` slice in attention modules

Standalone callers (tests, ad-hoc scripts) no longer have to pre-slice the full precomputed RoPE buffer to match the current T. Fixes a pre-existing `test_main.py::TestGQAttention::test_output_shape` crash.

Test plan
- `pytest test_main.py tests/` — 103 passed (0 failed).
- `test_moe_router.py::test_vectorized_matches_naive_dispatch` verifies the fast dispatch path matches a reference loop at 1e-5.
- `test_moe_router.py::test_update_bias_magnitude_bounded_by_speed` verifies the sign-based update even on a 1M-imbalance load.
- `test_config_validation.py` covers 11 distinct `__post_init__` guards.
- `test_tokenizer.py::test_encode_ids_within_vocab` — embedding-safety invariant.
- `python -m py_compile` on every changed file.

Breaking changes
- `MythosConfig` now validates in `__post_init__`. Configs that relied on silently invalid hyperparameter combinations will now raise `ValueError` at construction.
- `MythosTokenizer.vocab_size` may return a larger value than before (base vocab + added specials, rounded to 128). Any code that hardcoded the old value will need to rebuild its embedding matrix.
- `open_mythos.moda` import path → `open_mythos.experimental.moda` (or `from open_mythos.experimental import MoDAModel`). The old import path no longer works.

Honest caveats
- The vectorized-dispatch equivalence check compares outputs only; `expert_load` bookkeeping differs (intentionally, since the fast path accumulates it and the naive reference does not). That's tested separately in `test_expert_load_accumulates_on_forward`.