feat(train): SPEC §82 P2-B — warn on corpus wrap-around to expose data starvation by noahgift · Pull Request #1707 · paiml/aprender

noahgift · 2026-05-15T22:13:15Z

Summary

New ShardBatchIter::with_warn_on_wrap_around(true) opt-in flag that prints a stderr line each time the corpus cycles past its last shard.
apr pretrain enables it by default so operators can detect when --num-steps exceeds corpus capacity (per Chinchilla D ≈ 20·N).
Discharges §82's P2-B item (Δship +1, prevention, ~30 LOC + 2 tests).
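
For readers without the diff at hand, the opt-in pattern described above might look roughly like the sketch below. This is not the real `ShardBatchIter`: the `CyclingShards` type, its fields, the `wraps()` accessor, and the warning text are all hypothetical, and only the builder name `with_warn_on_wrap_around` is taken from this PR.

```rust
/// Hypothetical stand-in for a cycling shard reader (NOT the real `ShardBatchIter`).
struct CyclingShards {
    shards: Vec<String>,
    next: usize,
    wraps: u64,
    warn_on_wrap_around: bool,
}

impl CyclingShards {
    /// The warning flag defaults to OFF, so existing callers see no change.
    fn new(shards: Vec<String>) -> Self {
        Self { shards, next: 0, wraps: 0, warn_on_wrap_around: false }
    }

    /// Builder-style opt-in, mirroring the flag name used in this PR.
    fn with_warn_on_wrap_around(mut self, enabled: bool) -> Self {
        self.warn_on_wrap_around = enabled;
        self
    }

    /// Number of times the reader has cycled back to the first shard (assumed accessor).
    fn wraps(&self) -> u64 {
        self.wraps
    }
}

impl Iterator for CyclingShards {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        if self.shards.is_empty() {
            return None;
        }
        // Past the last shard: cycle back to the start and optionally warn.
        if self.next == self.shards.len() {
            self.next = 0;
            self.wraps += 1;
            if self.warn_on_wrap_around {
                // Hypothetical message text; the real stderr line may differ.
                eprintln!(
                    "warning: corpus wrapped around ({} time(s)); run may be data-starved",
                    self.wraps
                );
            }
        }
        let shard = self.shards[self.next].clone();
        self.next += 1;
        Some(shard)
    }
}
```

In this sketch the flag only controls the stderr line. Iteration order and the wrap counter are the same whether or not it is set, which is the property the two new tests are described as checking.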

Motivation

Methodology lesson #29 (Class 3 packaging defects come in waves) and lesson #18 (predict-then-verify) both surface data starvation as a recurring class of MODEL-2 convergence failure. The 9.75 val_loss floor in §49 was eventually traced to 4× corpus memorization (project_2026_04_27_4x_corpus_memorization_disproof.md) — silent wrap-around masked the cause for weeks.

Now: with an 18M-token corpus, 5K steps × batch=16 × seq=512 ≈ 41M tokens are needed, which is about 2.3 passes over the corpus. The reader therefore wraps around twice during the run and prints an explicit stderr line each time.
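
The arithmetic behind this example can be checked with a small sketch. The helper names `tokens_needed` and `wrap_count` are hypothetical and not part of aprender. The sketch counts at token granularity and assumes a wrap is registered only when the reader has to start another pass, so a run that consumes exactly one corpus's worth of tokens counts zero wraps. The real reader works per shard and may differ at the boundaries.

```rust
/// Tokens consumed by a run: steps × batch size × sequence length.
fn tokens_needed(steps: u64, batch: u64, seq_len: u64) -> u64 {
    steps * batch * seq_len
}

/// Number of times a run must restart the corpus from the beginning.
/// Assumption: a wrap counts only when at least one more token is
/// needed after a full pass has been consumed.
fn wrap_count(tokens_needed: u64, corpus_tokens: u64) -> u64 {
    if tokens_needed == 0 || corpus_tokens == 0 {
        return 0;
    }
    (tokens_needed - 1) / corpus_tokens
}
```

With the figures above, `tokens_needed(5_000, 16, 512)` is 40,960,000 (about 41M), and `wrap_count(40_960_000, 18_000_000)` is 2, matching roughly 2.3 passes with two wrap-around warnings.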

Test plan

cargo test -p aprender-train --lib shard_reader → 7/7 PASS
2 new tests: warn_on_wrap_around_does_not_break_iteration + warn_without_wrap_is_inert
5 prior tests unaffected (backward compatible)
cargo build -p apr-cli --bin apr succeeds

Backward compatibility

Default OFF. Tests and library users see no change unless they opt in via .with_warn_on_wrap_around(true). Only apr pretrain enables it.

🤖 Generated with Claude Code

…a starvation

ShardBatchIter::with_warn_on_wrap_around(true) prints a stderr line each time the corpus cycles. apr pretrain enables it by default so operators can detect when --num-steps exceeds the corpus capacity (per Chinchilla scaling: train tokens D ≈ 20·N).

Without this warning, a small corpus silently cycles 5-10× per run and the model memorizes instead of generalizing. The §49 from-scratch pivot empirically observed this on a 9.75 val_loss floor that turned out to be 4× corpus memorization (see project_2026_04_27_4x_corpus_memorization_disproof.md).

Empirical example: 18M-token corpus, 5K-step run, batch=16 seq=512 = 41M training tokens needed → 2.3× wrap-around. The new warning fires twice during the run, alerting the operator before val_loss plateaus.

Backward compatible: default OFF. Tests/library users see no change unless they opt in via .with_warn_on_wrap_around(true).

Test plan:
- 2 new unit tests in shard_reader (with-wrap + without-wrap inert)
- 5 prior tests unaffected
- 7/7 shard_reader::tests pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 15, 2026 22:13

noahgift merged commit f9b64c9 into main May 15, 2026
18 of 21 checks passed

noahgift deleted the feat/p2b-warn-on-wrap-around branch May 15, 2026 22:43
