feat(pretrain): SPEC §82 P1-A — Chinchilla compute-optimal gate warning #1708
Merged
When `apr pretrain --init <apr>` runs, compute the param count N from the init model's arch dims and check it against train tokens D = num_steps × batch_size × seq_length. Per Chinchilla (arXiv:2203.15556), compute-optimal pretraining requires D ≈ 20·N.

Two warning thresholds:
- D < 5·N → SEVERE: model will memorize, not generalize
- D < 20·N → BELOW-OPTIMAL: model has room for more training

Non-fatal — operators may have legitimate reasons to deviate (resume runs, ablation studies). The warning includes a suggested `--num-steps` value to reach 20·N. Triggered only on `--init` paths (from-scratch synthetic runs are exempt — the operator knows what they're doing).

Test plan:
- estimate_param_count() with Qwen2.5-0.5B dims gives within 2× of 494M
- estimator scales appropriately with num_hidden_layers
- 2/2 P1-A tests PASS

Discharges §82 P1-A item (Δship +1, prevention, ~75 LOC).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
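A minimal sketch of this gate in Rust, assuming hypothetical names (`ArchDims`, `estimate_param_count`, `chinchilla_gate`) and a simplified decoder-only parameter formula; the actual ~75 LOC implementation may differ:

```rust
// Sketch of the gate described above. `ArchDims`, the field names, and
// the estimator formula are illustrative assumptions, not the actual
// apr-cli internals.
struct ArchDims {
    vocab_size: u64,
    hidden_size: u64,
    num_hidden_layers: u64,
    intermediate_size: u64,
}

/// Rough decoder-only param count: embedding table plus per-layer
/// attention (q/k/v/o) and gated MLP weights. Ignores GQA, biases,
/// and norms, so it overcounts slightly; fine for an
/// order-of-magnitude gate (gives ~527M for Qwen2.5-0.5B dims,
/// within 2x of the reported 494M).
fn estimate_param_count(a: &ArchDims) -> u64 {
    let embed = a.vocab_size * a.hidden_size;
    let attn = 4 * a.hidden_size * a.hidden_size;
    let mlp = 3 * a.hidden_size * a.intermediate_size;
    embed + a.num_hidden_layers * (attn + mlp)
}

/// Non-fatal Chinchilla gate: warn on stderr when D falls below the
/// 20·N compute-optimal target (arXiv:2203.15556).
fn chinchilla_gate(n: u64, num_steps: u64, batch_size: u64, seq_length: u64) {
    let d = num_steps * batch_size * seq_length; // train tokens D
    if d < 5 * n {
        eprintln!("SEVERE: D = {d} < 5*N = {}: model will memorize, not generalize", 5 * n);
    } else if d < 20 * n {
        eprintln!("BELOW-OPTIMAL: D = {d} < 20*N = {}: room for more training", 20 * n);
    }
    if d < 20 * n {
        let suggested = (20 * n).div_ceil(batch_size * seq_length);
        eprintln!("hint: --num-steps {suggested} reaches the 20*N target");
    }
}
```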
noahgift added a commit that referenced this pull request on May 16, 2026
…, promote P2-C

An external audit applied Chinchilla math (Hoffmann et al. 2022) to the v1.0.0 roadmap's P2-A2 dispatch plan and pre-falsified it BEFORE the run:

- N (Qwen-0.5B init) ≈ 494M params
- §82 P2-A consumed D ≈ 22M tokens
- Chinchilla compute-optimal D = 20·N = 9.88B
- Empirical ratio = 0.04× (catastrophically under-provisioned)
- Full available qwen-v2 corpus (1.24B) only reaches 0.125×

The val_loss=4.71 plateau + repetitive `č č č č` gibberish are the Holtzman et al. 2019 neural text degeneration signature — the binding constraint is data diversity, not compute. P2-A2 (more steps on the same data) cannot break the plateau.

Four engineering actions (audit Rec 1-4):
1. P2-C (widen corpus to > 2B tokens via the-stack-v2 + codeparrot) is now the highest-EV dispatch; P2-A2 is downgraded to fallback.
2. P0-J NEW item: convert the Chinchilla gate from warning (PR #1708) to hard blocker (fail fast at D/N < 10× unless --force-under-provisioned).
3. P1-B/C/P3-A deferred until val_loss < 3.0 (was 4.0). Perplexity > 20 means no zero-shot reasoning capability — wasted eval compute.
4. Methodology lesson #30: a-priori theoretical falsification saves compute. Symmetric complement to #18 predict-then-verify.

Changes:
- audits/albor-370.md — external audit text (preserved verbatim, added by reviewer)
- albor-370m-roadmap.md v2.0.0 — audit-driven reprioritization section, P2-C promoted, P2-A2 downgraded with a pre-falsification notice, P0-J added, P1-B/C deferred to val_loss < 3.0, 4-week plan rewritten (week 1 is data engineering, NOT training dispatch)
- ship-model-2-spec.md §83 — historical record of the pre-falsification, Five-Whys on the EV-rank failure mode, methodology lesson #30 documented, ship % stays at 79 pending P0-I/J + P2-C

Memory:
- feedback_a_priori_theoretical_falsification.md — new lesson #30 with a 4-check pre-flight template for `apr pretrain`
- MEMORY.md index updated

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
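For readability, the audit's figures as one derivation (all values from the message above; note that the 0.04× figure is tokens per parameter, D/N, while the 0.125× figure is measured against the 20·N target):

$$
\begin{aligned}
N &\approx 4.94\times10^{8}, \qquad D_{\text{opt}} = 20N \approx 9.88\times10^{9}\\
D_{\text{run}} &\approx 2.2\times10^{7} \;\Rightarrow\; D_{\text{run}}/N \approx 0.04\times \ (\text{vs. the optimal } 20\times)\\
D_{\text{corpus}} &\approx 1.24\times10^{9} \;\Rightarrow\; D_{\text{corpus}}/D_{\text{opt}} \approx 0.125\times
\end{aligned}
$$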
noahgift added a commit that referenced this pull request on May 16, 2026
…iles (#1710)

* docs(spec): split SHIP-TWO-001 v3.28.0 into per-model + shared files

The 8,468-line ship-two-models-spec.md has accumulated 60+ sections with MODEL-1 and MODEL-2 content interleaved chronologically. Per user request, split into three companion files preserving the original §N section markers verbatim (so cross-references in git history, PR descriptions, memory files, and contracts remain valid).

New layout:

docs/specifications/aprender-train/
├── ship-two-models-spec.md (45-line index, was 8,468)
├── ship-model-1-spec.md (3,399 lines, MODEL-1 specific)
├── ship-model-2-spec.md (3,290 lines, MODEL-2 specific)
└── ship-shared-methodology.md (1,855 lines, foundation + cross-cutting)

Classification:
- MODEL-1 (27 sections): §4 base, §7.1, §12 expedited, §15-§17/§23/§27/§30-§32/§40/§46-§48 SHIP-007 chain, §58 release, §61, §63, §67-§71 SHIP-005, §72 5-AC cascade, §73-§74 LM head, §75 100%, §76 v0.33.0.
- MODEL-2 (28 sections): §5 base, §7.2, §14 Task #132, §19-§20, §22, §24-§25 corpus, §26 P-plan, §33-§35 retrain+distill, §42-§43, §49 pivot, §50-§57 §50.4 cascade, §77-§82 step 5g + P2-A.
- SHARED (17 sections): §1-§3 foundation, §6-§11, §13 retrospective, §18 status snapshot, §36 plain-language, §41/§44/§45 CPU-GPU parity.

Total content: 8,544 lines (original 8,468 + 76 lines of new file headers). Zero content loss verified by section count: 72 classified sections whose line ranges sum to the original file length.

Original v3.28.0 file recoverable from git via:
git show b3ab72f^:docs/specifications/aprender-train/ship-two-models-spec.md

User decisions:
- 3-file layout (two specs + shared appendix)
- Original ship-two-models-spec.md replaced with a 1-page index
- Original §N numbers preserved per file (non-contiguous within each)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): add lineage repo references for MODEL-1 + MODEL-2

Both models originated as standalone GitHub repos before APR-MONO consolidation:
- MODEL-1 → paiml/apr-leaderboard (last commit 2026-04-05) — carries the original 28 distillation contracts that were promoted into the aprender monorepo.
- MODEL-2 → paiml/albor (last commit 2026-04-05) — carries 54/54 authored contracts, the ALB-* ticket system, and the v28/v29 training history (v28 stopped at step 11K, peaked at perplexity 38.53).

Adds a "Lineage" subsection at the top of each per-model spec and a "Repository lineage" table to the index. No content changes; pure historical-provenance documentation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): rename MODEL-1/MODEL-2 to aprender-coder-7b / aprender-coder-370m

Adopt HuggingFace-style descriptive size+role names as the public model identifiers while keeping MODEL-1 / MODEL-2 as stable numeric document IDs (preserved across renames so PR/git/contract cross-references stay valid).

- MODEL-1 → `aprender-coder-7b` (distilled 7B coder teacher)
- MODEL-2 → `aprender-coder-370m` (sovereign 370M Python student)

Codenames (`apr-leaderboard`, `albor`) remain in the lineage tables as the historical repo names.
Per-file changes:
- ship-model-1-spec.md: title + companion-spec links + version 1.0.0→1.1.0
- ship-model-2-spec.md: title + companion-spec links + version 1.0.0→1.1.0
- ship-two-models-spec.md: new "Model identifiers" table + updated spec-layout, repository-lineage, and section-ownership entries

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): clarify aprender-coder-* family naming convention vs HF redistribution slug

A user question surfaced that the "isn't the convention to keep the origin name" intuition only applies to redistribution slugs (where the Unsloth style preserves upstream lineage). At the spec/family level, multi-model authors like Mistral, DeepSeek, Qwen, and Microsoft Phi use a coherent family prefix — `mistral-7b`, `deepseek-coder-1.3b`, `qwen2.5-coder-7b`, `phi-3.5-mini` — that acts as the brand identity. `aprender-coder-7b` and `aprender-coder-370m` follow this convention: family prefix `aprender` + variant `coder` + size suffix.

The HF redistribution slug `paiml/qwen2.5-coder-7b-apache-q4k-v1` keeps upstream Qwen lineage in its name (because MODEL-1 is a quantized/relicensed derivative). Both names point to the same artifact at different levels of identity.

Updated the model identifiers table to:
- add an "HF redistribution slug" column showing the relationship explicitly
- add a "Naming convention" paragraph citing the Mistral/DeepSeek/Qwen/Phi precedent
- add a "Family vs. redistribution" paragraph explaining why both names exist

No content changes — pure clarification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): adopt Unsloth/Bartowski redistributor naming — aprender/{base}-{tags}

User clarified the convention: model authors who redistribute (Unsloth, Bartowski, TheBloke) preserve upstream identity in their family name rather than strip it. The pattern is {org}/{upstream-base}-{license-tag}-{quant-tag}, or {org}/{codename}-{size} for sovereign work.

New family names:
- MODEL-1: aprender/qwen2.5-coder-7b-apache-q4k (was: aprender-coder-7b, which dropped the Qwen lineage incorrectly)
- MODEL-2: aprender/albor-370m (was: aprender-coder-370m; keeps the original albor codename for sovereign work)

Examples cited in the spec:
- unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
- bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
- TheBloke/CodeLlama-7B-Instruct-GGUF

The HF artifact slug (paiml/...-v1) stays as the published handle; the family name (aprender/...) is the spec-level identity. Both refer to the same artifact.

Per-file changes:
- ship-model-1-spec.md: title + name + companion-spec link, v1.1.0→v1.2.0
- ship-model-2-spec.md: title + name + companion-spec link, v1.1.0→v1.2.0
- ship-two-models-spec.md: model identifiers table + naming-convention paragraphs cite the Unsloth/Bartowski/TheBloke precedent; spec-layout, repository-lineage, and section-ownership entries updated

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): add albor-370m-roadmap.md — forward-looking active-work spec

The 3,290-line MODEL-2 spec is the historical record of §5-§82 amendments — authoritative for what happened, but unwieldy as the working doc for what to do next. The §80 prioritized backlog and §82 priority queue are buried. This new file extracts a focused, EV-ranked, ~200-line roadmap for shipping MODEL-2. Sections:

1. Ship goal (HF artifact + HumanEval pass@1 + 10 AC-SHIP2-* falsifiers)
2. Current state (§82 snapshot — val_loss 4.71, best ckpt path, sample quality)
3. AC-SHIP2-* status table (3 DISCHARGED · 1 FUNCTIONAL · 1 UNBLOCKED · 2 PARTIAL · 3 NOT-YET = 79%)
4. Open EV-ranked work queue — P0/P1/P2/P3 with Δship × effort × P(success)
5. Methodology lessons in flight (#24-#29 from §77-§82)
6. Bounded path to 100% with a 4-week shipping plan
7. Compute lanes for the queue (lambda-vector / gx10 / yoga / jetson)
8. How to update this roadmap (move-to-closed pattern, when to amend the full spec)

Index file updated to flag the roadmap as the active-work spec ("read this for what to do next"), distinguishing it from the historical record.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): §83 + roadmap v2.0.0 — external audit pre-falsifies P2-A2, promote P2-C

An external audit applied Chinchilla math (Hoffmann et al. 2022) to the v1.0.0 roadmap's P2-A2 dispatch plan and pre-falsified it BEFORE the run:

- N (Qwen-0.5B init) ≈ 494M params
- §82 P2-A consumed D ≈ 22M tokens
- Chinchilla compute-optimal D = 20·N = 9.88B
- Empirical ratio = 0.04× (catastrophically under-provisioned)
- Full available qwen-v2 corpus (1.24B) only reaches 0.125×

The val_loss=4.71 plateau + repetitive `č č č č` gibberish are the Holtzman et al. 2019 neural text degeneration signature — the binding constraint is data diversity, not compute. P2-A2 (more steps on the same data) cannot break the plateau.

Four engineering actions (audit Rec 1-4):
1. P2-C (widen corpus to > 2B tokens via the-stack-v2 + codeparrot) is now the highest-EV dispatch; P2-A2 is downgraded to fallback.
2. P0-J NEW item: convert the Chinchilla gate from warning (PR #1708) to hard blocker (fail fast at D/N < 10× unless --force-under-provisioned).
3. P1-B/C/P3-A deferred until val_loss < 3.0 (was 4.0). Perplexity > 20 means no zero-shot reasoning capability — wasted eval compute.
4. Methodology lesson #30: a-priori theoretical falsification saves compute. Symmetric complement to #18 predict-then-verify.

Changes:
- audits/albor-370.md — external audit text (preserved verbatim, added by reviewer)
- albor-370m-roadmap.md v2.0.0 — audit-driven reprioritization section, P2-C promoted, P2-A2 downgraded with a pre-falsification notice, P0-J added, P1-B/C deferred to val_loss < 3.0, 4-week plan rewritten (week 1 is data engineering, NOT training dispatch)
- ship-model-2-spec.md §83 — historical record of the pre-falsification, Five-Whys on the EV-rank failure mode, methodology lesson #30 documented, ship % stays at 79 pending P0-I/J + P2-C

Memory:
- feedback_a_priori_theoretical_falsification.md — new lesson #30 with a 4-check pre-flight template for `apr pretrain`
- MEMORY.md index updated
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(roadmap): dispatch §83/v2.0.0 items as pmat work tickets (PMAT-679..689)

Created 11 tickets in docs/roadmaps/roadmap.yaml for the post-audit albor-370m roadmap items:

- PMAT-679 (P0-I, critical): Verify P0-G+P0-H end-to-end — PARTIAL ✓
- PMAT-680 (P0-J, critical): Chinchilla gate hard blocker
- PMAT-681 (P2-C, critical): Widen corpus to >2B tokens (HIGHEST EV)
- PMAT-682 (P2-A2, low): Same-corpus longer run (FALLBACK only)
- PMAT-683 (P2-D, medium): True distillation from MODEL-1
- PMAT-684 (P1-B, medium): HumanEval pass@1 (deferred until val_loss<3.0)
- PMAT-685 (P1-C, medium): Python validity, 100 prompts (deferred)
- PMAT-686 (P3-A, medium): apr inspect --quality ≥ 90 (deferred)
- PMAT-687 (P3-B, medium): apr lint zero High severity — PARTIAL ✓
- PMAT-688 (P3-C, medium): Publish to HuggingFace
- PMAT-689 (P3-D, medium): Post-publish QA + /dogfood

PARTIAL discharges this turn:

PMAT-679 P0-I: P0-G verified live via re-export of epoch-020.apr — the `[P0-G] Padding APR-fallback tokenizer.ggml.tokens: 151643 + 293 placeholders = 151936` message fires, and GGUF metadata + tensor shapes align at 151936. P0-H was NOT exercised on this checkpoint (it was trained BEFORE P0-H landed, so its arch metadata is still LlamaForCausalLM → Qwen2 biases leak as passthrough → llama-cli expected 291, got 219). P0-H verification is deferred to PMAT-681 (P2-C), since exercising it requires a freshly-emitted checkpoint. System memory was critically low (3GB free / 127GB swap exhausted), preventing the rebuild that would have allowed in-flight verification.

PMAT-687 P3-B: apr lint on epoch-020.apr returns 0 errors / 3 warnings / 1 info. Meets the "zero High severity" criterion for AC-SHIP2-008. Open warnings (license, model_card, provenance) require pretrain-side metadata stamping (relates to AC-SHIP2-022) and a model-card authoring step.

Evidence:
- evidence/p0-i-2026-05-16/findings.md (+ 3 raw logs)
- evidence/p3-b-2026-05-16-lint.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
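PMAT-680 asks for exactly the inversion described in the audit commit: the PR #1708 warning becomes a fail-fast error below D/N = 10× unless overridden. A minimal sketch, where only the `--force-under-provisioned` flag is named in the roadmap and everything else is a hypothetical:

```rust
/// Sketch of the P0-J hard blocker. The function name and error
/// plumbing are illustrative; only the override flag comes from the
/// roadmap item itself.
fn chinchilla_blocker(n: u64, d: u64, force_under_provisioned: bool) -> Result<(), String> {
    // D/N in tenths, kept in integer arithmetic to avoid floats.
    let ratio_tenths = d.saturating_mul(10) / n;
    if ratio_tenths < 100 && !force_under_provisioned {
        return Err(format!(
            "D/N = {}.{}x is below the 10x floor (Chinchilla optimum is 20x); \
             pass --force-under-provisioned to proceed anyway",
            ratio_tenths / 10,
            ratio_tenths % 10
        ));
    }
    Ok(())
}
```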
Summary
When `apr pretrain --init <apr>` runs, compute the param count N from the init model's arch dims and check it against train tokens D = num_steps × batch_size × seq_length. Emit a stderr warning when D is below the Chinchilla compute-optimal target (D ≈ 20·N per arXiv:2203.15556).

Non-fatal warning with a suggested `--num-steps` value. Triggered only on `--init` paths.

Discharges §82's P1-A item (Δship +1, prevention, ~75 LOC).
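The suggested value follows directly from the definitions above; assuming ceiling rounding (the PR text does not spell it out), a plausible form is:

$$\texttt{num\_steps}_{\text{suggested}} \;=\; \left\lceil \frac{20\,N}{\texttt{batch\_size} \times \texttt{seq\_length}} \right\rceil$$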
Motivation
SHIP-TWO-001 has spent multiple sessions debugging convergence failures that turned out to be under-training (most recently §82's val_loss=4.7111 plateau on 2700 steps of a 500M-param model — D ≈ 22M tokens vs Chinchilla target 10B = 0.2% of compute-optimal). A startup-time warning surfaces this immediately instead of after a 40-min compute burn.
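The 0.2% figure is just the token ratio against the target, using the numbers quoted above:

$$\frac{D}{20N} \approx \frac{2.2\times10^{7}}{20 \times 5\times10^{8}} = \frac{2.2\times10^{7}}{1.0\times10^{10}} \approx 0.2\%$$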
Test plan
- `cargo test -p apr-cli --lib estimate_param_count` → 2/2 PASS
  - `estimate_param_count_qwen2_05b_within_2x` validates the formula gives ~494M for Qwen2.5-0.5B dims
  - `estimate_param_count_scales_with_layers` validates monotonic scaling with depth
- `cargo build -p apr-cli --bin apr` succeeds
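A sketch of what the within-2× test might look like, reusing the hypothetical `ArchDims`/`estimate_param_count` from the earlier sketch with Qwen2.5-0.5B's published dims (the test name comes from the PR; the body is illustrative):

```rust
#[test]
fn estimate_param_count_qwen2_05b_within_2x() {
    // Qwen2.5-0.5B config: vocab 151936, hidden 896, 24 layers, FFN 4864.
    let dims = ArchDims {
        vocab_size: 151_936,
        hidden_size: 896,
        num_hidden_layers: 24,
        intermediate_size: 4_864,
    };
    let estimate = estimate_param_count(&dims);
    let actual = 494_000_000u64; // reported param count for Qwen2.5-0.5B
    // "Within 2x" in both directions: actual/2 <= estimate <= 2*actual.
    assert!(
        estimate >= actual / 2 && estimate <= actual * 2,
        "estimate {estimate} not within 2x of {actual}"
    );
}
```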
Backward compatibility

Pure additive — only emits to stderr when `--init` is provided AND the ratio is below threshold. No behaviour change for compute-optimal or from-scratch runs.

🤖 Generated with Claude Code