feat(pretrain): SPEC §82 P1-A — Chinchilla compute-optimal gate warning #1708
Merged
When `apr pretrain --init <apr>` runs, compute the param count N from the init model's arch dims and check it against train tokens D = num_steps × batch_size × seq_length. Per Chinchilla (arXiv:2203.15556), compute-optimal pretraining requires D ≈ 20·N.

Two warning thresholds:
- D < 5·N → SEVERE: model will memorize, not generalize
- D < 20·N → BELOW-OPTIMAL: model has room for more training

Non-fatal — operators may have legitimate reasons to deviate (resume runs, ablation studies). The warning includes a suggested `--num-steps` value to reach 20·N. Triggered only on `--init` paths (from-scratch synthetic runs are exempt — the operator knows what they're doing).

Test plan:
- estimate_param_count() with Qwen2.5-0.5B dims gives within 2× of 494M
- estimator scales appropriately with num_hidden_layers
- 2/2 P1-A tests PASS

Discharges §82 P1-A item (Δship +1, prevention, ~75 LOC).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
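A minimal sketch of this gate in Rust, assuming hypothetical names (`ArchDims`, `estimate_param_count`, `chinchilla_gate`) and a simplified decoder-only parameter formula; the actual ~75 LOC implementation may differ:

```rust
// Sketch of the gate described above. `ArchDims`, the field names, and
// the estimator formula are illustrative assumptions, not the actual
// apr-cli internals.
struct ArchDims {
    vocab_size: u64,
    hidden_size: u64,
    num_hidden_layers: u64,
    intermediate_size: u64,
}

/// Rough decoder-only param count: embedding table plus per-layer
/// attention (q/k/v/o) and gated MLP weights. Ignores GQA, biases,
/// and norms, so it overcounts slightly; fine for an
/// order-of-magnitude gate (gives ~527M for Qwen2.5-0.5B dims,
/// within 2x of the reported 494M).
fn estimate_param_count(a: &ArchDims) -> u64 {
    let embed = a.vocab_size * a.hidden_size;
    let attn = 4 * a.hidden_size * a.hidden_size;
    let mlp = 3 * a.hidden_size * a.intermediate_size;
    embed + a.num_hidden_layers * (attn + mlp)
}

/// Non-fatal Chinchilla gate: warn on stderr when D falls below the
/// 20·N compute-optimal target (arXiv:2203.15556).
fn chinchilla_gate(n: u64, num_steps: u64, batch_size: u64, seq_length: u64) {
    let d = num_steps * batch_size * seq_length; // train tokens D
    if d < 5 * n {
        eprintln!("SEVERE: D = {d} < 5*N = {}: model will memorize, not generalize", 5 * n);
    } else if d < 20 * n {
        eprintln!("BELOW-OPTIMAL: D = {d} < 20*N = {}: room for more training", 20 * n);
    }
    if d < 20 * n {
        let suggested = (20 * n).div_ceil(batch_size * seq_length);
        eprintln!("hint: --num-steps {suggested} reaches the 20*N target");
    }
}
```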
noahgift added a commit that referenced this pull request on May 16, 2026
…, promote P2-C

An external audit applied Chinchilla math (Hoffmann et al. 2022) to the v1.0.0 roadmap's P2-A2 dispatch plan and pre-falsified it BEFORE the run:

- N (Qwen-0.5B init) ≈ 494M params
- §82 P2-A consumed D ≈ 22M tokens
- Chinchilla compute-optimal D = 20·N = 9.88B
- Empirical ratio = 0.04× (catastrophically under-provisioned)
- Full available qwen-v2 corpus (1.24B) only reaches 0.125×

The val_loss=4.71 plateau + repetitive `č č č č` gibberish are the Holtzman et al. 2019 neural text degeneration signature — the binding constraint is data diversity, not compute. P2-A2 (more steps on the same data) cannot break the plateau.

Four engineering actions (audit Rec 1-4):
1. P2-C (widen corpus to > 2B tokens via the-stack-v2 + codeparrot) is now the highest-EV dispatch; P2-A2 is downgraded to fallback.
2. P0-J NEW item: convert the Chinchilla gate from warning (PR #1708) to hard blocker (fail fast at D/N < 10× unless --force-under-provisioned).
3. P1-B/C/P3-A deferred until val_loss < 3.0 (was 4.0). Perplexity > 20 means no zero-shot reasoning capability — wasted eval compute.
4. Methodology lesson #30: a-priori theoretical falsification saves compute. Symmetric complement to #18 predict-then-verify.

Changes:
- audits/albor-370.md — external audit text (preserved verbatim, added by reviewer)
- albor-370m-roadmap.md v2.0.0 — audit-driven reprioritization section, P2-C promoted, P2-A2 downgraded with a pre-falsification notice, P0-J added, P1-B/C deferred to val_loss < 3.0, 4-week plan rewritten (week 1 is data engineering, NOT training dispatch)
- ship-model-2-spec.md §83 — historical record of the pre-falsification, Five-Whys on the EV-rank failure mode, methodology lesson #30 documented, ship % stays at 79 pending P0-I/J + P2-C

Memory:
- feedback_a_priori_theoretical_falsification.md — new lesson #30 with a 4-check pre-flight template for `apr pretrain`
- MEMORY.md index updated

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
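For readability, the audit's figures as one derivation (all values from the message above; note that the 0.04× figure is tokens per parameter, D/N, while the 0.125× figure is measured against the 20·N target):

$$
\begin{aligned}
N &\approx 4.94\times10^{8}, \qquad D_{\text{opt}} = 20N \approx 9.88\times10^{9}\\
D_{\text{run}} &\approx 2.2\times10^{7} \;\Rightarrow\; D_{\text{run}}/N \approx 0.04\times \ (\text{vs. the optimal } 20\times)\\
D_{\text{corpus}} &\approx 1.24\times10^{9} \;\Rightarrow\; D_{\text{corpus}}/D_{\text{opt}} \approx 0.125\times
\end{aligned}
$$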
noahgift added a commit that referenced this pull request on May 16, 2026
…iles (#1710)

* docs(spec): split SHIP-TWO-001 v3.28.0 into per-model + shared files

The 8,468-line ship-two-models-spec.md has accumulated 60+ sections with MODEL-1 and MODEL-2 content interleaved chronologically. Per user request, split into three companion files preserving the original §N section markers verbatim (so cross-references in git history, PR descriptions, memory files, and contracts remain valid).

New layout:

docs/specifications/aprender-train/
├── ship-two-models-spec.md (45-line index, was 8,468)
├── ship-model-1-spec.md (3,399 lines, MODEL-1 specific)
├── ship-model-2-spec.md (3,290 lines, MODEL-2 specific)
└── ship-shared-methodology.md (1,855 lines, foundation + cross-cutting)

Classification:
- MODEL-1 (27 sections): §4 base, §7.1, §12 expedited, §15-§17/§23/§27/§30-§32/§40/§46-§48 SHIP-007 chain, §58 release, §61, §63, §67-§71 SHIP-005, §72 5-AC cascade, §73-§74 LM head, §75 100%, §76 v0.33.0.
- MODEL-2 (28 sections): §5 base, §7.2, §14 Task #132, §19-§20, §22, §24-§25 corpus, §26 P-plan, §33-§35 retrain+distill, §42-§43, §49 pivot, §50-§57 §50.4 cascade, §77-§82 step 5g + P2-A.
- SHARED (17 sections): §1-§3 foundation, §6-§11, §13 retrospective, §18 status snapshot, §36 plain-language, §41/§44/§45 CPU-GPU parity.

Total content: 8,544 lines (original 8,468 + 76 lines of new file headers). Zero content loss verified by section count: 72 classified sections whose line ranges sum to the original file length.

Original v3.28.0 file recoverable from git via:
git show b3ab72f^:docs/specifications/aprender-train/ship-two-models-spec.md

User decisions:
- 3-file layout (two specs + shared appendix)
- Original ship-two-models-spec.md replaced with a 1-page index
- Original §N numbers preserved per file (non-contiguous within each)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): add lineage repo references for MODEL-1 + MODEL-2

Both models originated as standalone GitHub repos before APR-MONO consolidation:
- MODEL-1 → paiml/apr-leaderboard (last commit 2026-04-05) — carries the original 28 distillation contracts that were promoted into the aprender monorepo.
- MODEL-2 → paiml/albor (last commit 2026-04-05) — carries 54/54 authored contracts, the ALB-* ticket system, and the v28/v29 training history (v28 stopped at step 11K, peaked at perplexity 38.53).

Adds a "Lineage" subsection at the top of each per-model spec and a "Repository lineage" table to the index. No content changes; pure historical-provenance documentation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): rename MODEL-1/MODEL-2 to aprender-coder-7b / aprender-coder-370m

Adopt HuggingFace-style descriptive size+role names as the public model identifiers while keeping MODEL-1 / MODEL-2 as stable numeric document IDs (preserved across renames so PR/git/contract cross-references stay valid).

- MODEL-1 → `aprender-coder-7b` (distilled 7B coder teacher)
- MODEL-2 → `aprender-coder-370m` (sovereign 370M Python student)

Codenames (`apr-leaderboard`, `albor`) remain in the lineage tables as the historical repo names.
Per-file changes:
- ship-model-1-spec.md: title + companion-spec links + version 1.0.0→1.1.0
- ship-model-2-spec.md: title + companion-spec links + version 1.0.0→1.1.0
- ship-two-models-spec.md: new "Model identifiers" table + updated spec-layout, repository-lineage, and section-ownership entries

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): clarify aprender-coder-* family naming convention vs HF redistribution slug

A user question surfaced that the "isn't the convention to keep the origin name" intuition only applies to redistribution slugs (where the Unsloth style preserves upstream lineage). At the spec/family level, multi-model authors like Mistral, DeepSeek, Qwen, and Microsoft Phi use a coherent family prefix — `mistral-7b`, `deepseek-coder-1.3b`, `qwen2.5-coder-7b`, `phi-3.5-mini` — that acts as the brand identity. `aprender-coder-7b` and `aprender-coder-370m` follow this convention: family prefix `aprender` + variant `coder` + size suffix.

The HF redistribution slug `paiml/qwen2.5-coder-7b-apache-q4k-v1` keeps upstream Qwen lineage in its name (because MODEL-1 is a quantized/relicensed derivative). Both names point to the same artifact at different levels of identity.

Updated the model identifiers table to:
- add an "HF redistribution slug" column showing the relationship explicitly
- add a "Naming convention" paragraph citing the Mistral/DeepSeek/Qwen/Phi precedent
- add a "Family vs. redistribution" paragraph explaining why both names exist

No content changes — pure clarification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): adopt Unsloth/Bartowski redistributor naming — aprender/{base}-{tags}

User clarified the convention: model authors who redistribute (Unsloth, Bartowski, TheBloke) preserve upstream identity in their family name rather than strip it. The pattern is {org}/{upstream-base}-{license-tag}-{quant-tag}, or {org}/{codename}-{size} for sovereign work.

New family names:
- MODEL-1: aprender/qwen2.5-coder-7b-apache-q4k (was: aprender-coder-7b, which dropped the Qwen lineage incorrectly)
- MODEL-2: aprender/albor-370m (was: aprender-coder-370m; keeps the original albor codename for sovereign work)

Examples cited in the spec:
- unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
- bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
- TheBloke/CodeLlama-7B-Instruct-GGUF

The HF artifact slug (paiml/...-v1) stays as the published handle; the family name (aprender/...) is the spec-level identity. Both refer to the same artifact.

Per-file changes:
- ship-model-1-spec.md: title + name + companion-spec link, v1.1.0→v1.2.0
- ship-model-2-spec.md: title + name + companion-spec link, v1.1.0→v1.2.0
- ship-two-models-spec.md: model identifiers table + naming-convention paragraphs cite the Unsloth/Bartowski/TheBloke precedent; spec-layout, repository-lineage, and section-ownership entries updated

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): add albor-370m-roadmap.md — forward-looking active-work spec

The 3,290-line MODEL-2 spec is the historical record of §5-§82 amendments — authoritative for what happened, but unwieldy as the working doc for what to do next. The §80 prioritized backlog and §82 priority queue are buried. This new file extracts a focused, EV-ranked, ~200-line roadmap for shipping MODEL-2. Sections:

1. Ship goal (HF artifact + HumanEval pass@1 + 10 AC-SHIP2-* falsifiers)
2. Current state (§82 snapshot — val_loss 4.71, best ckpt path, sample quality)
3. AC-SHIP2-* status table (3 DISCHARGED · 1 FUNCTIONAL · 1 UNBLOCKED · 2 PARTIAL · 3 NOT-YET = 79%)
4. Open EV-ranked work queue — P0/P1/P2/P3 with Δship × effort × P(success)
5. Methodology lessons in flight (#24-#29 from §77-§82)
6. Bounded path to 100% with a 4-week shipping plan
7. Compute lanes for the queue (lambda-vector / gx10 / yoga / jetson)
8. How to update this roadmap (move-to-closed pattern, when to amend the full spec)

Index file updated to flag the roadmap as the active-work spec ("read this for what to do next"), distinguishing it from the historical record.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): §83 + roadmap v2.0.0 — external audit pre-falsifies P2-A2, promote P2-C

An external audit applied Chinchilla math (Hoffmann et al. 2022) to the v1.0.0 roadmap's P2-A2 dispatch plan and pre-falsified it BEFORE the run:

- N (Qwen-0.5B init) ≈ 494M params
- §82 P2-A consumed D ≈ 22M tokens
- Chinchilla compute-optimal D = 20·N = 9.88B
- Empirical ratio = 0.04× (catastrophically under-provisioned)
- Full available qwen-v2 corpus (1.24B) only reaches 0.125×

The val_loss=4.71 plateau + repetitive `č č č č` gibberish are the Holtzman et al. 2019 neural text degeneration signature — the binding constraint is data diversity, not compute. P2-A2 (more steps on the same data) cannot break the plateau.

Four engineering actions (audit Rec 1-4):
1. P2-C (widen corpus to > 2B tokens via the-stack-v2 + codeparrot) is now the highest-EV dispatch; P2-A2 is downgraded to fallback.
2. P0-J NEW item: convert the Chinchilla gate from warning (PR #1708) to hard blocker (fail fast at D/N < 10× unless --force-under-provisioned).
3. P1-B/C/P3-A deferred until val_loss < 3.0 (was 4.0). Perplexity > 20 means no zero-shot reasoning capability — wasted eval compute.
4. Methodology lesson #30: a-priori theoretical falsification saves compute. Symmetric complement to #18 predict-then-verify.

Changes:
- audits/albor-370.md — external audit text (preserved verbatim, added by reviewer)
- albor-370m-roadmap.md v2.0.0 — audit-driven reprioritization section, P2-C promoted, P2-A2 downgraded with a pre-falsification notice, P0-J added, P1-B/C deferred to val_loss < 3.0, 4-week plan rewritten (week 1 is data engineering, NOT training dispatch)
- ship-model-2-spec.md §83 — historical record of the pre-falsification, Five-Whys on the EV-rank failure mode, methodology lesson #30 documented, ship % stays at 79 pending P0-I/J + P2-C

Memory:
- feedback_a_priori_theoretical_falsification.md — new lesson #30 with a 4-check pre-flight template for `apr pretrain`
- MEMORY.md index updated
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(roadmap): dispatch §83/v2.0.0 items as pmat work tickets (PMAT-679..689)

Created 11 tickets in docs/roadmaps/roadmap.yaml for the post-audit albor-370m roadmap items:

- PMAT-679 (P0-I, critical): Verify P0-G+P0-H end-to-end — PARTIAL ✓
- PMAT-680 (P0-J, critical): Chinchilla gate hard blocker
- PMAT-681 (P2-C, critical): Widen corpus to >2B tokens (HIGHEST EV)
- PMAT-682 (P2-A2, low): Same-corpus longer run (FALLBACK only)
- PMAT-683 (P2-D, medium): True distillation from MODEL-1
- PMAT-684 (P1-B, medium): HumanEval pass@1 (deferred until val_loss<3.0)
- PMAT-685 (P1-C, medium): Python validity, 100 prompts (deferred)
- PMAT-686 (P3-A, medium): apr inspect --quality ≥ 90 (deferred)
- PMAT-687 (P3-B, medium): apr lint zero High severity — PARTIAL ✓
- PMAT-688 (P3-C, medium): Publish to HuggingFace
- PMAT-689 (P3-D, medium): Post-publish QA + /dogfood

PARTIAL discharges this turn:

PMAT-679 P0-I: P0-G verified live via re-export of epoch-020.apr — the `[P0-G] Padding APR-fallback tokenizer.ggml.tokens: 151643 + 293 placeholders = 151936` message fires, and GGUF metadata + tensor shapes align at 151936. P0-H was NOT exercised on this checkpoint (it was trained BEFORE P0-H landed, so its arch metadata is still LlamaForCausalLM → Qwen2 biases leak as passthrough → llama-cli expected 291, got 219). P0-H verification is deferred to PMAT-681 (P2-C), since exercising it requires a freshly-emitted checkpoint. System memory was critically low (3GB free / 127GB swap exhausted), preventing the rebuild that would have allowed in-flight verification.

PMAT-687 P3-B: apr lint on epoch-020.apr returns 0 errors / 3 warnings / 1 info. Meets the "zero High severity" criterion for AC-SHIP2-008. Open warnings (license, model_card, provenance) require pretrain-side metadata stamping (relates to AC-SHIP2-022) and a model-card authoring step.

Evidence:
- evidence/p0-i-2026-05-16/findings.md (+ 3 raw logs)
- evidence/p3-b-2026-05-16-lint.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
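PMAT-680 asks for exactly the inversion described in the audit commit: the PR #1708 warning becomes a fail-fast error below D/N = 10× unless overridden. A minimal sketch, where only the `--force-under-provisioned` flag is named in the roadmap and everything else is a hypothetical:

```rust
/// Sketch of the P0-J hard blocker. The function name and error
/// plumbing are illustrative; only the override flag comes from the
/// roadmap item itself.
fn chinchilla_blocker(n: u64, d: u64, force_under_provisioned: bool) -> Result<(), String> {
    // D/N in tenths, kept in integer arithmetic to avoid floats.
    let ratio_tenths = d.saturating_mul(10) / n;
    if ratio_tenths < 100 && !force_under_provisioned {
        return Err(format!(
            "D/N = {}.{}x is below the 10x floor (Chinchilla optimum is 20x); \
             pass --force-under-provisioned to proceed anyway",
            ratio_tenths / 10,
            ratio_tenths % 10
        ));
    }
    Ok(())
}
```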
Summary
When `apr pretrain --init <apr>` runs, compute the param count N from the init model's arch dims and check it against train tokens D = num_steps × batch_size × seq_length. Emit a stderr warning when D is below the Chinchilla compute-optimal target (D ≈ 20·N per arXiv:2203.15556).

Non-fatal warning with a suggested `--num-steps` value. Triggered only on `--init` paths.

Discharges §82's P1-A item (Δship +1, prevention, ~75 LOC).
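The suggested value follows directly from the definitions above; assuming ceiling rounding (the PR text does not spell it out), a plausible form is:

$$\texttt{num\_steps}_{\text{suggested}} \;=\; \left\lceil \frac{20\,N}{\texttt{batch\_size} \times \texttt{seq\_length}} \right\rceil$$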
Motivation
SHIP-TWO-001 has spent multiple sessions debugging convergence failures that turned out to be under-training (most recently §82's val_loss=4.7111 plateau on 2700 steps of a 500M-param model — D ≈ 22M tokens vs Chinchilla target 10B = 0.2% of compute-optimal). A startup-time warning surfaces this immediately instead of after a 40-min compute burn.
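The 0.2% figure is just the token ratio against the target, using the numbers quoted above:

$$\frac{D}{20N} \approx \frac{2.2\times10^{7}}{20 \times 5\times10^{8}} = \frac{2.2\times10^{7}}{1.0\times10^{10}} \approx 0.2\%$$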
Test plan
- `cargo test -p apr-cli --lib estimate_param_count` → 2/2 PASS
  - `estimate_param_count_qwen2_05b_within_2x` validates the formula gives ~494M for Qwen2.5-0.5B dims
  - `estimate_param_count_scales_with_layers` validates monotonic scaling with depth
- `cargo build -p apr-cli --bin apr` succeeds
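A sketch of what the within-2× test might look like, reusing the hypothetical `ArchDims`/`estimate_param_count` from the earlier sketch with Qwen2.5-0.5B's published dims (the test name comes from the PR; the body is illustrative):

```rust
#[test]
fn estimate_param_count_qwen2_05b_within_2x() {
    // Qwen2.5-0.5B config: vocab 151936, hidden 896, 24 layers, FFN 4864.
    let dims = ArchDims {
        vocab_size: 151_936,
        hidden_size: 896,
        num_hidden_layers: 24,
        intermediate_size: 4_864,
    };
    let estimate = estimate_param_count(&dims);
    let actual = 494_000_000u64; // reported param count for Qwen2.5-0.5B
    // "Within 2x" in both directions: actual/2 <= estimate <= 2*actual.
    assert!(
        estimate >= actual / 2 && estimate <= actual * 2,
        "estimate {estimate} not within 2x of {actual}"
    );
}
```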
Backward compatibility

Pure additive — only emits to stderr when `--init` is provided AND the ratio is below threshold. No behaviour change for compute-optimal or from-scratch runs.

🤖 Generated with Claude Code