
feat(pretrain): SPEC §82 P1-A — Chinchilla compute-optimal gate warning #1708

Merged
noahgift merged 3 commits into main from feat/p1a-chinchilla-gate May 16, 2026

Conversation

@noahgift
Contributor

Summary

When `apr pretrain --init <apr>` runs, compute the param count N from the init model's arch dims and check it against the train-token count D = num_steps × batch_size × seq_length. Emit a stderr warning when D is below the Chinchilla compute-optimal target (D ≈ 20·N, per arXiv:2203.15556).

  • D < 5·N → SEVERE: model will memorize, not generalize
  • D < 20·N → BELOW-OPTIMAL: model has room for more training

Non-fatal warning with a suggested `--num-steps` value. Triggered only on `--init` paths.
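
A minimal sketch of that check, with assumed names and message wording (the function name `chinchilla_gate` and the exact output are illustrative, not the apr-cli implementation):

```rust
// Illustrative sketch of the startup gate described above; names, rounding,
// and message wording are assumptions; the real apr-cli code differs.
fn chinchilla_gate(n_params: u64, num_steps: u64, batch_size: u64, seq_length: u64) {
    let d_tokens = num_steps * batch_size * seq_length; // D = steps × batch × seq
    let target = 20 * n_params;                         // Chinchilla: D ≈ 20·N
    if d_tokens >= target {
        return; // at or above compute-optimal: stay silent
    }

    // Suggest a --num-steps that reaches 20·N at the current batch/seq settings.
    let tokens_per_step = batch_size * seq_length;
    let suggested_steps = (target + tokens_per_step - 1) / tokens_per_step;

    if d_tokens < 5 * n_params {
        eprintln!(
            "warning: SEVERE under-training: D = {d_tokens} tokens < 5·N; model will \
             memorize, not generalize. Suggested --num-steps: {suggested_steps}"
        );
    } else {
        eprintln!(
            "warning: below compute-optimal: D = {d_tokens} tokens < 20·N; room for \
             more training. Suggested --num-steps: {suggested_steps}"
        );
    }
    // Non-fatal: warn on stderr only, never abort the run.
}
```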

Discharges §82's P1-A item (Δship +1, prevention, ~75 LOC).

Motivation

SHIP-TWO-001 has spent multiple sessions debugging convergence failures that turned out to be under-training (most recently §82's val_loss=4.7111 plateau on 2700 steps of a 500M-param model — D ≈ 22M tokens vs Chinchilla target 10B = 0.2% of compute-optimal). A startup-time warning surfaces this immediately instead of after a 40-min compute burn.

Test plan

  • cargo test -p apr-cli --lib estimate_param_count → 2/2 PASS
  • estimate_param_count_qwen2_05b_within_2x validates that the formula gives ~494M for Qwen2.5-0.5B dims (see the sketch after this list)
  • estimate_param_count_scales_with_layers validates monotonic scaling with depth
  • cargo build -p apr-cli --bin apr succeeds
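
For context on the ~494M figure above, a rough, self-contained sketch of the Qwen2-style arithmetic such an estimator can use, with the published Qwen2.5-0.5B dims; the real `estimate_param_count` in apr-cli may differ in detail, and the test only requires agreement within 2×:

```rust
// Hypothetical Qwen2-style parameter-count estimate (biases and other small
// terms omitted); the actual apr-cli estimator may compute this differently.
struct ArchDims {
    vocab_size: u64,        // 151_936 for Qwen2.5-0.5B
    hidden_size: u64,       // 896
    num_hidden_layers: u64, // 24
    num_kv_heads: u64,      // 2
    head_dim: u64,          // 64
    intermediate_size: u64, // 4_864
}

fn estimate_params(d: &ArchDims) -> u64 {
    let h = d.hidden_size;
    let kv = d.num_kv_heads * d.head_dim;       // grouped-query K/V width
    let embed = d.vocab_size * h;               // token embeddings (LM head is tied)
    let attn = h * h + h * kv + h * kv + h * h; // q, k, v, o projections
    let mlp = 3 * h * d.intermediate_size;      // gate + up + down
    let per_layer = attn + mlp + 2 * h;         // plus two RMSNorm weights
    embed + d.num_hidden_layers * per_layer + h // plus the final norm
}

fn main() {
    let qwen25_05b = ArchDims {
        vocab_size: 151_936,
        hidden_size: 896,
        num_hidden_layers: 24,
        num_kv_heads: 2,
        head_dim: 64,
        intermediate_size: 4_864,
    };
    println!("{}", estimate_params(&qwen25_05b)); // ≈ 494M, consistent with the test
}
```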

Backward compatibility

Purely additive — only emits to stderr when `--init` is provided AND the ratio is below threshold. No behaviour change for compute-optimal or from-scratch runs.

🤖 Generated with Claude Code

When `apr pretrain --init <apr>` runs, compute the param count N from
the init model's arch dims and check it against train tokens
D = num_steps × batch_size × seq_length.

Per Chinchilla (arXiv:2203.15556), compute-optimal pretraining requires
D ≈ 20·N. Two warning thresholds:

- D < 5·N  → SEVERE: model will memorize, not generalize
- D < 20·N → BELOW-OPTIMAL: model has room for more training

Non-fatal — operators may have legitimate reasons to deviate (resume
runs, ablation studies). The warning includes a suggested `--num-steps`
value to reach 20·N.

Triggered only on --init paths (from-scratch synthetic runs are
exempt — operator knows what they're doing).

Test plan:
- estimate_param_count() with Qwen2.5-0.5B dims lands within 2× of 494M
- estimator scales appropriately with num_hidden_layers
- 2/2 P1-A tests PASS

Discharges §82 P1-A item (Δship +1, prevention, ~75 LOC).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 15, 2026 22:20
noahgift merged commit 1ac3340 into main May 16, 2026
10 checks passed
noahgift deleted the feat/p1a-chinchilla-gate branch May 16, 2026 00:17
noahgift added a commit that referenced this pull request May 16, 2026
…, promote P2-C

An external audit applied Chinchilla math (Hoffmann et al. 2022) to the
v1.0.0 roadmap's P2-A2 dispatch plan and pre-falsified it BEFORE the run:

  N (Qwen-0.5B init) ≈ 494M params
  §82 P2-A consumed D ≈ 22M tokens
  Chinchilla compute-optimal D = 20·N = 9.88B
  Empirical ratio = 0.04×  (catastrophically under-provisioned)
  Full available qwen-v2 corpus (1.24B) only reaches 0.125×
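
The two "×" figures use different baselines, which is easy to misread: 0.04× is D measured against N (22M / 494M), while 0.125× is the corpus measured against the full 20·N target (1.24B / 9.88B). A tiny illustrative check, with values copied from the audit and variable names assumed:

```rust
// Back-of-the-envelope restatement of the audit arithmetic (illustrative only).
fn main() {
    let n: f64 = 494e6;       // params in the Qwen-0.5B init
    let d_used: f64 = 22e6;   // tokens consumed by the §82 P2-A run
    let d_target = 20.0 * n;  // Chinchilla compute-optimal target: 9.88e9 tokens
    let corpus: f64 = 1.24e9; // full available qwen-v2 corpus

    println!("D / N         ≈ {:.3}", d_used / n);       // ~0.045, the audit's 0.04×
    println!("corpus / 20·N ≈ {:.3}", corpus / d_target); // ~0.126, the audit's 0.125×
}
```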

The val_loss=4.71 plateau + repetitive `č č č č` gibberish are the
Holtzman et al. 2019 neural text degeneration signature — binding
constraint is data diversity, not compute. P2-A2 (more steps on same
data) cannot break the plateau.

Four engineering actions (audit Rec 1-4):

1. P2-C (widen corpus to > 2B tokens via the-stack-v2 + codeparrot) is
   now the highest-EV dispatch; P2-A2 is downgraded to fallback.
2. P0-J NEW item: convert Chinchilla gate from warning (PR #1708) to
   hard blocker (fail-fast at D/N < 10× unless --force-under-provisioned).
3. P1-B/C/P3-A deferred until val_loss < 3.0 (was 4.0). Perplexity > 20
   means no zero-shot reasoning capability — wasted eval compute.
4. Methodology lesson #30: a-priori theoretical falsification saves
   compute. Symmetric complement to #18 predict-then-verify.

Changes:

- audits/albor-370.md         — external audit text (preserved verbatim,
                                added by reviewer)
- albor-370m-roadmap.md v2.0.0 — audit-driven reprioritization section,
                                P2-C promoted, P2-A2 downgraded with
                                pre-falsification notice, P0-J added,
                                P1-B/C deferred to val_loss < 3.0,
                                4-week plan rewritten (week 1 is data
                                engineering, NOT training dispatch).
- ship-model-2-spec.md §83    — historical record of the pre-falsification,
                                Five-Whys on EV-rank failure mode,
                                methodology lesson #30 documented,
                                ship % stays at 79 pending P0-I/J + P2-C.

Memory:

- feedback_a_priori_theoretical_falsification.md — new lesson #30 with
  4-check pre-flight template for `apr pretrain`.
- MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 16, 2026
…iles (#1710)

* docs(spec): split SHIP-TWO-001 v3.28.0 into per-model + shared files

The 8,468-line ship-two-models-spec.md has accumulated 60+ sections with
MODEL-1 and MODEL-2 content interleaved chronologically. Per user request,
split into three companion files preserving original §N section markers
verbatim (so cross-references in git history, PR descriptions, memory
files, and contracts remain valid).

New layout:

  docs/specifications/aprender-train/
    ├── ship-two-models-spec.md       (45-line index, was 8468)
    ├── ship-model-1-spec.md          (3399 lines, MODEL-1 specific)
    ├── ship-model-2-spec.md          (3290 lines, MODEL-2 specific)
    └── ship-shared-methodology.md    (1855 lines, foundation + cross-cutting)

Classification:
- MODEL-1 (27 sections): §4 base, §7.1, §12 expedited, §15-§17/§23/§27/
  §30-§32/§40/§46-§48 SHIP-007 chain, §58 release, §61, §63, §67-§71
  SHIP-005, §72 5-AC cascade, §73-§74 LM head, §75 100%, §76 v0.33.0.
- MODEL-2 (28 sections): §5 base, §7.2, §14 Task #132, §19-§20, §22,
  §24-§25 corpus, §26 P-plan, §33-§35 retrain+distill, §42-§43, §49
  pivot, §50-§57 §50.4 cascade, §77-§82 step 5g + P2-A.
- SHARED (17 sections): §1-§3 foundation, §6-§11, §13 retrospective,
  §18 status snapshot, §36 plain-language, §41/§44/§45 CPU-GPU parity.

Total content: 8,544 lines (original 8,468 + 76 lines of new file headers).
Zero content loss verified by section count: the line ranges of the
72 classified sections sum to the original file length.

Original v3.28.0 file recoverable from git via:
  git show b3ab72f^:docs/specifications/aprender-train/ship-two-models-spec.md

User decisions:
- 3-file layout (two specs + shared appendix)
- Original ship-two-models-spec.md replaced with 1-page index
- Original §N numbers preserved per file (non-contiguous within each)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): add lineage repo references for MODEL-1 + MODEL-2

Both models originated as standalone GitHub repos before APR-MONO
consolidation:

- MODEL-1 → paiml/apr-leaderboard (last commit 2026-04-05) — carries
  the original 28 distillation contracts that were promoted into the
  aprender monorepo.
- MODEL-2 → paiml/albor (last commit 2026-04-05) — carries 54/54
  authored contracts, the ALB-* ticket system, and the v28/v29 training
  history (v28 stopped at step 11K, peaked at perplexity 38.53).

Adds a "Lineage" subsection at the top of each per-model spec and a
"Repository lineage" table to the index. No content changes; pure
historical-provenance documentation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): rename MODEL-1/MODEL-2 to aprender-coder-7b / aprender-coder-370m

Adopt HuggingFace-style descriptive size+role names as the public model
identifiers while keeping MODEL-1 / MODEL-2 as stable numeric document
IDs (preserved across renames so PR/git/contract cross-references stay
valid).

- MODEL-1 → `aprender-coder-7b` (distilled 7B coder teacher)
- MODEL-2 → `aprender-coder-370m` (sovereign 370M Python student)

Codenames (`apr-leaderboard`, `albor`) remain in the lineage tables as
the historical repo names.

Per-file changes:
- ship-model-1-spec.md: title + companion-spec links + version 1.0.0→1.1.0
- ship-model-2-spec.md: title + companion-spec links + version 1.0.0→1.1.0
- ship-two-models-spec.md: new "Model identifiers" table + updated
  spec-layout + repository-lineage + section-ownership entries

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): clarify aprender-coder-* family naming convention vs HF redistribution slug

A user question surfaced that the "isn't the convention to keep origin name"
intuition only applies to redistribution slugs (where Unsloth-style preserves
upstream lineage). At the spec/family level, multi-model authors like Mistral,
DeepSeek, Qwen, Microsoft Phi use a coherent family prefix — `mistral-7b`,
`deepseek-coder-1.3b`, `qwen2.5-coder-7b`, `phi-3.5-mini` — that acts as the
brand identity.

`aprender-coder-7b` and `aprender-coder-370m` follow this convention: family
prefix `aprender` + variant `coder` + size suffix. The HF redistribution slug
`paiml/qwen2.5-coder-7b-apache-q4k-v1` keeps upstream Qwen lineage in its name
(because MODEL-1 is a quantized/relicensed derivative). Both names point to
the same artifact at different levels of identity.

Updated the model identifiers table to:
- Add the "HF redistribution slug" column showing the relationship explicitly
- Add a "Naming convention" paragraph citing Mistral/DeepSeek/Qwen/Phi precedent
- Add a "Family vs. redistribution" paragraph explaining why both names exist

No content changes — pure clarification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): adopt Unsloth/Bartowski redistributor naming — aprender/{base}-{tags}

User clarified the convention: model authors who redistribute (Unsloth,
Bartowski, TheBloke) preserve upstream identity in their family name, not
strip it. Pattern is {org}/{upstream-base}-{license-tag}-{quant-tag} or
{org}/{codename}-{size} for sovereign work.

New family names:
- MODEL-1: aprender/qwen2.5-coder-7b-apache-q4k
  (was: aprender-coder-7b — dropped the Qwen lineage incorrectly)
- MODEL-2: aprender/albor-370m
  (was: aprender-coder-370m — kept original albor codename for sovereign work)

Examples cited in the spec:
- unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
- bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
- TheBloke/CodeLlama-7B-Instruct-GGUF

The HF artifact slug (paiml/...-v1) stays as the published handle — the
family name (aprender/...) is the spec-level identity. Both refer to the
same artifact.

Per-file changes:
- ship-model-1-spec.md: title + name + companion-spec link, v1.1.0→v1.2.0
- ship-model-2-spec.md: title + name + companion-spec link, v1.1.0→v1.2.0
- ship-two-models-spec.md: model identifiers table + naming convention
  paragraphs cite the Unsloth/Bartowski/TheBloke precedent; spec-layout,
  repository-lineage, and section-ownership entries updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): add albor-370m-roadmap.md — forward-looking active-work spec

The 3,290-line MODEL-2 spec is the historical record of §5-§82 amendments —
authoritative for what happened, but unwieldy as the working doc for what to
do next. The §80 prioritized backlog and §82 priority queue are buried.

This new file extracts a focused, EV-ranked, ~200-line roadmap for shipping
MODEL-2. Sections:

1. Ship goal (HF artifact + HumanEval pass@1 + 10 AC-SHIP2-* falsifiers)
2. Current state (§82 snapshot — val_loss 4.71, best ckpt path, sample quality)
3. AC-SHIP2-* status table (3 DISCHARGED · 1 FUNCTIONAL · 1 UNBLOCKED · 2 PARTIAL · 3 NOT-YET = 79%)
4. Open EV-ranked work queue — P0/P1/P2/P3 with Δship × effort × P(success)
5. Methodology lessons in flight (#24-#29 from §77-§82)
6. Bounded path to 100% with a 4-week shipping plan
7. Compute lanes for the queue (lambda-vector / gx10 / yoga / jetson)
8. How to update this roadmap (move-to-closed pattern, when to amend the full spec)

Index file updated to flag the roadmap as the active-work spec ("read this for
what to do next") to distinguish from the historical record.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): §83 + roadmap v2.0.0 — external audit pre-falsifies P2-A2, promote P2-C


* docs(roadmap): dispatch §83/v2.0.0 items as pmat work tickets (PMAT-679..689)

Created 10 tickets in docs/roadmaps/roadmap.yaml for the post-audit
albor-370m roadmap items:

  PMAT-679 P0-I  critical  Verify P0-G+P0-H end-to-end          PARTIAL ✓
  PMAT-680 P0-J  critical  Chinchilla gate hard blocker
  PMAT-681 P2-C  critical  Widen corpus to >2B tokens (HIGHEST EV)
  PMAT-682 P2-A2 low       Same-corpus longer run (FALLBACK only)
  PMAT-683 P2-D  medium    True distillation from MODEL-1
  PMAT-684 P1-B  medium    HumanEval pass@1 (deferred until val_loss<3.0)
  PMAT-685 P1-C  medium    Python validity 100 prompts (deferred)
  PMAT-686 P3-A  medium    apr inspect --quality ≥ 90 (deferred)
  PMAT-687 P3-B  medium    apr lint zero High severity         PARTIAL ✓
  PMAT-688 P3-C  medium    Publish to HuggingFace
  PMAT-689 P3-D  medium    Post-publish QA + /dogfood

PARTIAL discharges this turn:

PMAT-679 P0-I: P0-G verified live via re-export of epoch-020.apr — the
`[P0-G] Padding APR-fallback tokenizer.ggml.tokens: 151643 + 293
placeholders = 151936` message fires and GGUF metadata + tensor shapes
align at 151936. P0-H NOT exercised on this checkpoint (it was trained
BEFORE P0-H landed so its arch metadata is still LlamaForCausalLM →
Qwen2 biases leak as passthrough → llama-cli expected 291, got 219).
P0-H verification deferred to PMAT-681 (P2-C) since exercising it
requires a freshly-emitted checkpoint. System memory was critically
low (3GB free / 127GB swap exhausted), preventing the rebuild that
would have allowed in-flight verification.

PMAT-687 P3-B: apr lint on epoch-020.apr returns 0 errors / 3 warnings
/ 1 info. Meets the "zero High severity" criterion for AC-SHIP2-008.
Open warnings (license, model_card, provenance) require pretrain-side
metadata stamping (relates to AC-SHIP2-022) and a model-card author
step.

Evidence:
  evidence/p0-i-2026-05-16/findings.md (+ 3 raw logs)
  evidence/p3-b-2026-05-16-lint.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>