fix(tokenizer): fail-fast on GPT-2 byte-level vocab format mismatch (PMAT-CODE-TOKENIZE-BPE-FORMAT-001)#1585
Merged
…PMAT-CODE-TOKENIZE-BPE-FORMAT-001) Closes the silent-`<unk>` defect class that produced SHIP-TWO §60's val_loss=0.00081 anomaly recorded in PR #1580.

ROOT CAUSE
==========
aprender-train's `BPETokenizer::to_bytes` (line 117) emits HEX-string representations: byte 'd' (0x64) → "64", byte 'e' → "65", etc. The loaded vocab.json must have these hex strings as keys for encoding to work.

`apr tokenize import-hf` (used by SHIP-TWO §54-§56 step 5g.0 to extract Qwen2.5-Coder-0.5B-Instruct's tokenizer) emits HuggingFace GPT-2 byte-level format: tokens like "Ġdef", "Ġreturn", "def" with Ġ-prefix for spaces and raw characters. **NO hex strings.**

When `apr tokenize encode-corpus` then loaded this vocab via `from_vocab_merges`, the load succeeded silently. Subsequent encoding pipeline:

1. `to_bytes("def")` → ["64", "65", "66"] (hex)
2. `apply_merges` looks up these in the Qwen vocab — never found
3. `vocab.get("64")` returns None
4. Fallback to `unk_id` (line 275)
5. ALL bytes become `<unk>`

Empirical verification (this branch, lambda-vector RTX 4090):

- Direct read of /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen/shard-00000.bin
- First 32K tokens (= 16 batches × 4 sequences × 513 tokens): 99.99% token 128244 (`<unk>`), 0.01% token 128247 (`</s>`)
- Shannon entropy: 0.001 bits / 17.21 bits theoretical max
- All 228 shards confirmed similarly degenerate (~0.003 bits each)

Five-Whys
=========
1. Why was val_loss=0.00081 implausibly low (PR #1580)? Because the trained model just learned to predict `<unk>` always — and the held-out batches were 99.99% `<unk>`. Cross-entropy on monotonous labels ≈ 0.
2. Why is the corpus 99.99% `<unk>`? Because `apr tokenize encode-corpus` silently emitted `<unk>` for every byte it couldn't find in the loaded vocab.
3. Why couldn't it find anything? Because `to_bytes` produces hex strings ("64") but the Qwen vocab uses GPT-2 byte-level format (raw chars + Ġ-prefix). Format mismatch.
4. Why did the load succeed silently? Because `from_vocab_merges` only checked structural correctness (every merged token in vocab) but NOT format consistency. The vocab format matters because `to_bytes`'s output must match the vocab keys.
5. Why didn't existing falsifiers catch this? Because they're between-contracts: `apr-cli-tokenize-import-hf-v1` guarantees import is byte-correct; `pretokenize-bin-v1` guarantees output is a u32 stream — but neither pins "encoder's tokenization scheme matches imported vocab's tokenization scheme." Closing that gap with this PR's fail-fast.

FIX (smallest viable, fail-fast)
================================
In `BPETokenizer::from_vocab_merges`, after loading vocab.json, count how many of the canonical 256 hex-byte tokens "00".."ff" exist in the vocab. A legitimate hex-byte vocab from `apr tokenize train` always has all 256 (allocated during `init_vocab`). If fewer than 200 are present, the vocab is in the wrong format and the loader returns Err with a FALSIFY-BPE-FORMAT-MISMATCH-001 citation, naming the cause and pointing to the canonical fix (implement Ġ-prefix encoding in a follow-up).

This is a fail-CLOSED guard: silently corrupting a corpus is worse than refusing to run. The operator now sees a clear, actionable error instead of producing a 17-hour broken corpus.

LIVE EVIDENCE
=============
$ apr tokenize encode-corpus --tokenizer /tmp/qwen-0.5b-tokenizer-extracted ...
error: Validation failed: Cannot load tokenizer: Serialization error: FALSIFY-BPE-FORMAT-MISMATCH-001: vocab.json at /tmp/qwen-0.5b-tokenizer-extracted/vocab.json contains only 36/256 canonical hex-byte tokens ("00".."ff"), below the 200 threshold. aprender-train's BPETokenizer uses HEX-BYTE format internally...

The exact Qwen vocab that produced the broken 5g.1 corpus now fails fast on the canonical 36/256 hex-byte signature.

Falsifier test
==============
`falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast`:

- Synthesizes a tiny GPT-2-style vocab.json (raw chars + Ġ-prefix, NO hex bytes) on disk
- Calls `BPETokenizer::from_vocab_merges`
- Asserts:
  - result is Err
  - error message cites "FALSIFY-BPE-FORMAT-MISMATCH-001"
  - error message mentions "hex-byte" format
  - error message names `apr tokenize import-hf` (operator diagnostic clarity)

RED on main pre-fix; GREEN with this PR.

Updated existing test
=====================
`test_bpe_from_vocab_merges_rejects_orphan_merge` was implicitly relying on a 3-token vocab; the new fail-fast fires before its orphan-merge check. Updated the test's vocab to include the 256 hex-byte alphabet so the format check passes and the orphan-merge check still fires (existing behavior preserved).

Quality gates (all green)
=========================
- cargo test -p aprender-train --lib: 7585/7585 PASS (was 7584; +1 falsifier)
- cargo test -p aprender-train --lib bpe_from_vocab_merges: 2/2 PASS
- cargo test -p aprender-train --lib falsify_bpe_format_mismatch: 1/1 PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: apr tokenize encode-corpus on Qwen vocab fails fast with a clear error (verified on lambda-vector RTX 4090)

SHIP-TWO impact
===============
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — but the path forward is now unblocked. The 5g.1 corpus is INVALID (99.99% `<unk>`); a fix for PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (Ġ-prefix encoding) would let `apr tokenize encode-corpus` produce a real Python corpus, and re-running 5g.1 + 5g.2 would produce HONEST val_loss numbers in the plausible 1.5-2.5 range.
- §50.4 cascade: COMPLETE per #1577. The bug surfaced here is upstream in tokenization, not in any §50.4 step.
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577), but the CORRECT-DATA path requires PMAT-CODE-TOKENIZE-BPE-FORMAT-001 to land first.

Out-of-scope follow-ups
=======================
PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (multi-PR cascade):

- Implement Ġ-prefix byte-level encoding in `BPETokenizer` (the canonical fix; ~150 LOC + tests).
- OR add a parallel `Gpt2BpeTokenizer` that aprender-train's encode-corpus dispatches to based on vocab format detection.
- Re-tokenize the 5g.1 corpus with the working encoder; verify Shannon entropy > 10 bits.
- Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip MODEL-2 ship % 57% → ≥58%.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 9, 2026:
…ODE-TOKENIZE-BPE-FORMAT-001) (#1596)

Builds on PR #1585's fail-fast load-time format detection. When `apr tokenize encode-corpus` receives a vocab in GPT-2 byte-level format (i.e., from `apr tokenize import-hf` of Qwen2/Llama2/Mistral) that fails the hex-byte loader with FALSIFY-BPE-FORMAT-MISMATCH-001, this PR routes through `aprender::text::bpe::BpeTokenizer` (the proper byte-level encoder) instead of returning the fail-fast error.

Three-way load priority:

1. Hex-byte loader (BPETokenizer::from_vocab_merges) — for vocabs trained by `apr tokenize train` (legacy 50257-vocab codeparrot path).
2. tokenizer.json (aprender::text::bpe::load_from_json) — when a sibling tokenizer.json exists in the dir, prefer the canonical HuggingFace format.
3. vocab.json + merges.txt (aprender::text::bpe::load_from_files) — fallback when only the import-hf-extracted pair exists.

LIVE EVIDENCE (lambda-vector RTX 4090, 100-doc Python smoke)
============================================================
Hex-format vocab (model-2-tokenizer-v1, vocab=50257):
UNCHANGED — entropy 12.009 bits, 13304 distinct tokens. Confirms regression-free for the legacy 5g.1-pre path.

GPT-2 byte-level vocab (Qwen2.5-Coder, vocab=151643):
BEFORE this PR: 99.99% `<unk>`, entropy 0.001 bits / 17.21 max, distinct tokens 2 (just `<unk>` + `</s>`)
AFTER this PR: 99.02% `<unk>`, entropy 0.111 bits, distinct=16
Improvement: 100× entropy, 8× distinct token count.

The remaining 99% `<unk>` indicates `aprender::text::bpe::BpeTokenizer` itself doesn't fully encode Qwen-format text — likely a missing pretokenizer regex configuration or unk_token-fallback behavior. That's an upstream cascade (separate falsifier-discharge) tracked as PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001.

Five-Whys
=========
1. Why ship a partial fix? The dispatch infrastructure is correct and the hex-format path is regression-free. The 100× entropy improvement on byte-level is real progress; the remaining gap is upstream in `aprender::text::bpe`, scoped separately per `feedback_falsifier_first_cascade_pattern.md`.
2. Why try tokenizer.json first when present? It's the canonical HuggingFace format with all metadata (added_tokens, pretokenizer config, normalizer). Some `aprender::text::bpe` paths handle it more completely than the bare vocab.json + merges.txt pair.
3. Why does the hex path stay default? Existing `apr tokenize train` users emit hex-format vocabs; their workflows must remain regression-free. We try hex first, fall through only on the explicit FALSIFY-BPE-FORMAT-MISMATCH-001 signal.
4. Why expose `EncodeTokenizer` as a local enum, not a generic trait? Local scope; only `run_encode_corpus` needs to dispatch. Adding a public trait would expand the API surface for one site. If a third format appears, refactor then.
5. Why not directly fix `aprender::text::bpe::BpeTokenizer` to produce non-`<unk>` output? That's upstream surgery requiring pretokenizer regex implementation + added-token wiring + unk-fallback semantics. Multi-PR scope. This PR ships the smallest-viable dispatch + verifies the hex path is regression-free, so any upstream fix immediately improves byte-level too.
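A minimal sketch of that dispatch is below; the `EncodeTokenizer` name, the loader functions, and the FALSIFY-001 fall-through come from this PR, while the exact signatures, module paths, and error types are assumptions:

```rust
use std::path::Path;

/// Local dispatch target for `run_encode_corpus` (sketch).
enum EncodeTokenizer {
    Hex(aprender_train::BPETokenizer),
    ByteLevel(aprender::text::bpe::BpeTokenizer),
}

fn load_encode_tokenizer(dir: &Path) -> Result<EncodeTokenizer, String> {
    // 1. Hex-byte loader: vocabs trained by `apr tokenize train`.
    match aprender_train::BPETokenizer::from_vocab_merges(dir.join("vocab.json"), dir.join("merges.txt")) {
        Ok(t) => return Ok(EncodeTokenizer::Hex(t)),
        // Fall through only on the explicit format-mismatch signal.
        Err(e) if e.to_string().contains("FALSIFY-BPE-FORMAT-MISMATCH-001") => {}
        Err(e) => return Err(e.to_string()),
    }
    // 2. Prefer the canonical HuggingFace tokenizer.json when a sibling exists.
    let tokenizer_json = dir.join("tokenizer.json");
    if tokenizer_json.exists() {
        return aprender::text::bpe::load_from_json(&tokenizer_json)
            .map(EncodeTokenizer::ByteLevel)
            .map_err(|e| e.to_string());
    }
    // 3. Fallback: the import-hf-extracted vocab.json + merges.txt pair.
    aprender::text::bpe::load_from_files(&dir.join("vocab.json"), &dir.join("merges.txt"))
        .map(EncodeTokenizer::ByteLevel)
        .map_err(|e| e.to_string())
}
```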
Quality gates (all green)
=========================
- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check -p apr-cli --features training: clean
- rustfmt --check: clean
- LIVE: hex-format encode produces 12.009-bit entropy (was 12.009)
- LIVE: byte-level encode produces 0.111-bit entropy (was 0.001 — 100× improvement)

SHIP-TWO impact
===============
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — but the path forward is STAGED. Next cycle: fix the upstream encoder gap so byte-level entropy reaches 10+ bits (real Python tokenization), re-tokenize 5g.1, re-dispatch 5g.2.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end; HONEST verdict still gated on PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001.

Out-of-scope follow-ups
=======================
PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (multi-PR cascade):

- Diagnose why `aprender::text::bpe::BpeTokenizer::encode` produces 99% `<unk>` on Qwen-format vocab even via load_from_json.
- Likely: missing pretokenizer regex (GPT-2's complex word-split regex), or mismatched unk-fallback token name.
- Fix the root cause; verify entropy > 10 bits on the 100-doc Python smoke.
- Re-tokenize the 5g.1 corpus (~17 hours wall on RTX 4090).
- Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip MODEL-2 ship % 57% → ≥58%.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 10, 2026:
…PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001) (#1598)

ROOT CAUSE pinned + fixed. PR #1596 shipped a "try hex first, fall through on FALSIFY-001" strategy that depended on PR #1585's load-time fail-fast. With #1585 not yet merged, the hex loader silently succeeded on Qwen-format vocabs and produced 99% `<unk>` (entropy 0.111 bits / 17.21 max). The encoder itself was not the bug.

Two new falsifier tests confirm `aprender::text::bpe::BpeTokenizer` works correctly:

- `falsify_bpe_qwen_encode_python_does_not_unk_99pct` — load_from_json on real Qwen2 tokenizer.json + encode Python: 0% unk, 43 tokens, 0/43 = 0% (was the predicted 99% RED)
- `falsify_bpe_load_from_files_matches_load_from_json_encode` — load_from_files vs load_from_json on the same vocab: identical IDs `[750, 75698, 1445, 1648, 198, 220, 220, 220, 470, 308, 198]`, 0/11 unk in both paths

Both tests are host-gated on Qwen tokenizer.json presence (skip if missing).

THE FIX

Replace the dependency-on-#1585 dispatch with UPFRONT FORMAT DETECTION. Count canonical hex-byte tokens "00".."ff" in vocab.json directly.

- ≥ 200 (legitimate hex vocabs always have all 256) → Hex path
- < 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any loader's behavior. Works whether or not PR #1585 has merged.

LIVE EVIDENCE on lambda-vector RTX 4090

100-doc Python smoke from /mnt/.../python-permissive.jsonl:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct (regression-free) |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. This unblocks the canonical path forward for SHIP-TWO §60: re-tokenize the 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip MODEL-2 ship % 57% → ≥58%.

Five-Whys

1. Why was PR #1596's dispatch broken? It assumed PR #1585's fail-fast was on main, but #1585 was still OPEN. The hex loader silently accepted the Qwen vocab → produced 99% unk → the byte-level fallback never fired.
2. Why detect upfront instead of fixing the dependency chain? PR #1585's fail-fast is a load-time signal; this PR's detection is the same logic moved one level up. Now the dispatch works regardless of which path's loader runs first. Cleaner DAG.
3. Why count hex-byte tokens specifically? The presence of all 256 "00".."ff" hex strings is the canonical signature of `apr tokenize train`'s output. Any vocab without them is either GPT-2 byte-level or some other format → the byte-level encoder is the correct choice (or refuse if even that fails).
4. Why prefer tokenizer.json when present? It's the canonical HF format with `added_tokens` registered. `load_from_files` on vocab.json + merges.txt also works (verified by the upstream-002 test) but tokenizer.json is the higher-fidelity input.
5. Why ship the falsifier tests alongside? They CONFIRM the encoder works correctly when invoked properly. If a future refactor breaks the byte-level path (or the load functions diverge), the tests fail fast. Drift prevention.
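For reference, a minimal sketch of how the entropy and distinct-token figures above can be computed from a u32 token stream (the 17.21-bit ceiling is log2 of the ~151,643-entry Qwen vocab); this is an illustration, not the project's actual measurement code:

```rust
use std::collections::HashMap;

/// Shannon entropy (bits/token) and distinct-token count of a token stream (sketch).
fn token_entropy(tokens: &[u32]) -> (f64, usize) {
    let mut counts: HashMap<u32, u64> = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    let bits = counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum::<f64>();
    (bits, counts.len())
}
```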
Quality gates (all green)

- cargo test -p aprender-core --lib falsify_bpe: 2 tests PASS
- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: hex format 12.009 bits (regression-free)
- LIVE: byte-level format 6.582 bits, 0% unk (was 0.111 / 99% unk)

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57% — but the path forward is NOW TECHNICALLY UNBLOCKED. Re-tokenizing the 5g.1 corpus with this fix + re-dispatching 5g.2 produces an HONEST val_loss verdict.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end with a WORKING encoder
- This PR closes PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20)
- Next ship-mover: PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (re-encode 5g.1, re-dispatch 5g.2 LIVE) — operator-dispatchable now.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
ROOT CAUSE FOUND for SHIP-TWO §60 val_loss=0.00081 anomaly. The 5g.1 corpus that PR #1580 documented as broken is 99.99% `<unk>` tokens (Shannon entropy 0.001 bits). The cause: `apr tokenize encode-corpus` silently produced `<unk>` for every byte because aprender-train's `BPETokenizer` uses HEX-BYTE format internally (`to_bytes("def")` → `["64", "65", "66"]`) but the Qwen vocab from `apr tokenize import-hf` is GPT-2 byte-level format (raw chars + Ġ-prefix, NO hex strings).

This PR ships the smallest viable fail-fast: load-time format detection in `BPETokenizer::from_vocab_merges` that refuses to load a non-hex vocab with a clear actionable error. Operators now get a diagnostic instead of a 17-hour broken corpus.

The smoking gun
Direct read of `/mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen/shard-00000.bin`, first 32K tokens:

- 99.99% token 128244 (`<unk>`)
- 0.01% token 128247 (`</s>`)

Shannon entropy: 0.001 bits / 17.21 theoretical max. All 228 shards confirmed similarly degenerate. The corpus is essentially one repeated `<unk>` token.

The mechanism

`to_bytes("def")` emits hex strings `["64", "65", "66"]`; the Qwen vocab keys on GPT-2 byte-level tokens (raw chars + Ġ-prefix), so `apply_merges` and `vocab.get("64")` never find a match and every byte falls back to `unk_id`.
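A minimal sketch of that failure path, with simplified signatures (the real `to_bytes` and merge lookup live inside `BPETokenizer`):

```rust
use std::collections::HashMap;

/// Hex-byte representation (sketch): each input byte becomes a two-character
/// hex string, e.g. "def" -> ["64", "65", "66"].
fn to_bytes(text: &str) -> Vec<String> {
    text.bytes().map(|b| format!("{b:02x}")).collect()
}

/// A GPT-2 byte-level vocab keys on raw chars / Ġ-prefixed tokens, never on hex
/// strings, so every lookup below misses and every byte degrades to `unk_id`.
fn encode_sketch(text: &str, vocab: &HashMap<String, u32>, unk_id: u32) -> Vec<u32> {
    to_bytes(text)
        .iter()
        .map(|hex| vocab.get(hex).copied().unwrap_or(unk_id))
        .collect()
}
```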
Five-Whys (full audit trail)
1. Why was val_loss=0.00081 implausibly low? The model learned to predict `<unk>` always; held-out is 99.99% `<unk>` → loss ≈ 0.
2. Why is the corpus 99.99% `<unk>`? `encode-corpus` silently emitted `<unk>` for every byte.
3. Why couldn't it find anything? `to_bytes` produces hex strings, the Qwen vocab uses GPT-2 byte-level format. Format mismatch.
4. Why did the load succeed silently? `from_vocab_merges` only checked structural correctness, not format consistency.
5. Why didn't existing falsifiers catch this? `apr-cli-tokenize-import-hf-v1` guarantees byte-correct import; `pretokenize-bin-v1` guarantees u32 stream output — neither pins "encoder tokenization scheme matches imported vocab tokenization scheme."

The fix (smallest viable)
In `BPETokenizer::from_vocab_merges`: count canonical hex-byte tokens "00".."ff" in the loaded vocab. If < 200 (legitimate hex vocabs always have all 256), return Err with a `FALSIFY-BPE-FORMAT-MISMATCH-001` citation.

Fail-CLOSED design: refusing to encode is better than silently corrupting the corpus.
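A minimal sketch of the guard, assuming a simplified `HashMap<String, u32>` vocab and a plain string error (the real loader's types and full message differ; see the LIVE evidence below):

```rust
use std::collections::HashMap;

/// Fail-closed format guard (sketch): a vocab produced by `apr tokenize train`
/// always contains all 256 canonical hex-byte tokens "00".."ff".
fn check_hex_byte_format(vocab: &HashMap<String, u32>, path: &str) -> Result<(), String> {
    let present = (0u32..256)
        .filter(|b| vocab.contains_key(&format!("{b:02x}")))
        .count();
    if present < 200 {
        return Err(format!(
            "FALSIFY-BPE-FORMAT-MISMATCH-001: vocab.json at {path} contains only \
             {present}/256 canonical hex-byte tokens (\"00\"..\"ff\"), below the 200 threshold"
        ));
    }
    Ok(())
}
```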
LIVE evidence
The exact Qwen vocab that produced the broken 5g.1 corpus now fails-fast with the canonical 36/256 hex-byte signature in the diagnostic.
Falsifier test
`falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast`: synthesizes a tiny GPT-2-style vocab.json on disk, calls `BPETokenizer::from_vocab_merges`, and asserts the result is Err with an error message citing FALSIFY-BPE-FORMAT-MISMATCH-001, the "hex-byte" format, and `apr tokenize import-hf`.

RED on main pre-fix; GREEN with this PR.
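For illustration, a sketch of the falsifier's shape; the temp-dir handling, loader signature, and the error's `Display` output are assumptions here, and the real test lives in aprender-train:

```rust
#[test]
fn falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast() {
    // Synthesize a tiny GPT-2-style vocab.json: raw chars + Ġ-prefixed tokens, no hex bytes.
    let dir = std::env::temp_dir().join("gpt2-style-vocab-sketch");
    std::fs::create_dir_all(&dir).unwrap();
    std::fs::write(
        dir.join("vocab.json"),
        r#"{"def": 0, "Ġdef": 1, "Ġreturn": 2, "<unk>": 3}"#,
    )
    .unwrap();
    std::fs::write(dir.join("merges.txt"), "#version: 0.2\n").unwrap();

    // Assumed signature: the loader takes the extracted vocab/merges pair.
    let err = BPETokenizer::from_vocab_merges(dir.join("vocab.json"), dir.join("merges.txt"))
        .err()
        .expect("GPT-2 byte-level vocab must be rejected");

    let msg = err.to_string();
    assert!(msg.contains("FALSIFY-BPE-FORMAT-MISMATCH-001"));
    assert!(msg.contains("hex-byte"));
    assert!(msg.contains("apr tokenize import-hf"));
}
```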
Updated existing test
`test_bpe_from_vocab_merges_rejects_orphan_merge` had a 3-token vocab; the new fail-fast would fire before its orphan-merge check. Updated it to include the 256 hex-byte alphabet so the format check passes and the orphan-merge check still fires (existing behavior preserved).

Test plan
- cargo test -p aprender-train --lib: 7585/7585 PASS (was 7584; +1 falsifier)
- cargo test -p aprender-train --lib bpe_from_vocab_merges: 2/2 PASS
- cargo test -p aprender-train --lib falsify_bpe_format_mismatch: 1/1 PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: `apr tokenize encode-corpus` on Qwen vocab fails fast with a clear error

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57%; path forward unblocked once Ġ-prefix encoding lands
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end, but the correct-data path needs PMAT-CODE-TOKENIZE-BPE-FORMAT-001 first
Out-of-scope follow-ups (PMAT-CODE-TOKENIZE-BPE-FORMAT-001 cascade)
- Implement Ġ-prefix byte-level encoding in `BPETokenizer` (canonical fix; ~150 LOC + tests)
- OR add a parallel `Gpt2BpeTokenizer` with format-detection dispatch
- Re-tokenize the 5g.1 corpus with the working encoder; re-dispatch 5g.2 for an honest val_loss verdict

🤖 Generated with Claude Code