fix(tokenizer): fail-fast on GPT-2 byte-level vocab format mismatch (PMAT-CODE-TOKENIZE-BPE-FORMAT-001)#1585
Merged
…PMAT-CODE-TOKENIZE-BPE-FORMAT-001) Closes the silent-`<unk>` defect class that produced SHIP-TWO §60's val_loss=0.00081 anomaly recorded in PR #1580.

ROOT CAUSE
==========
aprender-train's `BPETokenizer::to_bytes` (line 117) emits HEX-string representations: byte 'd' (0x64) → "64", byte 'e' → "65", etc. The loaded vocab.json must have these hex strings as keys for encoding to work.

`apr tokenize import-hf` (used by SHIP-TWO §54-§56 step 5g.0 to extract Qwen2.5-Coder-0.5B-Instruct's tokenizer) emits HuggingFace GPT-2 byte-level format: tokens like "Ġdef", "Ġreturn", "def" with Ġ-prefix for spaces and raw characters. **NO hex strings.**

When `apr tokenize encode-corpus` then loaded this vocab via `from_vocab_merges`, the load succeeded silently. Subsequent encoding pipeline:

1. `to_bytes("def")` → ["64", "65", "66"] (hex)
2. `apply_merges` looks up these in the Qwen vocab — never found
3. `vocab.get("64")` returns None
4. Fallback to `unk_id` (line 275)
5. ALL bytes become `<unk>`

Empirical verification (this branch, lambda-vector RTX 4090):

- Direct read of /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen/shard-00000.bin
- First 32K tokens (= 16 batches × 4 sequences × 513 tokens): 99.99% token 128244 (`<unk>`), 0.01% token 128247 (`</s>`)
- Shannon entropy: 0.001 bits / 17.21 bits theoretical max
- All 228 shards confirmed similarly degenerate (~0.003 bits each)

Five-Whys
=========
1. Why was val_loss=0.00081 implausibly low (PR #1580)? Because the trained model just learned to predict `<unk>` always — and the held-out batches were 99.99% `<unk>`. Cross-entropy on monotonous labels ≈ 0.
2. Why is the corpus 99.99% `<unk>`? Because `apr tokenize encode-corpus` silently emitted `<unk>` for every byte it couldn't find in the loaded vocab.
3. Why couldn't it find anything? Because `to_bytes` produces hex strings ("64") but the Qwen vocab uses GPT-2 byte-level format (raw chars + Ġ-prefix). Format mismatch.
4. Why did the load succeed silently? Because `from_vocab_merges` only checked structural correctness (every merged token in vocab) but NOT format consistency. The vocab format matters because `to_bytes`'s output must match the vocab keys.
5. Why didn't existing falsifiers catch this? Because they're between-contracts: `apr-cli-tokenize-import-hf-v1` guarantees import is byte-correct; `pretokenize-bin-v1` guarantees output is a u32 stream — but neither pins "encoder's tokenization scheme matches imported vocab's tokenization scheme." Closing that gap with this PR's fail-fast.

FIX (smallest viable, fail-fast)
================================
In `BPETokenizer::from_vocab_merges`, after loading vocab.json, count how many of the canonical 256 hex-byte tokens "00".."ff" exist in the vocab. A legitimate hex-byte vocab from `apr tokenize train` always has all 256 (allocated during `init_vocab`). If fewer than 200 are present, the vocab is in the wrong format and the loader returns Err with a FALSIFY-BPE-FORMAT-MISMATCH-001 citation, naming the cause and pointing to the canonical fix (implement Ġ-prefix encoding in a follow-up).

This is a fail-CLOSED guard: silently corrupting a corpus is worse than refusing to run. The operator now sees a clear, actionable error instead of producing a 17-hour broken corpus.

LIVE EVIDENCE
=============
$ apr tokenize encode-corpus --tokenizer /tmp/qwen-0.5b-tokenizer-extracted ...
error: Validation failed: Cannot load tokenizer: Serialization error: FALSIFY-BPE-FORMAT-MISMATCH-001: vocab.json at /tmp/qwen-0.5b-tokenizer-extracted/vocab.json contains only 36/256 canonical hex-byte tokens ("00".."ff"), below the 200 threshold. aprender-train's BPETokenizer uses HEX-BYTE format internally...

The exact Qwen vocab that produced the broken 5g.1 corpus now fails fast on the canonical 36/256 hex-byte signature.

Falsifier test
==============
`falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast`:

- Synthesizes a tiny GPT-2-style vocab.json (raw chars + Ġ-prefix, NO hex bytes) on disk
- Calls `BPETokenizer::from_vocab_merges`
- Asserts:
  - result is Err
  - error message cites "FALSIFY-BPE-FORMAT-MISMATCH-001"
  - error message mentions "hex-byte" format
  - error message names `apr tokenize import-hf` (operator diagnostic clarity)

RED on main pre-fix; GREEN with this PR.

Updated existing test
=====================
`test_bpe_from_vocab_merges_rejects_orphan_merge` was implicitly relying on a 3-token vocab; the new fail-fast fires before its orphan-merge check. Updated the test's vocab to include the 256 hex-byte alphabet so the format check passes and the orphan-merge check still fires (existing behavior preserved).

Quality gates (all green)
=========================
- cargo test -p aprender-train --lib: 7585/7585 PASS (was 7584; +1 falsifier)
- cargo test -p aprender-train --lib bpe_from_vocab_merges: 2/2 PASS
- cargo test -p aprender-train --lib falsify_bpe_format_mismatch: 1/1 PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: apr tokenize encode-corpus on Qwen vocab fails fast with a clear error (verified on lambda-vector RTX 4090)

SHIP-TWO impact
===============
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — but the path forward is now unblocked. The 5g.1 corpus is INVALID (99.99% `<unk>`); a fix for PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (Ġ-prefix encoding) would let `apr tokenize encode-corpus` produce a real Python corpus, and re-running 5g.1 + 5g.2 would produce HONEST val_loss numbers in the plausible 1.5-2.5 range.
- §50.4 cascade: COMPLETE per #1577. The bug surfaced here is upstream in tokenization, not in any §50.4 step.
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577), but the CORRECT-DATA path requires PMAT-CODE-TOKENIZE-BPE-FORMAT-001 to land first.

Out-of-scope follow-ups
=======================
PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (multi-PR cascade):

- Implement Ġ-prefix byte-level encoding in `BPETokenizer` (the canonical fix; ~150 LOC + tests).
- OR add a parallel `Gpt2BpeTokenizer` that aprender-train's encode-corpus dispatches to based on vocab format detection.
- Re-tokenize the 5g.1 corpus with the working encoder; verify Shannon entropy > 10 bits.
- Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip MODEL-2 ship % 57% → ≥58%.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 9, 2026:
…ODE-TOKENIZE-BPE-FORMAT-001) (#1596)

Builds on PR #1585's fail-fast load-time format detection. When `apr tokenize encode-corpus` receives a vocab in GPT-2 byte-level format (i.e., from `apr tokenize import-hf` of Qwen2/Llama2/Mistral) that fails the hex-byte loader with FALSIFY-BPE-FORMAT-MISMATCH-001, this PR routes through `aprender::text::bpe::BpeTokenizer` (the proper byte-level encoder) instead of returning the fail-fast error.

Three-way load priority:

1. Hex-byte loader (BPETokenizer::from_vocab_merges) — for vocabs trained by `apr tokenize train` (legacy 50257-vocab codeparrot path).
2. tokenizer.json (aprender::text::bpe::load_from_json) — when a sibling tokenizer.json exists in the dir, prefer the canonical HuggingFace format.
3. vocab.json + merges.txt (aprender::text::bpe::load_from_files) — fallback when only the import-hf-extracted pair exists.

LIVE EVIDENCE (lambda-vector RTX 4090, 100-doc Python smoke)
============================================================
Hex-format vocab (model-2-tokenizer-v1, vocab=50257):
UNCHANGED — entropy 12.009 bits, 13304 distinct tokens. Confirms regression-free for the legacy 5g.1-pre path.

GPT-2 byte-level vocab (Qwen2.5-Coder, vocab=151643):
BEFORE this PR: 99.99% `<unk>`, entropy 0.001 bits / 17.21 max, distinct tokens 2 (just `<unk>` + `</s>`)
AFTER this PR: 99.02% `<unk>`, entropy 0.111 bits, distinct=16
Improvement: 100× entropy, 8× distinct token count.

The remaining 99% `<unk>` indicates `aprender::text::bpe::BpeTokenizer` itself doesn't fully encode Qwen-format text — likely a missing pretokenizer regex configuration or unk_token-fallback behavior. That's an upstream cascade (separate falsifier-discharge) tracked as PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001.

Five-Whys
=========
1. Why ship a partial fix? The dispatch infrastructure is correct and the hex-format path is regression-free. The 100× entropy improvement on byte-level is real progress; the remaining gap is upstream in `aprender::text::bpe`, scoped separately per `feedback_falsifier_first_cascade_pattern.md`.
2. Why try tokenizer.json first when present? It's the canonical HuggingFace format with all metadata (added_tokens, pretokenizer config, normalizer). Some `aprender::text::bpe` paths handle it more completely than the bare vocab.json + merges.txt pair.
3. Why does the hex path stay default? Existing `apr tokenize train` users emit hex-format vocabs; their workflows must remain regression-free. We try hex first, fall through only on the explicit FALSIFY-BPE-FORMAT-MISMATCH-001 signal.
4. Why expose `EncodeTokenizer` as a local enum, not a generic trait? Local scope; only `run_encode_corpus` needs to dispatch. Adding a public trait would expand the API surface for one site. If a third format appears, refactor then.
5. Why not directly fix `aprender::text::bpe::BpeTokenizer` to produce non-`<unk>` output? That's upstream surgery requiring pretokenizer regex implementation + added-token wiring + unk-fallback semantics. Multi-PR scope. This PR ships the smallest-viable dispatch + verifies the hex path is regression-free, so any upstream fix immediately improves byte-level too.
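A minimal sketch of that dispatch is below; the `EncodeTokenizer` name, the loader functions, and the FALSIFY-001 fall-through come from this PR, while the exact signatures, module paths, and error types are assumptions:

```rust
use std::path::Path;

/// Local dispatch target for `run_encode_corpus` (sketch).
enum EncodeTokenizer {
    Hex(aprender_train::BPETokenizer),
    ByteLevel(aprender::text::bpe::BpeTokenizer),
}

fn load_encode_tokenizer(dir: &Path) -> Result<EncodeTokenizer, String> {
    // 1. Hex-byte loader: vocabs trained by `apr tokenize train`.
    match aprender_train::BPETokenizer::from_vocab_merges(dir.join("vocab.json"), dir.join("merges.txt")) {
        Ok(t) => return Ok(EncodeTokenizer::Hex(t)),
        // Fall through only on the explicit format-mismatch signal.
        Err(e) if e.to_string().contains("FALSIFY-BPE-FORMAT-MISMATCH-001") => {}
        Err(e) => return Err(e.to_string()),
    }
    // 2. Prefer the canonical HuggingFace tokenizer.json when a sibling exists.
    let tokenizer_json = dir.join("tokenizer.json");
    if tokenizer_json.exists() {
        return aprender::text::bpe::load_from_json(&tokenizer_json)
            .map(EncodeTokenizer::ByteLevel)
            .map_err(|e| e.to_string());
    }
    // 3. Fallback: the import-hf-extracted vocab.json + merges.txt pair.
    aprender::text::bpe::load_from_files(&dir.join("vocab.json"), &dir.join("merges.txt"))
        .map(EncodeTokenizer::ByteLevel)
        .map_err(|e| e.to_string())
}
```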
Quality gates (all green)
=========================
- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check -p apr-cli --features training: clean
- rustfmt --check: clean
- LIVE: hex-format encode produces 12.009-bit entropy (was 12.009)
- LIVE: byte-level encode produces 0.111-bit entropy (was 0.001 — 100× improvement)

SHIP-TWO impact
===============
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — but the path forward is STAGED. Next cycle: fix the upstream encoder gap so byte-level entropy reaches 10+ bits (real Python tokenization), re-tokenize 5g.1, re-dispatch 5g.2.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end; HONEST verdict still gated on PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001.

Out-of-scope follow-ups
=======================
PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (multi-PR cascade):

- Diagnose why `aprender::text::bpe::BpeTokenizer::encode` produces 99% `<unk>` on Qwen-format vocab even via load_from_json.
- Likely: missing pretokenizer regex (GPT-2's complex word-split regex), or mismatched unk-fallback token name.
- Fix the root cause; verify entropy > 10 bits on the 100-doc Python smoke.
- Re-tokenize the 5g.1 corpus (~17 hours wall on RTX 4090).
- Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip MODEL-2 ship % 57% → ≥58%.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 10, 2026:
…PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001) (#1598)

ROOT CAUSE pinned + fixed. PR #1596 shipped a "try hex first, fall through on FALSIFY-001" strategy that depended on PR #1585's load-time fail-fast. With #1585 not yet merged, the hex loader silently succeeded on Qwen-format vocabs and produced 99% `<unk>` (entropy 0.111 bits / 17.21 max). The encoder itself was not the bug.

Two new falsifier tests confirm `aprender::text::bpe::BpeTokenizer` works correctly:

- `falsify_bpe_qwen_encode_python_does_not_unk_99pct` — load_from_json on real Qwen2 tokenizer.json + encode Python: 0% unk, 43 tokens, 0/43 = 0% (was the predicted 99% RED)
- `falsify_bpe_load_from_files_matches_load_from_json_encode` — load_from_files vs load_from_json on the same vocab: identical IDs `[750, 75698, 1445, 1648, 198, 220, 220, 220, 470, 308, 198]`, 0/11 unk in both paths

Both tests are host-gated on Qwen tokenizer.json presence (skip if missing).

THE FIX

Replace the dependency-on-#1585 dispatch with UPFRONT FORMAT DETECTION. Count canonical hex-byte tokens "00".."ff" in vocab.json directly.

- ≥ 200 (legitimate hex vocabs always have all 256) → Hex path
- < 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any loader's behavior. Works whether or not PR #1585 has merged.

LIVE EVIDENCE on lambda-vector RTX 4090

100-doc Python smoke from /mnt/.../python-permissive.jsonl:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct (regression-free) |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. This unblocks the canonical path forward for SHIP-TWO §60: re-tokenize the 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip MODEL-2 ship % 57% → ≥58%.

Five-Whys

1. Why was PR #1596's dispatch broken? It assumed PR #1585's fail-fast was on main, but #1585 was still OPEN. The hex loader silently accepted the Qwen vocab → produced 99% unk → the byte-level fallback never fired.
2. Why detect upfront instead of fixing the dependency chain? PR #1585's fail-fast is a load-time signal; this PR's detection is the same logic moved one level up. Now the dispatch works regardless of which path's loader runs first. Cleaner DAG.
3. Why count hex-byte tokens specifically? The presence of all 256 "00".."ff" hex strings is the canonical signature of `apr tokenize train`'s output. Any vocab without them is either GPT-2 byte-level or some other format → the byte-level encoder is the correct choice (or refuse if even that fails).
4. Why prefer tokenizer.json when present? It's the canonical HF format with `added_tokens` registered. `load_from_files` on vocab.json + merges.txt also works (verified by the upstream-002 test) but tokenizer.json is the higher-fidelity input.
5. Why ship the falsifier tests alongside? They CONFIRM the encoder works correctly when invoked properly. If a future refactor breaks the byte-level path (or the load functions diverge), the tests fail fast. Drift prevention.
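For reference, a minimal sketch of how the entropy and distinct-token figures above can be computed from a u32 token stream (the 17.21-bit ceiling is log2 of the ~151,643-entry Qwen vocab); this is an illustration, not the project's actual measurement code:

```rust
use std::collections::HashMap;

/// Shannon entropy (bits/token) and distinct-token count of a token stream (sketch).
fn token_entropy(tokens: &[u32]) -> (f64, usize) {
    let mut counts: HashMap<u32, u64> = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    let bits = counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum::<f64>();
    (bits, counts.len())
}
```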
Quality gates (all green)

- cargo test -p aprender-core --lib falsify_bpe: 2 tests PASS
- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: hex format 12.009 bits (regression-free)
- LIVE: byte-level format 6.582 bits, 0% unk (was 0.111 / 99% unk)

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57% — but the path forward is NOW TECHNICALLY UNBLOCKED. Re-tokenizing the 5g.1 corpus with this fix + re-dispatching 5g.2 produces an HONEST val_loss verdict.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end with a WORKING encoder
- This PR closes PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20)
- Next ship-mover: PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (re-encode 5g.1, re-dispatch 5g.2 LIVE) — operator-dispatchable now.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
ROOT CAUSE FOUND for SHIP-TWO §60 val_loss=0.00081 anomaly. The 5g.1 corpus that PR #1580 documented as broken is 99.99% `<unk>` tokens (Shannon entropy 0.001 bits). The cause: `apr tokenize encode-corpus` silently produced `<unk>` for every byte because aprender-train's `BPETokenizer` uses HEX-BYTE format internally (`to_bytes("def")` → `["64", "65", "66"]`) but the Qwen vocab from `apr tokenize import-hf` is GPT-2 byte-level format (raw chars + Ġ-prefix, NO hex strings).

This PR ships the smallest viable fail-fast: load-time format detection in `BPETokenizer::from_vocab_merges` that refuses to load a non-hex vocab with a clear actionable error. Operators now get a diagnostic instead of a 17-hour broken corpus.

The smoking gun
Direct read of `/mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen/shard-00000.bin`, first 32K tokens:

- 99.99% token 128244 (`<unk>`)
- 0.01% token 128247 (`</s>`)

Shannon entropy: 0.001 bits / 17.21 theoretical max. All 228 shards confirmed similarly degenerate. The corpus is essentially one repeated `<unk>` token.

The mechanism

`to_bytes("def")` emits hex strings `["64", "65", "66"]`; the Qwen vocab keys on GPT-2 byte-level tokens (raw chars + Ġ-prefix), so `apply_merges` and `vocab.get("64")` never find a match and every byte falls back to `unk_id`.
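A minimal sketch of that failure path, with simplified signatures (the real `to_bytes` and merge lookup live inside `BPETokenizer`):

```rust
use std::collections::HashMap;

/// Hex-byte representation (sketch): each input byte becomes a two-character
/// hex string, e.g. "def" -> ["64", "65", "66"].
fn to_bytes(text: &str) -> Vec<String> {
    text.bytes().map(|b| format!("{b:02x}")).collect()
}

/// A GPT-2 byte-level vocab keys on raw chars / Ġ-prefixed tokens, never on hex
/// strings, so every lookup below misses and every byte degrades to `unk_id`.
fn encode_sketch(text: &str, vocab: &HashMap<String, u32>, unk_id: u32) -> Vec<u32> {
    to_bytes(text)
        .iter()
        .map(|hex| vocab.get(hex).copied().unwrap_or(unk_id))
        .collect()
}
```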
Five-Whys (full audit trail)
1. Why was val_loss=0.00081 implausibly low? The model learned to predict `<unk>` always; held-out is 99.99% `<unk>` → loss ≈ 0.
2. Why is the corpus 99.99% `<unk>`? `encode-corpus` silently emitted `<unk>` for every byte.
3. Why couldn't it find anything? `to_bytes` produces hex strings, the Qwen vocab uses GPT-2 byte-level format. Format mismatch.
4. Why did the load succeed silently? `from_vocab_merges` only checked structural correctness, not format consistency.
5. Why didn't existing falsifiers catch this? `apr-cli-tokenize-import-hf-v1` guarantees byte-correct import; `pretokenize-bin-v1` guarantees u32 stream output — neither pins "encoder tokenization scheme matches imported vocab tokenization scheme."

The fix (smallest viable)
In `BPETokenizer::from_vocab_merges`: count canonical hex-byte tokens "00".."ff" in the loaded vocab. If < 200 (legitimate hex vocabs always have all 256), return Err with a `FALSIFY-BPE-FORMAT-MISMATCH-001` citation.

Fail-CLOSED design: refusing to encode is better than silently corrupting the corpus.
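A minimal sketch of the guard, assuming a simplified `HashMap<String, u32>` vocab and a plain string error (the real loader's types and full message differ; see the LIVE evidence below):

```rust
use std::collections::HashMap;

/// Fail-closed format guard (sketch): a vocab produced by `apr tokenize train`
/// always contains all 256 canonical hex-byte tokens "00".."ff".
fn check_hex_byte_format(vocab: &HashMap<String, u32>, path: &str) -> Result<(), String> {
    let present = (0u32..256)
        .filter(|b| vocab.contains_key(&format!("{b:02x}")))
        .count();
    if present < 200 {
        return Err(format!(
            "FALSIFY-BPE-FORMAT-MISMATCH-001: vocab.json at {path} contains only \
             {present}/256 canonical hex-byte tokens (\"00\"..\"ff\"), below the 200 threshold"
        ));
    }
    Ok(())
}
```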
LIVE evidence
The exact Qwen vocab that produced the broken 5g.1 corpus now fails-fast with the canonical 36/256 hex-byte signature in the diagnostic.
Falsifier test
`falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast`: synthesizes a tiny GPT-2-style vocab.json on disk, calls `BPETokenizer::from_vocab_merges`, and asserts the result is Err with an error message citing FALSIFY-BPE-FORMAT-MISMATCH-001, the "hex-byte" format, and `apr tokenize import-hf`.

RED on main pre-fix; GREEN with this PR.
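For illustration, a sketch of the falsifier's shape; the temp-dir handling, loader signature, and the error's `Display` output are assumptions here, and the real test lives in aprender-train:

```rust
#[test]
fn falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast() {
    // Synthesize a tiny GPT-2-style vocab.json: raw chars + Ġ-prefixed tokens, no hex bytes.
    let dir = std::env::temp_dir().join("gpt2-style-vocab-sketch");
    std::fs::create_dir_all(&dir).unwrap();
    std::fs::write(
        dir.join("vocab.json"),
        r#"{"def": 0, "Ġdef": 1, "Ġreturn": 2, "<unk>": 3}"#,
    )
    .unwrap();
    std::fs::write(dir.join("merges.txt"), "#version: 0.2\n").unwrap();

    // Assumed signature: the loader takes the extracted vocab/merges pair.
    let err = BPETokenizer::from_vocab_merges(dir.join("vocab.json"), dir.join("merges.txt"))
        .err()
        .expect("GPT-2 byte-level vocab must be rejected");

    let msg = err.to_string();
    assert!(msg.contains("FALSIFY-BPE-FORMAT-MISMATCH-001"));
    assert!(msg.contains("hex-byte"));
    assert!(msg.contains("apr tokenize import-hf"));
}
```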
Updated existing test
`test_bpe_from_vocab_merges_rejects_orphan_merge` had a 3-token vocab; the new fail-fast would fire before its orphan-merge check. Updated it to include the 256 hex-byte alphabet so the format check passes and the orphan-merge check still fires (existing behavior preserved).

Test plan
- cargo test -p aprender-train --lib: 7585/7585 PASS (was 7584; +1 falsifier)
- cargo test -p aprender-train --lib bpe_from_vocab_merges: 2/2 PASS
- cargo test -p aprender-train --lib falsify_bpe_format_mismatch: 1/1 PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: `apr tokenize encode-corpus` on Qwen vocab fails fast with a clear error

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57%; path forward unblocked once Ġ-prefix encoding lands
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end, but the correct-data path needs PMAT-CODE-TOKENIZE-BPE-FORMAT-001 first
Out-of-scope follow-ups (PMAT-CODE-TOKENIZE-BPE-FORMAT-001 cascade)
- Implement Ġ-prefix byte-level encoding in `BPETokenizer` (canonical fix; ~150 LOC + tests)
- OR add a parallel `Gpt2BpeTokenizer` with format-detection dispatch
- Re-tokenize the 5g.1 corpus with the working encoder; re-dispatch 5g.2 for an honest val_loss verdict

🤖 Generated with Claude Code