
fix(tokenizer): fail-fast on GPT-2 byte-level vocab format mismatch (PMAT-CODE-TOKENIZE-BPE-FORMAT-001) #1585

Merged: noahgift merged 24 commits into main from feat/falsify-bpe-hex-vs-gpt2-format-mismatch on May 13, 2026
Conversation

@noahgift (Contributor) commented May 9, 2026

Summary

ROOT CAUSE FOUND for the SHIP-TWO §60 val_loss=0.00081 anomaly. The 5g.1 corpus that PR #1580 documented as broken is 99.99% `<unk>` tokens (Shannon entropy 0.001 bits). The cause: `apr tokenize encode-corpus` silently produced `<unk>` for every byte, because aprender-train's BPETokenizer uses a HEX-BYTE format internally (`to_bytes("def")` → `["64", "65", "66"]`) while the Qwen vocab from `apr tokenize import-hf` is in GPT-2 byte-level format (raw chars + Ġ-prefix, no hex strings).

This PR ships the smallest viable fail-fast: load-time format detection in BPETokenizer::from_vocab_merges that refuses to load a non-hex vocab with a clear actionable error. Operators now get a diagnostic instead of a 17-hour broken corpus.

The smoking gun

Direct read of the first 32K tokens of /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen/shard-00000.bin:

| Token ID | Count | %      | Identity |
|----------|-------|--------|----------|
| 128244   | 32830 | 99.99% | `<unk>`  |
| 128247   | 2     | 0.01%  | `</s>`   |

Shannon entropy: 0.001 bits against a 17.21-bit theoretical max. All 228 shards confirmed similarly degenerate. The corpus is essentially one repeated `<unk>` token.
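
The entropy figure is reproducible directly from a shard. A minimal sketch, assuming the shards are a headerless stream of little-endian u32 token IDs (the pretokenize-bin-v1 "u32 stream"; endianness and the absence of a header are assumptions here) and an illustrative filename:

```rust
use std::collections::HashMap;
use std::fs;

// Shannon entropy in bits over a token-ID histogram.
fn shannon_entropy_bits(counts: &HashMap<u32, u64>) -> f64 {
    let total: u64 = counts.values().sum();
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

fn main() -> std::io::Result<()> {
    // Illustrative path; substitute the real shard location.
    let bytes = fs::read("shard-00000.bin")?;
    let mut counts: HashMap<u32, u64> = HashMap::new();
    // First 32K tokens (16 batches × 4 sequences × 513 tokens), as above.
    for chunk in bytes.chunks_exact(4).take(32_832) {
        let id = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        *counts.entry(id).or_insert(0) += 1;
    }
    println!("distinct token IDs: {}", counts.len());
    println!("entropy: {:.3} bits", shannon_entropy_bits(&counts));
    Ok(())
}
```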

The mechanism

```
apr tokenize import-hf (Qwen tokenizer.json)
  → vocab.json with GPT-2 byte-level format ("Ġdef", "def", raw chars)

apr tokenize encode-corpus (loads vocab via BPETokenizer::from_vocab_merges)
  → load succeeds silently
  → encoding "def" calls to_bytes() → ["64", "65", "66"] (HEX strings)
  → vocab.get("64") returns None (vocab has "d", not "64")
  → fallback to unk_id
  → ALL tokens become <unk>
```
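
A standalone sketch of that failure mode (vocab entries and IDs are illustrative, not the real BPETokenizer internals):

```rust
use std::collections::HashMap;

// aprender-train-style hex-byte pieces: each byte of the input rendered as "xx".
fn to_hex_bytes(s: &str) -> Vec<String> {
    s.bytes().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    // Illustrative GPT-2 byte-level entries (IDs invented): raw chars, Ġ for a
    // leading space, and no "00".."ff" hex-byte tokens at all.
    let gpt2_style_vocab: HashMap<&str, u32> =
        HashMap::from([("def", 1), ("Ġdef", 2), ("d", 3), ("e", 4), ("f", 5)]);
    let unk_id = 128_244;

    for piece in to_hex_bytes("def") {
        // Lookups for "64", "65", "66" all miss, so every byte falls to <unk>.
        let id = gpt2_style_vocab.get(piece.as_str()).copied().unwrap_or(unk_id);
        println!("{piece} -> {id}");
    }
}
```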

Five-Whys (full audit trail)

  1. Why is val_loss=0.00081 (PR #1580)? The trained model learned to always predict `<unk>`; the held-out data is 99.99% `<unk>`, so loss ≈ 0.
  2. Why is corpus 99.99% <unk>? encode-corpus silently emitted <unk> for every byte.
  3. Why couldn't it find anything? to_bytes produces hex strings, Qwen vocab uses GPT-2 byte-level format. Format mismatch.
  4. Why did load succeed silently? from_vocab_merges only checked structural correctness, not format consistency.
  5. Why didn't existing falsifiers catch this? Between-contracts gap: apr-cli-tokenize-import-hf-v1 guarantees byte-correct import; pretokenize-bin-v1 guarantees u32 stream output — neither pins "encoder tokenization scheme matches imported vocab tokenization scheme."

The fix (smallest viable)

In BPETokenizer::from_vocab_merges: count canonical hex-byte tokens "00".."ff" in the loaded vocab. If < 200 (legitimate hex vocabs always have all 256), return Err with FALSIFY-BPE-FORMAT-MISMATCH-001 citation.

Fail-CLOSED design: refusing to encode is better than silently corrupting the corpus.
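
In sketch form, the guard is a key count plus a fail-closed error; the function name and exact error wording here are illustrative, and the real check lives inside `BPETokenizer::from_vocab_merges` (surfacing as the crate's serialization error):

```rust
use std::collections::HashMap;

/// Hex-byte vocabs written by `apr tokenize train` carry all 256 single-byte
/// tokens "00".."ff"; anything well below that is in another format.
const HEX_BYTE_THRESHOLD: usize = 200;

fn check_hex_byte_format(vocab: &HashMap<String, u32>) -> Result<(), String> {
    let hex_count = (0u16..=255)
        .filter(|b| vocab.contains_key(&format!("{b:02x}")))
        .count();
    if hex_count < HEX_BYTE_THRESHOLD {
        return Err(format!(
            "FALSIFY-BPE-FORMAT-MISMATCH-001: vocab contains only {hex_count}/256 \
             canonical hex-byte tokens; this looks like a GPT-2 byte-level vocab \
             (e.g. from `apr tokenize import-hf`), which the hex-byte encoder \
             would turn into ~100% <unk>"
        ));
    }
    Ok(())
}
```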

LIVE evidence

```
$ apr tokenize encode-corpus --tokenizer /tmp/qwen-0.5b-tokenizer-extracted ...

error: Validation failed: Cannot load tokenizer: Serialization error:
FALSIFY-BPE-FORMAT-MISMATCH-001: vocab.json at /tmp/qwen-0.5b-tokenizer-extracted/vocab.json
contains only 36/256 canonical hex-byte tokens ("00".."ff"), below the 200
threshold. aprender-train's BPETokenizer uses HEX-BYTE format internally
(to_bytes emits "64" for byte 'd', etc.); loading a HuggingFace GPT-2
byte-level vocab (e.g., from `apr tokenize import-hf` of Qwen2/Llama2/Mistral,
which use Ġ-prefix + raw chars) would silently produce 99.99% `<unk>`
tokens during encode (root cause of SHIP-TWO §60 val_loss=0.00081 anomaly)...
```

The exact Qwen vocab that produced the broken 5g.1 corpus now fails fast, with the canonical 36/256 hex-byte signature in the diagnostic.

Falsifier test

falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast:

  • Synthesizes a GPT-2-style vocab.json on disk (no hex bytes)
  • Calls BPETokenizer::from_vocab_merges
  • Asserts: an Err is returned whose message cites the falsifier ID, mentions "hex-byte", and names `apr tokenize import-hf`

RED on main pre-fix; GREEN with this PR.

Updated existing test

test_bpe_from_vocab_merges_rejects_orphan_merge had a 3-token vocab, so the new fail-fast would fire before its orphan-merge check. Updated it to include the 256 hex-byte alphabet so the format check passes and the orphan-merge check still fires (existing behavior preserved).

Test plan

  • cargo test -p aprender-train --lib: 7585/7585 PASS (was 7584; +1 falsifier)
  • cargo test -p aprender-train --lib bpe_from_vocab_merges: 2/2 PASS
  • cargo test -p aprender-train --lib falsify_bpe_format_mismatch: 1/1 PASS
  • cargo clippy -p aprender-train --lib -- -D warnings: clean
  • cargo check --workspace: clean
  • rustfmt --check: clean
  • LIVE: apr tokenize encode-corpus on the Qwen vocab fails fast with a clear error

SHIP-TWO impact

  • MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
  • MODEL-2 ship %: unchanged at 57% — but the path forward is unblocked. Re-tokenizing 5g.1 with a working encoder and re-dispatching 5g.2 will produce an HONEST val_loss. The corpus that produced the §60 anomaly is now CONFIRMED INVALID (not a code bug in eval, but a data bug in encode).
  • §50.4 cascade: COMPLETE per PR #1577. This bug is upstream in tokenization.
  • 17 hours of compute would have been saved by this fail-fast.

Out-of-scope follow-ups (PMAT-CODE-TOKENIZE-BPE-FORMAT-001 cascade)

  • Implement Ġ-prefix byte-level encoding in BPETokenizer (canonical fix; ~150 LOC + tests; see the byte-mapping sketch after this list)
  • OR add parallel Gpt2BpeTokenizer with format-detection dispatch
  • Re-tokenize 5g.1 corpus with working encoder; verify Shannon entropy > 10 bits
  • Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip MODEL-2 ship % 57% → ≥58%
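
For context on the first follow-up: GPT-2 byte-level vocabs remap every raw byte to a printable character, which is why a leading space shows up as Ġ. A sketch of that mapping, following the published GPT-2 reference algorithm (the eventual aprender-train implementation may differ):

```rust
// Build the 256-entry byte → char table used by GPT-2 byte-level BPE.
fn gpt2_byte_to_char() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut included = [false; 256];
    // Printable bytes keep their own codepoint.
    for b in (b'!'..=b'~').chain(0xA1..=0xAC).chain(0xAE..=0xFF) {
        included[b as usize] = true;
        table[b as usize] = char::from_u32(b as u32).unwrap();
    }
    // Everything else (control chars, space, 0x7F, 0xAD, ...) is shifted into
    // U+0100 and up, so space (0x20) becomes 'Ġ' (U+0120).
    let mut offset = 0u32;
    for b in 0..256usize {
        if !included[b] {
            table[b] = char::from_u32(0x100 + offset).unwrap();
            offset += 1;
        }
    }
    table
}

fn main() {
    let table = gpt2_byte_to_char();
    assert_eq!(table[0x20], 'Ġ');
    assert_eq!(table[b'd' as usize], 'd');
    let mapped: String = " def".bytes().map(|b| table[b as usize]).collect();
    println!("{mapped}"); // prints "Ġdef"
}
```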

🤖 Generated with Claude Code

…PMAT-CODE-TOKENIZE-BPE-FORMAT-001)

Closes the silent-`<unk>` defect class that produced SHIP-TWO §60's
val_loss=0.00081 anomaly recorded in PR #1580.

ROOT CAUSE
==========

aprender-train's `BPETokenizer::to_bytes` (line 117) emits HEX-string
representations: byte 'd' (0x64) → "64", byte 'e' → "65", etc. The
loaded vocab.json must have these hex strings as keys for encoding
to work.

`apr tokenize import-hf` (used by SHIP-TWO §54-§56 step 5g.0 to
extract Qwen2.5-Coder-0.5B-Instruct's tokenizer) emits HuggingFace
GPT-2 byte-level format: tokens like "Ġdef", "Ġreturn", "def" with
Ġ-prefix for spaces and raw characters. **NO hex strings.**

When `apr tokenize encode-corpus` then loaded this vocab via
`from_vocab_merges`, the load succeeded silently. Subsequent
encoding pipeline:
  1. `to_bytes("def")` → ["64", "65", "66"] (hex)
  2. `apply_merges` looks up these in Qwen vocab — never found
  3. `vocab.get("64")` returns None
  4. Fallback to `unk_id` (line 275)
  5. ALL bytes become `<unk>`

Empirical verification (this branch, lambda-vector RTX 4090):
  - Direct read of /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen/shard-00000.bin
  - First 32K tokens (= 16 batches × 4 sequences × 513 tokens):
      99.99% token 128244 (`<unk>`)
      0.01% token 128247 (`</s>`)
      Shannon entropy: 0.001 bits / 17.21 bits theoretical max
  - All 228 shards confirmed similarly degenerate (~0.003 bits each)

Five-Whys
=========

1. Why was val_loss=0.00081 implausibly low (PR #1580)? Because the
   trained model just learned to predict `<unk>` always — and the
   held-out batches were 99.99% `<unk>`. cross-entropy on
   monotonous labels ≈ 0.
2. Why is the corpus 99.99% `<unk>`? Because `apr tokenize
   encode-corpus` silently emitted `<unk>` for every byte it
   couldn't find in the loaded vocab.
3. Why couldn't it find anything? Because `to_bytes` produces hex
   strings ("64") but the Qwen vocab uses GPT-2 byte-level format
   (raw chars + Ġ-prefix). Format mismatch.
4. Why did the load succeed silently? Because `from_vocab_merges`
   only checked structural correctness (every merged token in
   vocab) but NOT format consistency. The vocab format matters
   because `to_bytes`'s output must match vocab keys.
5. Why didn't existing falsifiers catch this? Because they're
   between-contracts: `apr-cli-tokenize-import-hf-v1` guarantees
   import is byte-correct; `pretokenize-bin-v1` guarantees output
   is u32 stream — but neither pins "encoder's tokenization scheme
   matches imported vocab's tokenization scheme." Closing that
   gap with this PR's fail-fast.

FIX (smallest viable, fail-fast)
=================================

In `BPETokenizer::from_vocab_merges`, after loading vocab.json,
count how many of the canonical 256 hex-byte tokens "00".."ff"
exist in the vocab. A legitimate hex-byte vocab from `apr tokenize
train` always has all 256 (allocated during `init_vocab`). If
fewer than 200 are present, the vocab is in the wrong format
and the loader returns Err with FALSIFY-BPE-FORMAT-MISMATCH-001
citation, naming the cause and pointing to the canonical fix
(implement Ġ-prefix encoding in a follow-up).

This is a fail-CLOSED guard: silently corrupting a corpus is
worse than refusing to run. The operator now sees a clear actionable
error instead of producing a 17-hour broken corpus.

LIVE EVIDENCE
=============

  $ apr tokenize encode-corpus --tokenizer /tmp/qwen-0.5b-tokenizer-extracted ...

  error: Validation failed: Cannot load tokenizer: Serialization error:
  FALSIFY-BPE-FORMAT-MISMATCH-001: vocab.json at
  /tmp/qwen-0.5b-tokenizer-extracted/vocab.json contains only 36/256
  canonical hex-byte tokens ("00".."ff"), below the 200 threshold.
  aprender-train's BPETokenizer uses HEX-BYTE format internally...

The exact Qwen vocab that produced the broken 5g.1 corpus now
fails fast on the canonical 36/256 hex-byte signature.

Falsifier test
==============

`falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast`:
  - Synthesizes a tiny GPT-2-style vocab.json (raw chars + Ġ-prefix,
    NO hex bytes) on disk
  - Calls `BPETokenizer::from_vocab_merges`
  - Asserts:
    - result is Err
    - error message cites "FALSIFY-BPE-FORMAT-MISMATCH-001"
    - error message mentions "hex-byte" format
    - error message names `apr tokenize import-hf` (operator
      diagnostic clarity)

RED on main pre-fix; GREEN with this PR.

Updated existing test
=====================

`test_bpe_from_vocab_merges_rejects_orphan_merge` was implicitly
relying on a 3-token vocab; the new fail-fast fires before its
orphan-merge check. Updated the test's vocab to include the 256
hex-byte alphabet so the format check passes and the orphan-merge
check still fires (existing behavior preserved).

Quality gates (all green)
==========================

- cargo test -p aprender-train --lib: 7585/7585 PASS (was 7584; +1 falsifier)
- cargo test -p aprender-train --lib bpe_from_vocab_merges: 2/2 PASS
- cargo test -p aprender-train --lib falsify_bpe_format_mismatch: 1/1 PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: apr tokenize encode-corpus on Qwen vocab fails fast with a
  clear error (verified on lambda-vector RTX 4090)

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — but the path forward is now
  unblocked. The 5g.1 corpus is INVALID (99.99% `<unk>`); a fix
  for PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (Ġ-prefix encoding) would
  let `apr tokenize encode-corpus` produce a real Python corpus,
  and re-running 5g.1 + 5g.2 would produce HONEST val_loss
  numbers in the plausible 1.5-2.5 range.
- §50.4 cascade: COMPLETE per #1577. The bug surfaced here is
  upstream in tokenization, not in any §50.4 step.
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) but the
  CORRECT-DATA path requires PMAT-CODE-TOKENIZE-BPE-FORMAT-001
  to land first.

Out-of-scope follow-ups
========================

PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (multi-PR cascade):
  - Implement Ġ-prefix byte-level encoding in `BPETokenizer` (the
    canonical fix; ~150 LOC + tests).
  - OR add a parallel `Gpt2BpeTokenizer` that aprender-train's
    encode-corpus dispatches to based on vocab format detection.
  - Re-tokenize the 5g.1 corpus with the working encoder; verify
    Shannon entropy > 10 bits.
  - Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip
    MODEL-2 ship % 57% → ≥58%.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 9, 2026 12:41
noahgift added a commit that referenced this pull request May 9, 2026
…ODE-TOKENIZE-BPE-FORMAT-001) (#1596)

Builds on PR #1585's fail-fast load-time format detection. When
`apr tokenize encode-corpus` receives a vocab in GPT-2 byte-level
format (i.e., from `apr tokenize import-hf` of Qwen2/Llama2/Mistral)
that fails the hex-byte loader with FALSIFY-BPE-FORMAT-MISMATCH-001,
this PR routes through `aprender::text::bpe::BpeTokenizer` (the
proper byte-level encoder) instead of returning the fail-fast
error.

Three-way load priority (sketched in code after this list):
  1. Hex-byte loader (BPETokenizer::from_vocab_merges) — for
     vocabs trained by `apr tokenize train` (legacy 50257-vocab
     codeparrot path).
  2. tokenizer.json (aprender::text::bpe::load_from_json) — when
     a sibling tokenizer.json exists in the dir, prefer the
     canonical HuggingFace format.
  3. vocab.json + merges.txt (aprender::text::bpe::load_from_files)
     — fallback when only the import-hf-extracted pair exists.
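
A rough sketch of the dispatch shape; `EncodeTokenizer` follows the naming in this commit, while the loader calls are stubbed because the real aprender signatures are not shown here:

```rust
use std::path::{Path, PathBuf};

// Stand-ins for the two real tokenizer types; the real variants wrap
// BPETokenizer and aprender::text::bpe::BpeTokenizer respectively.
enum EncodeTokenizer {
    Hex(PathBuf),
    ByteLevel(PathBuf),
}

fn load_encode_tokenizer(dir: &Path) -> Result<EncodeTokenizer, String> {
    let vocab = dir.join("vocab.json");
    // 1. Hex-byte loader first (legacy `apr tokenize train` vocabs); only the
    //    explicit format-mismatch signal falls through to the byte-level path.
    match try_hex_loader(&vocab) {
        Ok(()) => return Ok(EncodeTokenizer::Hex(vocab)),
        Err(e) if !e.contains("FALSIFY-BPE-FORMAT-MISMATCH-001") => return Err(e),
        Err(_) => {}
    }
    // 2. Prefer a sibling tokenizer.json (canonical HuggingFace format).
    let tokenizer_json = dir.join("tokenizer.json");
    if tokenizer_json.exists() {
        return Ok(EncodeTokenizer::ByteLevel(tokenizer_json));
    }
    // 3. Fall back to the import-hf-extracted vocab.json + merges.txt pair.
    Ok(EncodeTokenizer::ByteLevel(vocab))
}

// Stub: the real code calls BPETokenizer::from_vocab_merges here.
fn try_hex_loader(_vocab: &Path) -> Result<(), String> {
    Err("FALSIFY-BPE-FORMAT-MISMATCH-001".to_string())
}
```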

LIVE EVIDENCE (lambda-vector RTX 4090, 100-doc Python smoke)
=============================================================

  Hex-format vocab (model-2-tokenizer-v1, vocab=50257):
    UNCHANGED — entropy 12.009 bits, 13304 distinct tokens.
    Confirms regression-free for the legacy 5g.1-pre path.

  GPT-2 byte-level vocab (Qwen2.5-Coder, vocab=151643):
    BEFORE this PR: 99.99% `<unk>`, entropy 0.001 bits / 17.21 max,
                    distinct tokens 2 (just `<unk>` + `</s>`)
    AFTER  this PR: 99.02% `<unk>`, entropy 0.111 bits, distinct=16
    Improvement: 100× entropy, 8× distinct token count.

The remaining 99% `<unk>` indicates `aprender::text::bpe::BpeTokenizer`
itself doesn't fully encode Qwen-format text — likely a missing
pretokenizer regex configuration or unk_token-fallback behavior.
That's an upstream cascade (separate falsifier-discharge) tracked
as PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001.

Five-Whys
==========

1. Why ship a partial fix? The dispatch infrastructure is correct
   and the hex-format path is regression-free. The 100× entropy
   improvement on byte-level is real progress; the remaining gap
   is upstream in `aprender::text::bpe`, scoped separately per
   `feedback_falsifier_first_cascade_pattern.md`.
2. Why try tokenizer.json first when present? It's the canonical
   HuggingFace format with all metadata (added_tokens, pretokenizer
   config, normalizer). Some `aprender::text::bpe` paths handle it
   more completely than the bare vocab.json + merges.txt pair.
3. Why does the hex path stay default? Existing `apr tokenize
   train` users emit hex-format vocabs; their workflows must
   remain regression-free. We try hex first, fall through only on
   the explicit FALSIFY-BPE-FORMAT-MISMATCH-001 signal.
4. Why expose `EncodeTokenizer` as a local enum, not a generic
   trait? Local scope; only `run_encode_corpus` needs to dispatch.
   Adding a public trait would expand the API surface for one
   site. If a third format appears, refactor then.
5. Why not directly fix `aprender::text::bpe::BpeTokenizer` to
   produce non-`<unk>` output? That's upstream surgery requiring
   pretokenizer regex implementation + added-token wiring +
   unk-fallback semantics. Multi-PR scope. This PR ships the
   smallest-viable dispatch + verifies hex-path is regression-
   free, so any upstream fix immediately improves byte-level too.

Quality gates (all green)
==========================

- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check -p apr-cli --features training: clean
- rustfmt --check: clean
- LIVE: hex-format encode produces 12.009-bit entropy (was 12.009)
- LIVE: byte-level encode produces 0.111-bit entropy (was 0.001 — 100× improvement)

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — but the path forward is
  STAGED. Next-cycle: fix the upstream encoder gap so byte-level
  entropy reaches 10+ bits (real Python tokenization), re-tokenize
  5g.1, re-dispatch 5g.2.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end; HONEST verdict
  still gated on PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001.

Out-of-scope follow-ups
========================

PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (multi-PR cascade):
  - Diagnose why `aprender::text::bpe::BpeTokenizer::encode` produces
    99% `<unk>` on Qwen-format vocab even via load_from_json.
  - Likely: missing pretokenizer regex (GPT-2's complex word-split
    regex), or mismatched unk-fallback token name.
  - Fix root cause; verify entropy > 10 bits on 100-doc Python smoke.
  - Re-tokenize 5g.1 corpus (~17 hours wall on RTX 4090).
  - Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip
    MODEL-2 ship % 57% → ≥58%.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 10, 2026
…PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001) (#1598)

ROOT CAUSE pinned + fixed.

PR #1596 shipped a "try hex first, fall through on FALSIFY-001"
strategy that depended on PR #1585's load-time fail-fast. With #1585
not yet merged, the hex loader silently succeeded on Qwen-format
vocabs and produced 99% `<unk>` (entropy 0.111 bits / 17.21 max).

The encoder itself was not the bug. Two new falsifier tests confirm
`aprender::text::bpe::BpeTokenizer` works correctly:

  falsify_bpe_qwen_encode_python_does_not_unk_99pct
    — load_from_json on real Qwen2 tokenizer.json + encode Python:
      0/43 tokens (0%) unk, falsifying the predicted 99%-unk RED outcome
  falsify_bpe_load_from_files_matches_load_from_json_encode
    — load_from_files vs load_from_json on same vocab:
      identical IDs `[750, 75698, 1445, 1648, 198, 220, 220, 220, 470, 308, 198]`,
      0/11 unk in both paths

Both tests host-gated on Qwen tokenizer.json presence (skip if missing).

THE FIX

Replace the dependency-on-#1585 dispatch with UPFRONT FORMAT DETECTION.
Count canonical hex-byte tokens "00".."ff" in vocab.json directly.
- ≥ 200 (legitimate hex vocabs always have all 256) → Hex path
- < 200 (HF GPT-2 byte-level vocabs have ~36) → ByteLevel path

Detection runs against vocab.json content, independent of any
loader's behavior. Works whether or not PR #1585 has merged.
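
In sketch form, the detection reduces to a key count over the parsed vocab.json (names here are illustrative, not the apr-cli code):

```rust
use std::collections::HashSet;

// Which encoder path `encode-corpus` should dispatch to for a given vocab.
enum VocabFormat {
    HexByte,   // `apr tokenize train` output: all 256 "00".."ff" tokens present
    ByteLevel, // HF GPT-2 byte-level (import-hf of Qwen2/Llama2/Mistral)
}

fn detect_vocab_format(vocab_keys: &HashSet<String>) -> VocabFormat {
    let hex_count = (0u16..=255)
        .filter(|b| vocab_keys.contains(&format!("{b:02x}")))
        .count();
    // >= 200 -> hex path; HF byte-level vocabs carry only ~36 hex-looking pieces.
    if hex_count >= 200 {
        VocabFormat::HexByte
    } else {
        VocabFormat::ByteLevel
    }
}
```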

LIVE EVIDENCE on lambda-vector RTX 4090

100-doc Python smoke from /mnt/.../python-permissive.jsonl:

| Vocab format | BEFORE this PR | AFTER this PR |
|---|---|---|
| Hex (model-2-tokenizer-v1) | 12.009 bits, 13K distinct | 12.009 bits, 13K distinct (regression-free) |
| GPT-2 byte-level (Qwen) | 0.111 bits, 16 distinct, 99.02% unk | 6.582 bits, 6118 distinct, 0.00% unk |

The Qwen path now correctly produces real Python tokenization. This
unblocks the canonical path forward for SHIP-TWO §60: re-tokenize
the 5g.1 corpus → re-dispatch 5g.2 → honest val_loss → flip
MODEL-2 ship % 57% → ≥58%.

Five-Whys

1. Why was PR #1596's dispatch broken? It assumed PR #1585's
   fail-fast was on main, but #1585 was still OPEN. Hex loader
   silently accepted Qwen vocab → produced 99% unk → byte-level
   fallback never fired.
2. Why detect upfront instead of fixing the dependency chain?
   PR #1585's fail-fast is a load-time signal; this PR's detection
   is the same logic moved one level up. Now the dispatch works
   regardless of which path's loader runs first. Cleaner DAG.
3. Why count hex-byte tokens specifically? The presence of all 256
   "00".."ff" hex strings is the canonical signature of `apr
   tokenize train`'s output. Any vocab without them is either GPT-2
   byte-level or some other format → byte-level encoder is the
   correct choice (or refuse if even that fails).
4. Why prefer tokenizer.json when present? It's the canonical HF
   format with `added_tokens` registered. `load_from_files` on
   vocab.json+merges.txt also works (verified by upstream-002 test)
   but tokenizer.json is the higher-fidelity input.
5. Why ship the falsifier tests alongside? They CONFIRM the
   encoder works correctly when invoked properly. If a future
   refactor breaks the byte-level path (or the load functions
   diverge), the tests fail-fast. Drift prevention.

Quality gates (all green)

- cargo test -p aprender-core --lib falsify_bpe: 2 tests PASS
- cargo test -p apr-cli --features training --lib: 5644/5644 PASS
- cargo clippy -p apr-cli --features training --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: hex format 12.009 bits (regression-free)
- LIVE: byte-level format 6.582 bits, 0% unk (was 0.111 / 99% unk)

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57% — but the path forward is NOW
  TECHNICALLY UNBLOCKED. Re-tokenize 5g.1 corpus with this fix +
  re-dispatch 5g.2 produces a HONEST val_loss verdict.
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end with WORKING encoder
- This PR closes PMAT-CODE-TOKENIZE-BPE-UPSTREAM-001 (task #20)
- Next ship-mover: PMAT-CODE-PRETRAIN-FINETUNE-LIVE-003 (re-encode
  5g.1, re-dispatch 5g.2 LIVE) — operator-dispatchable now.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added 20 commits May 12, 2026 09:55
noahgift merged commit 4d46ae5 into main May 13, 2026 (10 checks passed)
noahgift deleted the feat/falsify-bpe-hex-vs-gpt2-format-mismatch branch May 13, 2026 11:30