Add comprehensive Mistral tokenizer support (v0.8.0) #11
Merged
Conversation
Add fine-grained control over JIT compilation and regex backend selection:
- New jit(bool) method to enable/disable JIT compilation for both backends
- Fix pcre2(false) to properly switch back to the regexr backend
- Track backend state with use_jit and use_pcre2 fields
- Preserve the JIT preference when cloning tokenizers
- Add comprehensive tests for backend switching and JIT control
Both the regexr and PCRE2 backends now respect JIT preferences, allowing users to disable JIT for debugging or platform compatibility.
Update the regexr dependency from v0.1.0-beta.4 to v0.1.0-beta.5, which fixes a critical bug where regex match positions could fall inside multi-byte UTF-8 characters (em-dashes, curly quotes), causing panics when tokenizing with o200k_base and cl100k_base. Add comprehensive regression tests for both vocabularies covering:
- Em-dashes and curly quotes at various positions
- Batch encoding with multi-byte characters
- Parallel batch processing (700+ texts)
- Backend consistency between regexr and PCRE2
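The failure mode behind this fix can be illustrated in a few lines: a match position computed in bytes can land in the middle of a multi-byte character such as an em-dash, and splitting there yields invalid UTF-8. This is a stand-alone sketch of the problem, not splintr code:

```python
text = "hello—world"           # the em-dash (U+2014) is 3 bytes in UTF-8
data = text.encode("utf-8")    # b"hello\xe2\x80\x94world"

# Byte offset 6 falls inside the em-dash (bytes 5..8), so it is not a
# valid split point: decoding the prefix fails.
bad_split = 6
try:
    data[:bad_split].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
```

A correct regex backend must only report match boundaries on character boundaries (here, byte offsets 0, 1, ..., 5, 8, ...), which is what the beta.5 fix guarantees.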
Change the repository and homepage URLs from farhan-syah/splintr to ml-rust/splintr to reflect organizational ownership. Updated in:
- Cargo.toml metadata
- README.md clone instructions and citation
- docs/api_guide.md footer links
- docs/special_tokens.md (no URL changes, but staged with related docs)
Implement complete support for Mistral AI tokenizers across three
vocabulary generations, covering all Mistral model families from 7B
to Large 2.
Core Implementation:
- Add SentencePiece mode to Tokenizer for ▁ (U+2581) word boundary handling
- Implement new_sentencepiece() and from_bytes_sentencepiece() constructors
- Add with_full_options() for complete control over tokenizer configuration
- Handle space-to-▁ conversion during encoding and reverse during decoding
- Support duplicate token handling in V2 via explicit decoder
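The space-to-▁ convention above can be sketched in isolation. This is a deliberately simplified stand-in for the behavior described, not the actual splintr implementation (real SentencePiece tokenizers typically also prefix the input with ▁):

```python
WORD_BOUNDARY = "\u2581"  # ▁, SentencePiece's word-boundary marker

def to_sentencepiece(text: str) -> str:
    """Replace spaces with ▁ before matching against the vocabulary."""
    return text.replace(" ", WORD_BOUNDARY)

def from_sentencepiece(pieces: str) -> str:
    """Reverse the conversion when decoding token pieces back to text."""
    return pieces.replace(WORD_BOUNDARY, " ")

roundtrip = from_sentencepiece(to_sentencepiece("hello world"))
```

Because the conversion is applied symmetrically on encode and decode, round-trips are lossless for ordinary text.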
Vocabulary Support:
- mistral_v1 (~32k tokens): Mistral 7B v0.1/v0.2, Mixtral 8x7B
- mistral_v2 (~32,768 tokens): Mistral 7B v0.3, Codestral, Mixtral 8x22B
- mistral_v3 (~131k tokens): Mistral NeMo, Large 2, Pixtral (Tekken)
- All include 54 agent tokens for chat, reasoning, and tool use
Special Token Handling:
- V1: Basic SentencePiece tokens (<unk>, <s>, </s>)
- V2: V1 + control tokens ([INST], [/INST], [TOOL_CALLS], etc.)
- V3: Same control tokens as V2, but Tiktoken-based (not SentencePiece)
Python Bindings:
- Add from_pretrained("mistral_v1"/"mistral_v2"/"mistral_v3")
- Export MISTRAL_V1/V2/V3_AGENT_TOKENS for programmatic access
- Full parity with cl100k_base and llama3 tokenizer APIs
Testing:
- Add comprehensive Python tests for all three Mistral versions
- Add Rust integration tests for V2 edge cases
- Verify encoding/decoding round-trips for all vocabularies
- Test special token handling and agent token IDs
- All 242 tests passing (50 unit + 192 integration tests)
Tooling:
- Add scripts/generate_agent_tokens.py for auto-generating bindings
- Add vocabulary extraction scripts for all Mistral versions
- Eliminate manual synchronization between Rust and Python agent tokens
- Single source of truth in generate_agent_tokens.py
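A single-source-of-truth generator of the kind described might look like the following. This is a hypothetical sketch (token names and IDs are illustrative); the real generate_agent_tokens.py may be structured differently:

```python
# One table drives both bindings, so Rust and Python can never drift apart.
# Token names and IDs below are made up for illustration.
AGENT_TOKENS = {
    "<|im_start|>": 32000,
    "<|im_end|>": 32001,
}

def emit_rust(tokens: dict) -> str:
    """Render the table as a Rust constant."""
    lines = [f'    ("{name}", {tid}),' for name, tid in tokens.items()]
    return "pub const AGENT_TOKENS: &[(&str, u32)] = &[\n" + "\n".join(lines) + "\n];"

def emit_python(tokens: dict) -> str:
    """Render the same table as a Python dict literal."""
    lines = [f'    "{name}": {tid},' for name, tid in tokens.items()]
    return "AGENT_TOKENS = {\n" + "\n".join(lines) + "\n}"
```

Both emitters read the same dict, so regenerating the bindings after an edit keeps the two languages in lockstep.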
This enables splintr to tokenize text for the full Mistral model family
with identical behavior to official Mistral tokenizers, while maintaining
the 10-12x performance advantage over tiktoken.
Update all documentation to include Mistral V1/V2/V3 tokenizer support and usage examples. Updates:
- README.md: Add Mistral to the supported vocabularies table with model coverage and token counts; update quick-start examples; add to the compatibility list
- docs/api_guide.md: Add from_pretrained() examples for all three Mistral versions with model family annotations
- docs/special_tokens.md: Add a comprehensive Mistral section covering V1/V2/V3 differences, SentencePiece vs Tiktoken encoding, control tokens, and agent token ID ranges; update all agent token tables with a mistral_v1 ID column
- python/splintr/__init__.py: Update the module docstring with Mistral examples; document the MISTRAL_V1/V2/V3_AGENT_TOKENS exports; add usage examples showing token ID access across all models
This completes the user-facing documentation for Mistral tokenizer support. Users can now discover and use Mistral tokenizers through the same interface as existing models.
Relocates SENTENCEPIECE_PATTERN from pretrained.rs to tokenizer.rs for better organization and consistency. All regex pattern constants (CL100K_BASE_PATTERN, O200K_BASE_PATTERN, LLAMA3_PATTERN, SENTENCEPIECE_PATTERN) now reside in the same location. Changes:
- Add SENTENCEPIECE_PATTERN to src/core/tokenizer.rs with detailed documentation explaining its purpose and how it differs from GPT-style patterns
- Export the pattern from src/core/mod.rs
- Update imports in src/core/pretrained.rs and src/python/bindings.rs
This improves code organization by consolidating all tokenization patterns in a single module, making them easier to find and maintain.
Adds conditional compilation support for the python feature, allowing the library to build without Python bindings when the feature is disabled. Changes:
- Add #[cfg(feature = "python")] to the pymodule attribute and pyo3 imports in src/lib.rs
- Remove the duplicate pymodule definition from src/python/mod.rs (already defined in src/lib.rs)
- Ensure maturin builds always enable the python feature via pyproject.toml
This fixes build failures when building the library without Python support (e.g., cargo build --no-default-features) while maintaining full compatibility when building with maturin, which automatically enables the python feature.
Updates project URLs in pyproject.toml to reflect the ml-rust GitHub organization, ensuring correct links for homepage, repository, and documentation references.
Mistral V3 (Tekken) was incorrectly using O200K_BASE_PATTERN. The actual
pattern from HuggingFace's Mistral NeMo tokenizer differs:
- No English contraction handling ('s, 't, 're, etc.)
- Single-digit numbers \p{N} instead of \p{N}{1,3}
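The digit-handling difference is easy to see with a stand-alone regex sketch, using the stdlib re module and \d as an ASCII stand-in for \p{N}:

```python
import re

text = "1234"

# o200k-style: digits are grouped into runs of up to three.
grouped = re.findall(r"\d{1,3}", text)   # ['123', '4']

# Tekken-style: every digit is its own match.
single = re.findall(r"\d", text)         # ['1', '2', '3', '4']
```

Because pre-tokenization splits determine which byte spans reach BPE merging, this difference changes the resulting token IDs, which is why reusing O200K_BASE_PATTERN produced wrong output.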
Changes:
- Add MISTRAL_V3_PATTERN constant matching HuggingFace implementation
- Switch from from_bytes() to from_bytes_byte_level() for proper encoding
- Add native special tokens (<unk>, <s>, </s>) to mistral_v3_special_tokens()
- Update pattern() function to return MISTRAL_V3_PATTERN for MistralV3 variant
- Export MISTRAL_V3_PATTERN from core module
- Update README to document MISTRAL_V3_PATTERN usage
This ensures encoding matches HuggingFace Mistral NeMo tokenizer exactly.
Add integration and unit tests for the Mistral V3/Tekken tokenizer:
- Vocabulary size verification (131,126 tokens)
- Native special tokens (BOS/EOS/UNK)
- Agent tokens for conversation, thinking, ReAct, functions, code, RAG
- Encoding/decoding roundtrip tests
- Comparison with V1/V2 tokenizers
- Batch encoding and edge cases
Tests verify correct pattern usage and ByteLevel encoding behavior.
Update Python package version to 0.8.0 to match Cargo.toml. Enhance scripts/update_version.sh to update __version__ in python/splintr/__init__.py automatically alongside pyproject.toml, ensuring version consistency across Python distribution files.
Replace hardcoded version numbers with wildcard placeholder to prevent documentation from becoming stale on version updates.
Summary
Add support for all three Mistral tokenizer generations used across the Mistral model family.
Mistral V1: 32k SentencePiece vocabulary (Mistral 7B v0.1/v0.2, Mixtral 8x7B)
Mistral V2: 32.7k SentencePiece with control tokens (Mistral 7B v0.3, Codestral, Mixtral 8x22B)
Mistral V3: 131k Tekken vocabulary (Mistral NeMo, Large 2, Pixtral)
Implementation
- Moved SENTENCEPIECE_PATTERN to the tokenizer module
- Gated Python bindings behind #[cfg(feature = "python")]
Testing
All 292 tests passing (50 Rust + 242 Python). Backward compatible API. Maturin builds successful.