Add comprehensive Mistral tokenizer support (v0.8.0) #11
Merged
Conversation
Add fine-grained control over JIT compilation and regex backend selection:
- New jit(bool) method to enable/disable JIT compilation for both backends
- Fix pcre2(false) to properly switch back to the regexr backend
- Track backend state with use_jit and use_pcre2 fields
- Preserve the JIT preference when cloning tokenizers
- Add comprehensive tests for backend switching and JIT control
Both the regexr and PCRE2 backends now respect JIT preferences, allowing users to disable JIT for debugging or platform compatibility.
Update the regexr dependency from v0.1.0-beta.4 to v0.1.0-beta.5, which fixes a critical bug where regex match positions could fall inside multi-byte UTF-8 characters (em-dashes, curly quotes), causing panics when tokenizing with o200k_base and cl100k_base. Add comprehensive regression tests for both vocabularies covering:
- Em-dashes and curly quotes at various positions
- Batch encoding with multi-byte characters
- Parallel batch processing (700+ texts)
- Backend consistency between regexr and PCRE2
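The failure mode behind this fix can be illustrated in a few lines: a match position computed in bytes can land in the middle of a multi-byte character such as an em-dash, and splitting there yields invalid UTF-8. This is a stand-alone sketch of the problem, not splintr code:

```python
text = "hello—world"           # the em-dash (U+2014) is 3 bytes in UTF-8
data = text.encode("utf-8")    # b"hello\xe2\x80\x94world"

# Byte offset 6 falls inside the em-dash (bytes 5..8), so it is not a
# valid split point: decoding the prefix fails.
bad_split = 6
try:
    data[:bad_split].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
```

A correct regex backend must only report match boundaries on character boundaries (here, byte offsets 0, 1, ..., 5, 8, ...), which is what the beta.5 fix guarantees.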
Change the repository and homepage URLs from farhan-syah/splintr to ml-rust/splintr to reflect organizational ownership. Updated in:
- Cargo.toml metadata
- README.md clone instructions and citation
- docs/api_guide.md footer links
- docs/special_tokens.md (no URL changes, but staged with related docs)
Implement complete support for Mistral AI tokenizers across three
vocabulary generations, covering all Mistral model families from 7B
to Large 2.
Core Implementation:
- Add SentencePiece mode to Tokenizer for ▁ (U+2581) word boundary handling
- Implement new_sentencepiece() and from_bytes_sentencepiece() constructors
- Add with_full_options() for complete control over tokenizer configuration
- Handle space-to-▁ conversion during encoding and reverse during decoding
- Support duplicate token handling in V2 via explicit decoder
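The space-to-▁ convention above can be sketched in isolation. This is a deliberately simplified stand-in for the behavior described, not the actual splintr implementation (real SentencePiece tokenizers typically also prefix the input with ▁):

```python
WORD_BOUNDARY = "\u2581"  # ▁, SentencePiece's word-boundary marker

def to_sentencepiece(text: str) -> str:
    """Replace spaces with ▁ before matching against the vocabulary."""
    return text.replace(" ", WORD_BOUNDARY)

def from_sentencepiece(pieces: str) -> str:
    """Reverse the conversion when decoding token pieces back to text."""
    return pieces.replace(WORD_BOUNDARY, " ")

roundtrip = from_sentencepiece(to_sentencepiece("hello world"))
```

Because the conversion is applied symmetrically on encode and decode, round-trips are lossless for ordinary text.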
Vocabulary Support:
- mistral_v1 (~32k tokens): Mistral 7B v0.1/v0.2, Mixtral 8x7B
- mistral_v2 (~32,768 tokens): Mistral 7B v0.3, Codestral, Mixtral 8x22B
- mistral_v3 (~131k tokens): Mistral NeMo, Large 2, Pixtral (Tekken)
- All include 54 agent tokens for chat, reasoning, and tool use
Special Token Handling:
- V1: Basic SentencePiece tokens (<unk>, <s>, </s>)
- V2: V1 + control tokens ([INST], [/INST], [TOOL_CALLS], etc.)
- V3: Same control tokens as V2, but Tiktoken-based (not SentencePiece)
Python Bindings:
- Add from_pretrained("mistral_v1"/"mistral_v2"/"mistral_v3")
- Export MISTRAL_V1/V2/V3_AGENT_TOKENS for programmatic access
- Full parity with cl100k_base and llama3 tokenizer APIs
Testing:
- Add comprehensive Python tests for all three Mistral versions
- Add Rust integration tests for V2 edge cases
- Verify encoding/decoding round-trips for all vocabularies
- Test special token handling and agent token IDs
- All 242 tests passing (50 unit + 192 integration tests)
Tooling:
- Add scripts/generate_agent_tokens.py for auto-generating bindings
- Add vocabulary extraction scripts for all Mistral versions
- Eliminate manual synchronization between Rust and Python agent tokens
- Single source of truth in generate_agent_tokens.py
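A single-source-of-truth generator of the kind described might look like the following. This is a hypothetical sketch (token names and IDs are illustrative); the real generate_agent_tokens.py may be structured differently:

```python
# One table drives both bindings, so Rust and Python can never drift apart.
# Token names and IDs below are made up for illustration.
AGENT_TOKENS = {
    "<|im_start|>": 32000,
    "<|im_end|>": 32001,
}

def emit_rust(tokens: dict) -> str:
    """Render the table as a Rust constant."""
    lines = [f'    ("{name}", {tid}),' for name, tid in tokens.items()]
    return "pub const AGENT_TOKENS: &[(&str, u32)] = &[\n" + "\n".join(lines) + "\n];"

def emit_python(tokens: dict) -> str:
    """Render the same table as a Python dict literal."""
    lines = [f'    "{name}": {tid},' for name, tid in tokens.items()]
    return "AGENT_TOKENS = {\n" + "\n".join(lines) + "\n}"
```

Both emitters read the same dict, so regenerating the bindings after an edit keeps the two languages in lockstep.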
This enables splintr to tokenize text for the full Mistral model family
with identical behavior to official Mistral tokenizers, while maintaining
the 10-12x performance advantage over tiktoken.
Update all documentation to include Mistral V1/V2/V3 tokenizer support and usage examples. Updates:
- README.md: Add Mistral to the supported vocabularies table with model coverage and token counts; update quick-start examples; add to the compatibility list
- docs/api_guide.md: Add from_pretrained() examples for all three Mistral versions with model family annotations
- docs/special_tokens.md: Add a comprehensive Mistral section covering V1/V2/V3 differences, SentencePiece vs Tiktoken encoding, control tokens, and agent token ID ranges; update all agent token tables with a mistral_v1 ID column
- python/splintr/__init__.py: Update the module docstring with Mistral examples; document the MISTRAL_V1/V2/V3_AGENT_TOKENS exports; add usage examples showing token ID access across all models
This completes the user-facing documentation for Mistral tokenizer support. Users can now discover and use Mistral tokenizers through the same interface as existing models.
Relocates SENTENCEPIECE_PATTERN from pretrained.rs to tokenizer.rs for better organization and consistency. All regex pattern constants (CL100K_BASE_PATTERN, O200K_BASE_PATTERN, LLAMA3_PATTERN, SENTENCEPIECE_PATTERN) now reside in the same location. Changes:
- Add SENTENCEPIECE_PATTERN to src/core/tokenizer.rs with detailed documentation explaining its purpose and how it differs from GPT-style patterns
- Export the pattern from src/core/mod.rs
- Update imports in src/core/pretrained.rs and src/python/bindings.rs
This improves code organization by consolidating all tokenization patterns in a single module, making them easier to find and maintain.
Adds conditional compilation support for the python feature, allowing the library to build without Python bindings when the feature is disabled. Changes:
- Add #[cfg(feature = "python")] to the pymodule attribute and pyo3 imports in src/lib.rs
- Remove the duplicate pymodule definition from src/python/mod.rs (already defined in src/lib.rs)
- Ensure maturin builds always enable the python feature via pyproject.toml
This fixes build failures when building the library without Python support (e.g., cargo build --no-default-features) while maintaining full compatibility when building with maturin, which automatically enables the python feature.
Updates project URLs in pyproject.toml to reflect the ml-rust GitHub organization, ensuring correct links for homepage, repository, and documentation references.
Mistral V3 (Tekken) was incorrectly using O200K_BASE_PATTERN. The actual
pattern from HuggingFace's Mistral NeMo tokenizer differs:
- No English contraction handling ('s, 't, 're, etc.)
- Single-digit numbers \p{N} instead of \p{N}{1,3}
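The digit-handling difference is easy to see with a stand-alone regex sketch, using the stdlib re module and \d as an ASCII stand-in for \p{N}:

```python
import re

text = "1234"

# o200k-style: digits are grouped into runs of up to three.
grouped = re.findall(r"\d{1,3}", text)   # ['123', '4']

# Tekken-style: every digit is its own match.
single = re.findall(r"\d", text)         # ['1', '2', '3', '4']
```

Because pre-tokenization splits determine which byte spans reach BPE merging, this difference changes the resulting token IDs, which is why reusing O200K_BASE_PATTERN produced wrong output.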
Changes:
- Add MISTRAL_V3_PATTERN constant matching HuggingFace implementation
- Switch from from_bytes() to from_bytes_byte_level() for proper encoding
- Add native special tokens (<unk>, <s>, </s>) to mistral_v3_special_tokens()
- Update pattern() function to return MISTRAL_V3_PATTERN for MistralV3 variant
- Export MISTRAL_V3_PATTERN from core module
- Update README to document MISTRAL_V3_PATTERN usage
This ensures encoding matches HuggingFace Mistral NeMo tokenizer exactly.
Add integration and unit tests for the Mistral V3/Tekken tokenizer:
- Vocabulary size verification (131,126 tokens)
- Native special tokens (BOS/EOS/UNK)
- Agent tokens for conversation, thinking, ReAct, functions, code, RAG
- Encoding/decoding roundtrip tests
- Comparison with V1/V2 tokenizers
- Batch encoding and edge cases
Tests verify correct pattern usage and ByteLevel encoding behavior.
Update Python package version to 0.8.0 to match Cargo.toml. Enhance scripts/update_version.sh to update __version__ in python/splintr/__init__.py automatically alongside pyproject.toml, ensuring version consistency across Python distribution files.
Replace hardcoded version numbers with wildcard placeholder to prevent documentation from becoming stale on version updates.
Summary
Add support for all three Mistral tokenizer generations used across the Mistral model family.
Mistral V1: 32k SentencePiece vocabulary (Mistral 7B v0.1/v0.2, Mixtral 8x7B)
Mistral V2: 32.7k SentencePiece with control tokens (Mistral 7B v0.3, Codestral, Mixtral 8x22B)
Mistral V3: 131k Tekken vocabulary (Mistral NeMo, Large 2, Pixtral)
Implementation
- Moved SENTENCEPIECE_PATTERN to the tokenizer module
- Gated Python bindings behind #[cfg(feature = "python")]
Testing
All 292 tests passing (50 Rust + 242 Python). Backward compatible API. Maturin builds successful.