
@farhan-syah farhan-syah commented Dec 24, 2025

Summary

Add support for all three Mistral tokenizer generations used across the Mistral model family.

Mistral V1: 32k-token SentencePiece vocabulary (Mistral 7B v0.1/v0.2, Mixtral 8x7B)
Mistral V2: 32,768-token SentencePiece vocabulary with control tokens (Mistral 7B v0.3, Codestral, Mixtral 8x22B)
Mistral V3: 131k-token Tekken vocabulary (Mistral NeMo, Large 2, Pixtral)

Implementation

  • Added SentencePiece mode with ▁ (U+2581) word boundary handling
  • Created agent token auto-generation script to eliminate manual sync between Rust/Python
  • Reorganized pattern constants (moved SENTENCEPIECE_PATTERN to tokenizer module)
  • Added Python feature gating with #[cfg(feature = "python")]
  • Extended pretrained vocabulary system for all three Mistral variants

Testing

All 292 tests pass (50 Rust + 242 Python). The API remains backward compatible, and Maturin builds succeed.

Add fine-grained control over JIT compilation and regex backend selection:
- New jit(bool) method to enable/disable JIT compilation for both backends
- Fix pcre2(false) to properly switch back to regexr backend
- Track backend state with use_jit and use_pcre2 fields
- Preserve JIT preference when cloning tokenizers
- Add comprehensive tests for backend switching and JIT control

Both regexr and PCRE2 backends now respect JIT preferences, allowing
users to disable JIT for debugging or platform compatibility.
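The backend/JIT state tracking described above can be modeled in a few lines. This is a hypothetical pure-Python sketch of the behavior the commit describes (the real implementation is Rust; the method names mirror the commit text, not a verified splintr API):

```python
# Hypothetical model of the backend-selection state machine described above:
# use_pcre2 / use_jit fields, pcre2(False) falling back to regexr, and the
# JIT preference surviving a clone. Not splintr's actual code.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BackendConfig:
    use_pcre2: bool = False  # False -> regexr backend
    use_jit: bool = True     # JIT preference, honored by both backends

    def pcre2(self, enabled: bool) -> "BackendConfig":
        # pcre2(False) must switch back to regexr, not leave stale state.
        return replace(self, use_pcre2=enabled)

    def jit(self, enabled: bool) -> "BackendConfig":
        return replace(self, use_jit=enabled)

    @property
    def backend(self) -> str:
        return "pcre2" if self.use_pcre2 else "regexr"

cfg = BackendConfig().pcre2(True).jit(False)
assert cfg.backend == "pcre2" and not cfg.use_jit
# Cloning (here: dataclasses.replace with no changes) keeps the JIT choice.
assert replace(cfg).use_jit is False
assert cfg.pcre2(False).backend == "regexr"
```

Modeling the toggles as explicit fields is what makes the round trip (`pcre2(True)` then `pcre2(False)`) well defined rather than implicit in which backend object happens to exist.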

Update regexr dependency from v0.1.0-beta.4 to v0.1.0-beta.5, which
fixes a critical bug where regex match positions could fall inside
multi-byte UTF-8 characters (em-dashes, curly quotes), causing panics
when tokenizing with o200k_base and cl100k_base.

Add comprehensive regression tests for both vocabularies covering:
- Em-dashes and curly quotes at various positions
- Batch encoding with multi-byte characters
- Parallel batch processing (700+ texts)
- Backend consistency between regexr and PCRE2
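A minimal self-contained illustration of the underlying failure mode (not one of the regression tests themselves): a byte offset that lands inside a multi-byte character does not delimit valid UTF-8, which is the condition that previously caused panics.

```python
# An em-dash occupies three bytes in UTF-8, so a match position expressed in
# bytes can fall mid-character. Slicing there yields invalid UTF-8.
text = "pre\u2014post"          # em-dash between two ASCII runs
data = text.encode("utf-8")

assert len(text) == 8            # 8 characters...
assert len(data) == 10           # ...but 10 bytes: the em-dash alone is 3

# Byte offset 4 is inside the em-dash's 3-byte sequence.
try:
    data[:4].decode("utf-8")
    mid_char_slice_ok = True
except UnicodeDecodeError:
    mid_char_slice_ok = False
assert not mid_char_slice_ok
```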

Change repository and homepage URLs from farhan-syah/splintr to
ml-rust/splintr to reflect organizational ownership.

Updated in:
- Cargo.toml metadata
- README.md clone instructions and citation
- docs/api_guide.md footer links
- docs/special_tokens.md (no URL changes, but staged with related docs)

Implement complete support for Mistral AI tokenizers across three
vocabulary generations, covering all Mistral model families from 7B
to Large 2.

Core Implementation:
- Add SentencePiece mode to Tokenizer for ▁ (U+2581) word boundary handling
- Implement new_sentencepiece() and from_bytes_sentencepiece() constructors
- Add with_full_options() for complete control over tokenizer configuration
- Handle space-to-▁ conversion during encoding and reverse during decoding
- Support duplicate token handling in V2 via explicit decoder
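The space handling in the bullets above can be sketched in isolation. This is a minimal model of the SentencePiece word-boundary convention, not splintr's Rust implementation:

```python
# SentencePiece marks word boundaries with the metasymbol U+2581 so that
# spaces survive BPE as ordinary vocabulary symbols; decoding reverses it.
SP_UNDERLINE = "\u2581"  # ▁

def to_sentencepiece(text: str) -> str:
    # Prepend the boundary marker and replace every space with ▁.
    return SP_UNDERLINE + text.replace(" ", SP_UNDERLINE)

def from_sentencepiece(pieces: str) -> str:
    # Reverse mapping: ▁ back to space, then drop the leading marker.
    return pieces.replace(SP_UNDERLINE, " ").lstrip(" ")

encoded = to_sentencepiece("hello world")
assert encoded == "\u2581hello\u2581world"
assert from_sentencepiece(encoded) == "hello world"
```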

Vocabulary Support:
- mistral_v1 (~32k tokens): Mistral 7B v0.1/v0.2, Mixtral 8x7B
- mistral_v2 (32,768 tokens): Mistral 7B v0.3, Codestral, Mixtral 8x22B
- mistral_v3 (~131k tokens): Mistral NeMo, Large 2, Pixtral (Tekken)
- All include 54 agent tokens for chat, reasoning, and tool use

Special Token Handling:
- V1: Basic SentencePiece tokens (<unk>, <s>, </s>)
- V2: V1 + control tokens ([INST], [/INST], [TOOL_CALLS], etc.)
- V3: Same control tokens as V2, but Tiktoken-based (not SentencePiece)

Python Bindings:
- Add from_pretrained("mistral_v1"/"mistral_v2"/"mistral_v3")
- Export MISTRAL_V1/V2/V3_AGENT_TOKENS for programmatic access
- Full parity with cl100k_base and llama3 tokenizer APIs
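As an illustration of the parity goal, a toy registry shows Mistral names resolving through the same lookup path as the existing vocabularies. This is not splintr's actual code; the factories return placeholder strings:

```python
# Toy from_pretrained-style registry: one lookup path for every vocabulary,
# so adding mistral_v1/v2/v3 means registering names, not new entry points.
PRETRAINED = {}

def register(name, factory):
    PRETRAINED[name] = factory

def from_pretrained(name):
    try:
        return PRETRAINED[name]()
    except KeyError:
        raise ValueError(f"unknown pretrained vocabulary: {name!r}") from None

for vocab in ("cl100k_base", "llama3", "mistral_v1", "mistral_v2", "mistral_v3"):
    register(vocab, lambda v=vocab: f"tokenizer:{v}")

assert from_pretrained("mistral_v3") == "tokenizer:mistral_v3"
assert from_pretrained("cl100k_base") == "tokenizer:cl100k_base"
```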

Testing:
- Add comprehensive Python tests for all three Mistral versions
- Add Rust integration tests for V2 edge cases
- Verify encoding/decoding round-trips for all vocabularies
- Test special token handling and agent token IDs
- All 242 tests passing (50 unit + 192 integration tests)

Tooling:
- Add scripts/generate_agent_tokens.py for auto-generating bindings
- Add vocabulary extraction scripts for all Mistral versions
- Eliminate manual synchronization between Rust and Python agent tokens
- Single source of truth in generate_agent_tokens.py
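The single-source-of-truth idea can be sketched like this. The token names and IDs below are placeholders (not the real 54-token table), and the emitted Rust/Python shapes are assumptions about the generator's output, not its actual templates:

```python
# Sketch of the generator idea: one table of agent tokens, two renderers.
# Because both outputs derive from the same dict, Rust and Python can never
# drift apart. Names and IDs here are illustrative placeholders only.
AGENT_TOKENS = {"<|example_start|>": 32768, "<|example_end|>": 32769}

def emit_rust(tokens: dict) -> str:
    lines = [f'    ("{name}", {tid}),' for name, tid in tokens.items()]
    return "pub const AGENT_TOKENS: &[(&str, u32)] = &[\n" + "\n".join(lines) + "\n];"

def emit_python(tokens: dict) -> str:
    lines = [f'    "{name}": {tid},' for name, tid in tokens.items()]
    return "AGENT_TOKENS = {\n" + "\n".join(lines) + "\n}"

rust_src = emit_rust(AGENT_TOKENS)
py_src = emit_python(AGENT_TOKENS)
assert '("<|example_start|>", 32768),' in rust_src
assert '"<|example_end|>": 32769,' in py_src
```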

This enables splintr to tokenize for the full Mistral model family
with identical behavior to official Mistral tokenizers, while maintaining
the 10-12x performance advantage over tiktoken.

Update all documentation to include Mistral V1/V2/V3 tokenizer support
and usage examples.

Updates:
- README.md: Add Mistral to supported vocabularies table with model
  coverage and token counts; update quick start examples; add to
  compatibility list
- docs/api_guide.md: Add from_pretrained() examples for all three
  Mistral versions with model family annotations
- docs/special_tokens.md: Add comprehensive Mistral section covering
  V1/V2/V3 differences, SentencePiece vs Tiktoken encoding, control
  tokens, and agent token ID ranges; update all agent token tables
  with mistral_v1 ID column
- python/splintr/__init__.py: Update module docstring with Mistral
  examples; document MISTRAL_V1/V2/V3_AGENT_TOKENS exports; add usage
  examples showing token ID access across all models

This completes the user-facing documentation for Mistral tokenizer
support. Users can now discover and use Mistral tokenizers through
the same interface as existing models.

Relocates SENTENCEPIECE_PATTERN from pretrained.rs to tokenizer.rs for better
organization and consistency. All regex pattern constants (CL100K_BASE_PATTERN,
O200K_BASE_PATTERN, LLAMA3_PATTERN, SENTENCEPIECE_PATTERN) now reside in the
same location.

Changes:
- Add SENTENCEPIECE_PATTERN to src/core/tokenizer.rs with detailed documentation
  explaining its purpose and how it differs from GPT-style patterns
- Export pattern from src/core/mod.rs
- Update imports in src/core/pretrained.rs and src/python/bindings.rs

This improves code organization by consolidating all tokenization patterns
in a single module, making them easier to find and maintain.

Adds conditional compilation support for the python feature, allowing the
library to build without Python bindings when the feature is disabled.

Changes:
- Add #[cfg(feature = "python")] to pymodule attribute and pyo3 imports in src/lib.rs
- Remove duplicate pymodule definition from src/python/mod.rs (already defined in src/lib.rs)
- Ensure maturin builds always enable python feature via pyproject.toml

This fixes build failures when building the library without Python support
(e.g., cargo build --no-default-features) while maintaining full compatibility
when building with maturin, which automatically enables the python feature.

Updates project URLs in pyproject.toml to reflect the ml-rust GitHub
organization, ensuring correct links for homepage, repository, and
documentation references.

Mistral V3 (Tekken) was incorrectly using O200K_BASE_PATTERN. The actual
pattern from HuggingFace's Mistral NeMo tokenizer differs:
- No English contraction handling ('s, 't, 're, etc.)
- Digits matched one at a time with \p{N} rather than in runs of up to three with \p{N}{1,3}
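The digit-handling difference can be demonstrated with the stdlib `re` module, using `\d` as a simplified stand-in for `\p{N}` (the real patterns contain much more than the number rule):

```python
# Simplified illustration of the number-handling difference: the o200k-style
# rule groups up to three digits per token, Tekken takes one digit at a time.
import re

o200k_style = re.compile(r"\d{1,3}")
tekken_style = re.compile(r"\d")

assert o200k_style.findall("2025") == ["202", "5"]
assert tekken_style.findall("2025") == ["2", "0", "2", "5"]
```

A vocabulary trained against one splitting rule produces different token sequences under the other, which is why using O200K_BASE_PATTERN for Tekken diverged from the reference tokenizer.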

Changes:
- Add MISTRAL_V3_PATTERN constant matching HuggingFace implementation
- Switch from from_bytes() to from_bytes_byte_level() for proper encoding
- Add native special tokens (<unk>, <s>, </s>) to mistral_v3_special_tokens()
- Update pattern() function to return MISTRAL_V3_PATTERN for MistralV3 variant
- Export MISTRAL_V3_PATTERN from core module
- Update README to document MISTRAL_V3_PATTERN usage

This ensures encoding matches HuggingFace Mistral NeMo tokenizer exactly.
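splintr's internals are not shown here, but "byte-level" encoding conventionally refers to the GPT-2-style byte-to-unicode table; the sketch below is an assumption about what a `from_bytes_byte_level()`-style constructor implies, not splintr's code. Every byte maps to a printable code point, so arbitrary byte sequences round-trip losslessly through the merge table:

```python
# GPT-2-style byte-level mapping: give all 256 byte values printable unicode
# stand-ins (printable bytes keep their own character, the rest are remapped
# to code points >= 256), so any byte string becomes a plain unicode string.
def bytes_to_unicode() -> dict:
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100)))
    chars = printable[:]
    n = 0
    for b in range(256):
        if b not in printable:
            printable.append(b)
            chars.append(256 + n)
            n += 1
    return dict(zip(printable, (chr(c) for c in chars)))

table = bytes_to_unicode()
inverse = {v: k for k, v in table.items()}

# Multi-byte UTF-8 input round-trips byte by byte.
encoded = "".join(table[b] for b in "café".encode("utf-8"))
assert bytes(inverse[c] for c in encoded).decode("utf-8") == "café"
```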

Add integration and unit tests for Mistral V3/Tekken tokenizer:
- Vocabulary size verification (131,126 tokens)
- Native special tokens (BOS/EOS/UNK)
- Agent tokens for conversation, thinking, ReAct, functions, code, RAG
- Encoding/decoding roundtrip tests
- Comparison with V1/V2 tokenizers
- Batch encoding and edge cases

Tests verify correct pattern usage and ByteLevel encoding behavior.

Update Python package version to 0.8.0 to match Cargo.toml.

Enhance scripts/update_version.sh to update __version__ in
python/splintr/__init__.py automatically alongside pyproject.toml,
ensuring version consistency across Python distribution files.

Replace hardcoded version numbers with a wildcard placeholder to prevent
documentation from becoming stale on version updates.

@farhan-syah farhan-syah merged commit 95af0ef into main Dec 24, 2025
5 checks passed
@farhan-syah farhan-syah deleted the 0.8.0 branch December 24, 2025 06:22
@farhan-syah farhan-syah restored the 0.8.0 branch December 24, 2025 06:43